readFastq             package:ShortRead             R Documentation

_R_e_a_d _F_A_S_T_Q-_f_o_r_m_a_t_t_e_d _f_i_l_e_s _i_n_t_o _c_o_m_p_a_c_t _R _r_e_p_r_e_s_e_n_t_a_t_i_o_n_s

_D_e_s_c_r_i_p_t_i_o_n:

     'readFastq' reads all FASTQ-formatted files in a directory
     'dirPath' whose file names match the pattern 'pattern', returning
     a compact internal representation of the sequences and quality
     scores in the files. Methods read all matching files into a single
     R object; a typical use is to restrict input to a single FASTQ
     file.

_U_s_a_g_e:

     readFastq(dirPath, pattern=character(0), ...)

_A_r_g_u_m_e_n_t_s:

 dirPath: A character vector (or other object; see methods defined on
          this generic) giving the directory path (relative or
          absolute) of FASTQ files to be read.

 pattern: The ('grep'-style) pattern describing file names to be read.
          The default ('character(0)') results in (attempted) input of
          all files in the directory.

     ...: Additional arguments, perhaps used by methods.
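
     As an illustration of 'grep'-style matching of file names, the
     following base-R sketch uses 'list.files' on a temporary
     directory. The directory and file names are hypothetical, and
     whether 'readFastq' selects files in exactly this way is an
     assumption. Note that a pattern is a regular expression, so an
     unescaped '.' matches any character.

```r
## Hypothetical directory and file names, created only for illustration.
d <- tempfile("fastq_demo")
dir.create(d)
invisible(file.create(file.path(d, c("s_1_sequence.txt",
                                     "s_2_sequence.txt",
                                     "notes.md"))))

## A pattern is a regular expression, not a shell glob.
one <- list.files(d, pattern = "s_1_sequence.txt")

## Anchored pattern with the '.' escaped, matching every lane file.
all_seq <- list.files(d, pattern = "^s_[0-9]+_sequence\\.txt$")

print(one)
print(all_seq)
```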

_D_e_t_a_i_l_s:

     The fastq format is not precisely defined. The basic definition
     used here parses the following four lines as a single record:


         @HWI-EAS88_1_1_1_1001_499
         GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT
         +HWI-EAS88_1_1_1_1001_499
         ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS

     The first and third lines are identifiers, preceded by '@' and '+'
     respectively (the identifiers are identical, in the case of
     Solexa). The second line is an upper-case sequence of nucleotides.
     The parser recognizes the IUPAC-standard alphabet (hence ambiguous
     nucleotides), coercing '.' to '-' to represent missing values. The
     final line is an ASCII-encoded representation of quality scores,
     with one ASCII character per nucleotide.
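
     The record layout can be sketched in base R as follows. The helper
     'parseRecord' is hypothetical and is not part of ShortRead; it
     only illustrates the four-line structure and the '.' to '-'
     coercion, and is not the parser used by 'readFastq'.

```r
## The four-line record shown above, as a character vector.
rec <- c("@HWI-EAS88_1_1_1_1001_499",
         "GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT",
         "+HWI-EAS88_1_1_1_1001_499",
         "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS")

## Hypothetical helper (an assumption for illustration, not ShortRead API).
parseRecord <- function(x) {
    ## A record is four lines; lines 1 and 3 start with '@' and '+'.
    stopifnot(length(x) == 4L,
              startsWith(x[1], "@"),
              startsWith(x[3], "+"))
    list(id = substring(x[1], 2L),                     # drop leading '@'
         sread = gsub(".", "-", x[2], fixed = TRUE),   # '.' means missing
         quality = x[4])                               # one char per base
}

r <- parseRecord(rec)
str(r)
```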

     The encoding implicit in Solexa-derived fastq files is that each
     character code corresponds to a score equal to the ASCII character
     value minus 64 (e.g., ASCII '@' is decimal 64, and corresponds to
     a Solexa quality score of 0). This is different from BioPerl, for
     instance, which recovers quality scores by subtracting 33 from the
     ASCII character value (so that, for instance, '!', with decimal
     value 33, encodes value 0).
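
     The two conventions can be compared with a short base-R sketch
     that decodes the quality string from the record above. This is an
     illustration of the arithmetic only, not the decoding code used by
     the package; the variable names are arbitrary.

```r
## Quality string from the example record.
qual <- "]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS"

## ASCII code of each character.
codes <- utf8ToInt(qual)

## Solexa-derived convention: score = ASCII value - 64.
solexa <- codes - 64L

## BioPerl-style convention: score = ASCII value - 33.
offset33 <- codes - 33L

## The same characters yield scores that differ by 31 under the two schemes.
head(solexa)
head(offset33)
```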

     The BioPerl description of fastq asserts that the first character
     of line 4 is a '!', but the current parser does not support this
     convention.

_V_a_l_u_e:

     A single R object (e.g., 'ShortReadQ') containing the sequences
     and qualities from all files in 'dirPath' matching 'pattern'.
     There is no guarantee of the order in which files are read.

_A_u_t_h_o_r(_s):

     Martin Morgan

_S_e_e _A_l_s_o:

     The IUPAC alphabet in Biostrings.

     <URL: http://www.bioperl.org/wiki/FASTQ_sequence_format> for the
     BioPerl definition of fastq.

     Solexa documentation 'Data analysis - documentation : Pipeline
     output and visualisation'.

_E_x_a_m_p_l_e_s:

     showMethods("readFastq")

     sp <- SolexaPath(system.file('extdata', package='ShortRead'))
     rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
     sread(rfq)
     id(rfq)
     quality(rfq)

     ## SolexaPath method 'knows' where FASTQ files are placed
     rfq1 <- readFastq(sp, pattern="s_1_sequence.txt")
     rfq1

