readFASTA             package:Biostrings             R Documentation

_F_u_n_c_t_i_o_n_s _t_o _r_e_a_d/_w_r_i_t_e _F_A_S_T_A _f_o_r_m_a_t_t_e_d _f_i_l_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     FASTA is a simple file format for biological sequence data. A file
     may contain one or more sequences; each sequence is preceded by a
     description line that begins with a '>'.
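
     As an illustration of the layout described above, the following
     base R sketch writes a small FASTA file and counts its records.
     The record names and sequences are made up for the example, and
     no Biostrings function is used.

```r
# Two hypothetical records; names and sequences are invented.
fasta_lines <- c(
  ">seq1 first example record",
  "ACGTACGTAC",
  "GGTTAA",
  ">seq2 second example record",
  "TTGACA"
)

# Write the records to a temporary file, one line per element.
tmp <- tempfile(fileext = ".fa")
writeLines(fasta_lines, tmp)

# Description lines are the ones that begin with ">";
# counting them gives the number of sequences in the file.
is_desc <- startsWith(readLines(tmp), ">")
n_records <- sum(is_desc)
```

     Note that a single sequence may span several lines, so the
     number of records is given by the description lines, not by the
     number of sequence lines.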

_U_s_a_g_e:

       fasta.info(file, use.descs=TRUE)
       readFASTA(file, checkComments=TRUE, strip.descs=TRUE)
       writeFASTA(x, file="", append=FALSE, width=80)

_A_r_g_u_m_e_n_t_s:

    file: Either a character string naming a file or a connection. If
          '""' (the default for 'writeFASTA'), then the function writes
          to the standard output connection (the console) unless
          redirected by 'sink'. 

use.descs: 'TRUE' or 'FALSE'. Whether or not the description lines
          should be used to name the elements of the returned integer
          vector.

checkComments: Whether or not comment lines (lines beginning with a
          semicolon) should be found and removed. 

strip.descs: Whether or not the ">" marking the beginning of the
          description lines should be removed. Note that this argument
          is new in Biostrings >= 2.8. In previous versions 'readFASTA'
          kept the ">". 

       x: A list such as the one returned by 'readFASTA'. 

  append: 'TRUE' or 'FALSE'. If 'TRUE', output will be appended to
          'file'; otherwise, it will overwrite the contents of 'file'.
          See '?cat' for the details. 

   width: The maximum number of letters per line of sequence. 
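
     To make the effect of 'width' concrete, here is a minimal base R
     sketch of wrapping one sequence into lines of at most 'width'
     letters. The helper 'wrap_sequence' is hypothetical and is not
     the code 'writeFASTA' uses internally.

```r
# Hypothetical helper: split one sequence string into chunks of at
# most `width` letters, as a writer would before printing each line.
wrap_sequence <- function(s, width = 80) {
  n <- nchar(s)
  if (n == 0) return("")
  starts <- seq(1, n, by = width)
  substring(s, starts, pmin(starts + width - 1, n))
}

# A 100-letter sequence wrapped at 40 letters per line.
long_seq <- paste(rep("ACGT", 25), collapse = "")
wrapped <- wrap_sequence(long_seq, width = 40)
```

     Pasting the chunks back together recovers the original sequence,
     so wrapping changes only the layout of the file.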

_D_e_t_a_i_l_s:

     FASTA is a widely used format in biology with a relatively simple
     markup. I am not aware of a formal standard for it. It would be
     useful to check that the parsed data are sequences of an
     appropriate type, but without a standard that does not seem
     possible.

     Many other packages provide similar, but not identical,
     capabilities. The reader in the seqinr package seems the most
     similar, but it splits each biological sequence into
     single-character strings, which is too inefficient for large
     problems.

_V_a_l_u_e:

     An integer vector (for 'fasta.info') or a list (for 'readFASTA')
     with one element for each sequence in the file. For 'readFASTA',
     each element has two parts: the description, and a character
     string holding the biological sequence.
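
     The shape of the list can be sketched in base R as follows. This
     is an illustrative parser, not the Biostrings implementation:
     the function 'parse_fasta_lines' and its argument names are
     hypothetical, and the element name 'seq' is an assumption (only
     'desc' appears in the examples on this page).

```r
# Hypothetical parser: turn the lines of a FASTA file into a list with
# one element per record, each holding a description and a sequence.
parse_fasta_lines <- function(lines, strip_descs = TRUE,
                              check_comments = TRUE) {
  # Optionally drop comment lines, i.e. lines starting with ";".
  if (check_comments)
    lines <- lines[!startsWith(lines, ";")]
  lines <- lines[nzchar(lines)]

  # Each ">" line opens a new record; cumsum() numbers the records.
  is_desc <- startsWith(lines, ">")
  record_id <- cumsum(is_desc)

  # Ignore anything that appears before the first description line.
  keep <- record_id > 0
  lines <- lines[keep]
  record_id <- record_id[keep]

  records <- lapply(split(seq_along(lines), record_id), function(idx) {
    desc <- lines[idx[1]]
    if (strip_descs) desc <- sub("^>", "", desc)
    # Join the remaining lines of the record into a single string.
    list(desc = desc, seq = paste(lines[idx[-1]], collapse = ""))
  })
  unname(records)
}

records <- parse_fasta_lines(c(
  "; a comment line",
  ">seq1 example",
  "ACGT",
  "TTAA",
  ">seq2 example",
  "GGCC"
))
```

     Keeping each sequence as one character string, rather than one
     string per letter, matches the efficiency concern raised in the
     Details section.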

_A_u_t_h_o_r(_s):

     R. Gentleman, H. Pages

_S_e_e _A_l_s_o:

     'read.BStringSet', 'read.DNAStringSet', 'read.RNAStringSet',
     'read.AAStringSet', 'write.XStringSet', 'read.table', 'scan',
     'write.table'

_E_x_a_m_p_l_e_s:

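       ## Locate an example FASTA file shipped with Biostrings
       f1 <- system.file("extdata", "someORF.fa", package="Biostrings")
       ## One integer per sequence, named by the description lines
       fasta.info(f1)
       ## Read the records, dropping the leading ">" of each description
       ff <- readFASTA(f1, strip.descs=TRUE)
       desc <- sapply(ff, function(x) x$desc)
       ## Keep the "reverse complement" sequences only
       ff2 <- ff[grep("reverse complement", desc, fixed=TRUE)]
       ## Write the selected records to a file in the session temp dir
       writeFASTA(ff2, file.path(tempdir(), "someORF2.fa"))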

