IntervalTree-class          package:IRanges          R Documentation

_I_n_t_e_r_v_a_l _S_e_a_r_c_h _T_r_e_e_s

_D_e_s_c_r_i_p_t_i_o_n:

     An 'IntervalTree' object is an external representation of ranges
     (i.e. it is derived from 'XRanges') that is optimized for overlap
     queries.

_D_e_t_a_i_l_s:

     A common type of query that arises when working with intervals is
     finding which intervals in one set overlap those in another. An
     efficient family of algorithms for answering such queries is known
     as the Interval Tree. This class makes use of the augmented tree
     algorithm from the reference below, but heavily adapts it for the
     use case of large, sorted query sets.

     The usual workflow is to create an 'IntervalTree' using the
     constructor described below and then perform overlap queries using
     the 'overlap' method. The results of the query are returned as a
     'RangesMatching' object.

_C_o_n_s_t_r_u_c_t_o_r:


      IntervalTree(ranges): Creates an 'IntervalTree' from the ranges
          in 'ranges', an object coercible to 'IntervalTree', such as
          an 'IRanges' object.


_F_i_n_d_i_n_g _O_v_e_r_l_a_p_s:

     This main purpose of the interval tree is to optimize the search
     for ranges overlapping those in a query set. The interface for
     this operation is the 'overlap' function.


      'overlap(object, query = object, maxgap = 0, multiple = TRUE)':

          Find the intervals in 'query', a 'Ranges' or 'integer' vector
          to be converted to length-one ranges, that overlap with the
          intervals 'object', an 'IntervalTree' or, for convenience, a
          'Ranges' coercible to a 'IntervalTree'. If 'query' is
          omitted, 'object' is queried against itself. if 'query' is
          unsorted, it is sorted first, so it is usually better to sort
          up-front, to avoid a sort with each 'overlap' call. Intervals
          with a separation of 'maxgap' or less are considered to be
          overlapping. 'maxgap' should be a scalar, non-negative,
          non-NA number. When 'multiple' (a scalar non-NA logical) is
          'TRUE', the results are returned as a 'RangesMatching'
          object.

          If 'multiple' is 'FALSE', at most one overlapping interval in
          'object' is returned for each interval in 'query'. The
          matchings are returned as an integer vector of length
          'length(query)', with 'NA' indicating intervals that did not
          overlap any intervals in 'object'. This is analogous to the
          return value of the 'match' function.

          'query' may also be a 'RangesList', in which case 'object'
          must also be a 'RangesList'. If both lists have names, each
          element from the subject is paired with the element from the
          query with the matching name, if any. Otherwise, elements are
          paired by position. The overlap is then computed between the
          pairs as described above. If 'multiple' is 'TRUE', a
          'RangesMatchingList' is returned, otherwise a list of integer
          vectors. Each element of the result corresponds to a space in
          'query'. For spaces that did not exist in 'object', the
          overlap is nil.

      'x %in% table': Shortcut for finding the ranges in 'x' that
          overlap any of the ranges in 'table'. Both 'x' and 'table'
          should be 'Ranges' objects. The result is a 'logical' vector
          of the same length as 'x'.


_C_o_e_r_c_i_o_n:


      'as(from, "IRanges")': Imports the ranges in 'from', an
          'IntervalTree', to an 'IRanges'.

      'as(from, "IntervalTree")': Constructs an 'IntervalTree'
          representing 'from', a 'Ranges' object that is coercible to
          'IRanges'.


_A_c_c_e_s_s_o_r_s:


      'length(x)': Gets the number of ranges stored in the tree. This
          is a fast operation that does not bring the ranges into R.


_N_o_t_e_s _o_n _T_i_m_e _C_o_m_p_l_e_x_i_t_y:

     The cost of constructing an instance of the interval tree is a
     'O(n*lg(n))', which makes it about as fast as other types of
     overlap query algorithms based on sorting. The good news is that
     the tree need only be built once per subject; this is useful in
     situations of frequent querying. Also, in this implementation the
     data is stored outside of R, avoiding needless copying. Of course,
     external storage is not always convenient, so it is possible to
     coerce the tree to an instance of 'IRanges' (see the Coercion
     section).

     For the query operation, the running time is based on the query
     size 'm' and the average number of hits per query 'k'. The output
     size is then 'max(mk,m)', but we abbreviate this as 'mk'. Note
     that when the 'multiple' parameter is set to 'FALSE', 'k' is fixed
     to 1 and drops out of this analysis. We also assume here that the
     query is sorted by start position (the 'overlap' function sorts
     the query if it is unsorted).

     An upper bound for finding overlaps is 'O(min(mk*lg(n),n+mk))'.
     The fastest interval tree algorithm known is bounded by
     'O(min(m*lg(n),n)+mk)' but is a lot more complicated and involves
     two auxillary trees. The lower bound is 'Omega(lg(n)+mk)', which
     is almost the same as for returning the answer, 'Omega(mk)'. The
     average is of course somewhere in between.

     This analysis informs the choice of which set of ranges to process
     into a tree, i.e. assigning one to be the subject and the other to
     be the query. Note that if 'm > n', then the running time is
     'O(m)', and the total operation of complexity 'O(n*lg(n) + m)' is
     better than if 'm' and 'n' were exchanged. Thus, for once-off
     operations, it is often most efficient to choose the smaller set
     to become the tree (but 'k' also affects this). This is reinforced
     by the realization that if 'mk' is about the same in either
     direction, the running time depends only on 'n', which should be
     minimized. Even in cases where a tree has already been constructed
     for one of the sets, it can be more efficient to build a new tree
     when the existing tree of size 'n' is much larger than the query
     set of size 'm', roughly when 'n > m*lg(n)'.

_A_u_t_h_o_r(_s):

     Michael Lawrence

_R_e_f_e_r_e_n_c_e_s:

     Interval tree algorithm from: Cormen, Thomas H.; Leiserson,
     Charles E.; Rivest, Ronald L.; Stein, Clifford. Introduction to
     Algorithms, second edition, MIT Press and McGraw-Hill. ISBN
     0-262-53196-8

_S_e_e _A_l_s_o:

     'XRanges', the parent of this class, 'RangesMatching', the result
     of an overlap query.

_E_x_a_m_p_l_e_s:

       query <- IRanges(c(1, 4, 9), c(5, 7, 10))
       subject <- IRanges(c(2, 2, 10), c(2, 3, 12))
       tree <- IntervalTree(subject)

       ## at most one hit per query
       overlap(tree, query, multiple = FALSE) # c(2, NA, 3)

       ## allow multiple hits
       overlap(tree, query)

       ## overlap as long as distance <= 1
       overlap(tree, query, maxgap = 1)

       ## shortcut
       overlap(subject, query)

       ## query and subject are easily interchangeable
       query <- IRanges(c(1, 4, 9), c(5, 7, 10))
       subject <- IRanges(c(2, 2), c(5, 4))
       tree <- IntervalTree(subject)
       t(overlap(tree, query))
       # the same as:
       overlap(query, subject)

       ## one Ranges with itself
       overlap(query)

       ## single points as query
       subject <- IRanges(c(1, 6, 13), c(4, 9, 14))
       overlap(subject, c(3L, 7L, 10L), multiple=FALSE)

