semsim               package:ontoTools               R Documentation

_C_o_m_p_u_t_e _s_e_m_a_n_t_i_c _s_i_m_i_l_a_r_i_t_y _m_e_a_s_u_r_e _f_o_r _t_e_r_m_s _i_n
_a_n _o_b_j_e_c_t-_o_n_t_o_l_o_g_y _c_o_m_p_l_e_x

_D_e_s_c_r_i_p_t_i_o_n:

     Compute a semantic similarity measure for terms in an
     object-ontology complex.

_U_s_a_g_e:

     semsim(c1, c2, ooc, acc=NULL, pc=NULL)
     conceptProbs(ooc, acc=NULL, inds=NULL)
     subsumers(c1, c2, ont, acc=NULL)
     pms(c1, c2, ooc, acc=NULL, pc=NULL)
     usageCount(map, acc, inds)

_A_r_g_u_m_e_n_t_s:

  c1, c2: "character" terms to be compared

     ooc: an object of class "OOC" (object-ontology complex)

     ont: an object of class "ontology" (annotated rooted DAG)

     acc: optional (sparse) accessibility matrix for the ontology

      pc: optional vector of concept probabilities, if pre-computed

     map: the OOmap component of an OOC

    inds: numeric vector of row indices of the object-ontology map
          to be processed

_D_e_t_a_i_l_s:

     For large ontologies, computation of the term accessibility
     relationships and term probabilities can be costly. Once these are
     computed to support one semsim calculation, they should be saved. 
     The acc and pc parameters allow use of this saved information.
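
     The save-and-reuse pattern described above might look like the
     following minimal sketch. It uses only base R (saveRDS and
     readRDS). The helper cached() and the stand-in slow_compute() are
     hypothetical names, not part of ontoTools; in practice the costly
     step would be whatever produces the accessibility matrix or the
     concept probabilities.

```r
# Hypothetical helper (not part of ontoTools): compute a value once,
# store it on disk, and reload it on later calls.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))        # reuse the saved result
  }
  value <- compute()             # expensive step, run only once
  saveRDS(value, path)
  value
}

# Stand-in for a costly computation; counts how often it actually runs.
n_runs <- 0
slow_compute <- function() {
  n_runs <<- n_runs + 1
  c(a = 0.5, b = 0.25, c = 0.25) # toy "concept probabilities"
}

pc_file   <- tempfile(fileext = ".rds")
pc_first  <- cached(pc_file, slow_compute)  # computes and saves
pc_second <- cached(pc_file, slow_compute)  # loads from disk
```

     Objects restored this way would then be passed through the acc
     and pc arguments instead of being recomputed on each call.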

_V_a_l_u_e:

     semsim returns the measure of semantic similarity described by
     Lord et al. (2003).
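
     As a rough illustration of this kind of measure, the sketch below
     computes a Resnik-style similarity on a toy four-term DAG in base
     R: minus the log of the smallest concept probability among the
     terms that subsume both inputs. That this is the form of the
     measure is an assumption suggested by the function names pms and
     subsumers; it is not a copy of the ontoTools implementation. All
     names (acc_toy, pc_toy, toy_subsumers, toy_semsim) and all counts
     are invented for the example.

```r
# Toy ontology (edges point parent -> child): root -> A, root -> B, A -> A1.
terms <- c("root", "A", "B", "A1")
adj <- matrix(0, 4, 4, dimnames = list(terms, terms))
adj["root", "A"] <- 1
adj["root", "B"] <- 1
adj["A", "A1"]   <- 1

# Accessibility matrix: acc_toy[i, j] is 1 when term j can be reached
# from term i (every term reaches itself). Square repeatedly until the
# matrix stops changing.
acc_toy <- adj + diag(4)
repeat {
  nxt <- ((acc_toy %*% acc_toy) > 0) * 1
  if (all(nxt == acc_toy)) break
  acc_toy <- nxt
}

# Invented direct usage counts; a concept's probability is the share of
# all usages attached to the concept itself or to any of its descendants.
direct <- c(root = 0, A = 2, B = 4, A1 = 2)
pc_toy <- drop(acc_toy %*% direct) / sum(direct)

# Terms from which both c1 and c2 are reachable.
toy_subsumers <- function(c1, c2, acc) {
  rownames(acc)[acc[, c1] > 0 & acc[, c2] > 0]
}

# Resnik-style similarity: -log of the minimum subsumer probability.
toy_semsim <- function(c1, c2, acc, pc) {
  -log(min(pc[toy_subsumers(c1, c2, acc)]))
}

toy_semsim("A1", "B", acc_toy, pc_toy)  # 0: only the root subsumes both
toy_semsim("A1", "A", acc_toy, pc_toy)  # log(2): A is the rarest subsumer
```

     In this sketch, terms whose only shared ancestor is the root score
     zero, and the score grows as the shared ancestor becomes rarer.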

_A_u_t_h_o_r(_s):

     Vince Carey <stvjc@channing.harvard.edu>

_R_e_f_e_r_e_n_c_e_s:

     Lord PW et al. (2003). Bioinformatics 19(10):1275.

_E_x_a_m_p_l_e_s:

     #
     # we are given a graph of GOMF and the OOmap between LL and GOMF
     # derived from humanLLMappings and stored as data resources in
     # ontoTools -- these will have to be updated regularly
     #
     data(goMFgraph.1.10)
     data(LL2GOMFooMap.1.10)
     #
     # build the rooted DAG, the ontology, and the OOC objects
     #
     gomfrDAG <- new("rootedDAG", root="GO:0003674", DAG=goMFgraph.1.10)
     GOMFonto <- new("ontology", name="GOMF", version="bioc GO 1.10", rDAG=gomfrDAG)
     LLGOMFOOC <- makeOOC(GOMFonto, LL2GOMFooMap.1.10)
     #
     # we are given the accessibility matrix for the GO MF graph as a 
     # data resource, and we can compute some term probabilities
     #
     data(goMFamat.1.10)
     pc <- conceptProbs(LLGOMFOOC, goMFamat.1.10, inds=1:20)
     #
     # now we will get a sample of GO MF terms and compute the
     # semantic similarities of pairs of terms in the sample
     #
     data(LL2GOMFcp.1.10) # full set of precomputed concept probabilities
     library(GO)
     library(Biobase)
     library(combinat)
     library(annotate)
     GO() # get the GO environments
     GOtags <- ls(envir=GOTERM)
     GOlabs <- mget(GOtags, envir=GOTERM, ifnotfound=NA)
     GOMFtags <- GOtags[ sapply(GOlabs,Ontology)=="MF" ]
     GOMFtags <- GOMFtags[!is.na(GOMFtags)]
     GOMFtermObs <- mget(GOMFtags, envir=GOTERM)
     GOMFterms <- sapply( GOMFtermObs, Term )
     ntags <- length(GOMFtags)
     if (any(duplicated(GOMFterms)))
      {
      dups <- (1:ntags)[duplicated(GOMFterms)]
      GOMFterms[dups] <- paste(GOMFterms[dups],".2",sep="")
      }
     #names(GOMFterms) <- GOMFtags
     set.seed(1234)
     # sample() does not give the same draw on every platform ...
     st <- sample(names(GOMFterms),size=50) # take the sample
     st <- intersect(st, names(LL2GOMFcp.1.10))[1:10] # use only those terms available in GO 1.10
     # ... so the ten terms are fixed explicitly here
     st <- c("GO:0004397", "GO:0030215", "GO:0042802", "GO:0008504", "GO:0008640",
             "GO:0008528", "GO:0008375", "GO:0005436", "GO:0004756", "GO:0003729")
     pst <- combn(st,2)   # get a matrix with the pairs of terms in columns
     npst <- ncol(pst)
     ss <- rep(NA,npst)
     for (i in seq_len(npst))  # compute semantic similarities
       {
       cat(i, "")  # progress counter, space-separated
       ss[i] <- semsim( pst[1,i], pst[2,i], ooc=LLGOMFOOC, acc=goMFamat.1.10, pc=LL2GOMFcp.1.10 )
       }
     print(summary(ss))
     # which() skips NA similarities; why NAs arise is still to be understood
     top <- which(ss==max(ss,na.rm=TRUE))[1]  # index of the most similar pair
     print( GOMFterms[ as.character(pst[,top]) ] )
     # second most similar pair; setdiff() keeps a tie at the maximum
     # from selecting the top pair again
     pen <- setdiff(which(ss==max(ss[-top],na.rm=TRUE)), top)[1]
     print( GOMFterms[ as.character(pst[,pen]) ] )

