Semantic Similarity Functions

Introduction

The go3 library provides several semantic similarity functions for comparing Gene Ontology (GO) terms. These measures rely on two main principles:

  • Information Content (IC) derived from GO annotations.

  • Graph-based topological relationships in the GO hierarchy.

Available Similarity Methods

Similarity Methods for the `method` Parameter

Method Name              String for method   Description
-----------------------  ------------------  ------------------------------------------------------------------
Resnik                   resnik              Information content of the most informative common ancestor (MICA)
Lin                      lin                 Normalized Resnik similarity
Jiang-Conrath            jc                  Inverse of Jiang-Conrath distance
SimRel                   simrel              Lin similarity with exponential relevance factor
Information Coefficient  iccoef              Normalized by minimum IC of the two terms
GraphIC                  graphic             IC divided by maximum graph depth
Wang                     wang                Graph-based semantic similarity (Wang et al.)
TopoICSim                topoicsim           Topological and IC-based hybrid similarity

Pass any of these strings as the method argument of the go3 term-level similarity functions (the gene-level functions take the same strings through their similarity parameter):

sim = go3.semantic_similarity("GO:0006397", "GO:0008380", "lin", counter)
sim = go3.semantic_similarity("GO:0006397", "GO:0008380", "topoicsim", counter)

Similarity Measures

Resnik Similarity

The Resnik similarity [1] measures the similarity between two GO terms as the information content (IC) of their Most Informative Common Ancestor (MICA):

\[\mathrm{Sim}_{Resnik}(t_1, t_2) = IC(\mathrm{MICA}(t_1, t_2))\]
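The definition can be illustrated on a toy ontology. This is a minimal sketch, not go3's implementation: the term names, the `parents` map, and the IC values are invented for illustration.

```python
# Toy DAG: child -> list of parents (hypothetical terms, not real GO IDs).
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["a"],
}
# Hypothetical information-content values for each term.
ic = {"root": 0.0, "a": 1.2, "b": 0.9, "c": 2.5, "d": 2.0}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(t1, t2):
    """IC of the most informative common ancestor (0.0 if there is none)."""
    common = ancestors(t1) & ancestors(t2)
    return max((ic[x] for x in common), default=0.0)


# "c" and "d" share the ancestors "a" and "root"; "a" is the most informative.
print(resnik("c", "d"))  # 1.2
```

Because the score is an IC value, it is unbounded above and depends on the annotation corpus used to compute IC.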

Lin Similarity

Lin’s similarity [2] normalizes Resnik’s similarity by the sum of the ICs of both terms:

\[\mathrm{Sim}_{Lin}(t_1, t_2) = \frac{2 \times IC(\mathrm{MICA}(t_1, t_2))}{IC(t_1) + IC(t_2)}\]
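As a worked example, the formula can be evaluated directly from three IC values. The function name and the numbers below are hypothetical, and returning 0.0 when both terms have zero IC is an assumption of this sketch rather than documented go3 behaviour.

```python
def lin_similarity(ic_t1, ic_t2, ic_mica):
    """Lin similarity from the ICs of both terms and of their MICA."""
    denom = ic_t1 + ic_t2
    if denom == 0.0:
        return 0.0  # assumption: both terms carry no information
    return 2.0 * ic_mica / denom


# Hypothetical values: IC(t1) = 2.5, IC(t2) = 2.0, IC(MICA) = 1.2.
print(round(lin_similarity(2.5, 2.0, 1.2), 3))  # 0.533
```

For identical terms the MICA is the term itself, so the score is 1; this bounds Lin similarity to the range [0, 1].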

Jiang-Conrath Similarity

Jiang and Conrath define a distance between two GO terms based on IC [3]:

\[d_{JC} = IC(t_1) + IC(t_2) - 2 \times IC(\mathrm{MICA})\]

Similarity is then calculated as:

\[\mathrm{Sim}_{JC} = \frac{1}{d_{JC}}\]
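The two steps above can be sketched as follows. The function name and the input values are hypothetical. Note that the distance is zero for identical terms, so the inverse is undefined there; this sketch returns infinity in that case, which is an assumption and not necessarily what go3 does.

```python
import math


def jc_similarity(ic_t1, ic_t2, ic_mica):
    """Inverse of the Jiang-Conrath distance."""
    distance = ic_t1 + ic_t2 - 2.0 * ic_mica
    if distance <= 0.0:
        return math.inf  # assumption: identical terms have zero distance
    return 1.0 / distance


# Hypothetical values: distance = 2.5 + 2.0 - 2 * 1.2 = 2.1.
print(round(jc_similarity(2.5, 2.0, 1.2), 3))  # 0.476
```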

SimRel Similarity

The SimRel measure [4] combines Lin’s similarity with an exponential relevance factor:

\[\mathrm{Sim}_{Rel} = \left( \frac{2 \times IC(\mathrm{MICA})}{IC(t_1) + IC(t_2)} \right) \times \left(1 - e^{-IC(\mathrm{MICA})}\right)\]
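A minimal sketch of the formula, using hypothetical IC values, shows how the relevance factor damps the Lin score when the MICA is uninformative. The function name and the zero-denominator fallback are assumptions of this example.

```python
import math


def simrel_similarity(ic_t1, ic_t2, ic_mica):
    """Lin similarity scaled by the relevance factor 1 - exp(-IC(MICA))."""
    denom = ic_t1 + ic_t2
    if denom == 0.0:
        return 0.0  # assumption: both terms carry no information
    lin = 2.0 * ic_mica / denom
    relevance = 1.0 - math.exp(-ic_mica)
    return lin * relevance


# Two identical shallow terms: Lin would give 1.0, SimRel gives less.
print(round(simrel_similarity(0.5, 0.5, 0.5), 3))  # 0.393
# Two identical specific terms: the relevance factor is close to 1.
print(round(simrel_similarity(6.0, 6.0, 6.0), 3))  # 0.998
```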

Information Coefficient

Li et al. [5] propose a normalization using the minimum IC of the two terms:

\[\mathrm{IC\_coef} = \frac{IC(\mathrm{MICA})}{\min(IC(t_1), IC(t_2))}\]
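The formula above can be sketched in a few lines. The function name and inputs are hypothetical, and the 0.0 fallback for a zero minimum IC is an assumption of this example.

```python
def ic_coefficient(ic_t1, ic_t2, ic_mica):
    """IC of the MICA divided by the smaller of the two term ICs."""
    smallest = min(ic_t1, ic_t2)
    if smallest == 0.0:
        return 0.0  # assumption: avoid dividing by zero
    return ic_mica / smallest


# Hypothetical values: 1.2 / min(2.5, 2.0) = 0.6.
print(round(ic_coefficient(2.5, 2.0, 1.2), 3))  # 0.6
```

When one term is an ancestor of the other, the MICA is the ancestor itself and has the smaller IC, so the score is 1.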

GraphIC Similarity

The GraphIC measure divides the IC of the MICA by one plus the larger of the two terms' depths in the GO graph:

\[\mathrm{GraphIC} = \frac{IC(\mathrm{MICA})}{\max(\mathrm{depth}(t_1), \mathrm{depth}(t_2)) + 1}\]
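A minimal sketch of the formula, with hypothetical depths and IC values (the function name is also an assumption of this example):

```python
def graphic_similarity(depth_t1, depth_t2, ic_mica):
    """IC of the MICA divided by (max depth of the two terms + 1)."""
    return ic_mica / (max(depth_t1, depth_t2) + 1)


# Hypothetical values: terms at depths 2 and 4 with IC(MICA) = 1.2.
print(round(graphic_similarity(2, 4, 1.2), 3))  # 0.24
```

The `+ 1` keeps the denominator positive even when both terms sit at depth 0.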

Wang Similarity

The Wang similarity [6] considers the graph structure of GO by propagating weights from each term through its ancestors.

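A term contributes 1 to itself, and the contribution decays by a factor \(w\) (usually \(w = 0.8\)) with every edge traversed towards the root; when several paths reach the same ancestor, the largest contribution is kept. The similarity is computed as: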

\[\mathrm{Sim}_{Wang}(t_1, t_2) = \frac{ \sum_{x \in A(t_1) \cap A(t_2)} \left( S_{t_1}(x) + S_{t_2}(x) \right) }{ SV(t_1) + SV(t_2) }\]

where

  • \(A(t)\) is the set of ancestors of term \(t\) (including itself),

  • \(S_t(x)\) is the semantic contribution of ancestor \(x\) to term \(t\),

  • \(SV(t)\) is the total semantic value of term \(t\), i.e. the sum of \(S_t(x)\) over all \(x \in A(t)\).

The key idea is that ancestors closer to the term contribute more to its meaning than distant ancestors, capturing the hierarchical semantics of the ontology without relying on external annotation statistics.
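The propagation can be sketched on a toy graph. This is an illustrative sketch, not go3's implementation: the `parents` map and term names are invented, and a single decay factor is applied to every edge (Wang et al. assign different weights to different relation types).

```python
# Toy DAG: child -> list of parents (hypothetical terms, not real GO IDs).
parents = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],
    "d": ["a"],
}


def semantic_contributions(term, w=0.8):
    """Map each ancestor x of `term` (including itself) to S_term(x)."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            candidate = w * s[node]
            # Keep the largest contribution over all paths to this ancestor.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)
    return s


def wang_similarity(t1, t2, w=0.8):
    s1 = semantic_contributions(t1, w)
    s2 = semantic_contributions(t2, w)
    shared = s1.keys() & s2.keys()
    numerator = sum(s1[x] + s2[x] for x in shared)
    # SV(t) is the sum of all contributions to t.
    return numerator / (sum(s1.values()) + sum(s2.values()))


print(round(wang_similarity("c", "d"), 3))  # 0.507
```

Here "c" and "d" share "a" (contribution 0.8 to each) and "root" (0.64 to each), giving 2.88 in the numerator against a total semantic value of 3.24 + 2.44.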

TopoICSim Similarity

The TopoICSim similarity [7] is a hybrid measure that combines information content and the topology of the GO graph. It is defined as:

\[\mathrm{Sim}_{TopoICSim}(t_1, t_2) = 1 - \frac{2}{\pi} \arctan \left( \min_{x \in DCA(t_1, t_2)} \frac{wSP(t_1, x) + wSP(t_2, x)}{wLP(x, r)} \right)\]

where

  • \(DCA(t_1, t_2)\) is the set of disjunctive common ancestors of \(t_1\) and \(t_2\),

  • \(wSP(t, x)\) is the weighted shortest path (sum of inverse ICs) from \(t\) to ancestor \(x\),

  • \(wLP(x, r)\) is the weighted longest path from \(x\) to a root \(r\) in the ontology.

This measure captures both the specificity of the common ancestors and the topological distance between terms, providing a robust similarity score.
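The final step of the formula, the minimum over common ancestors followed by the arctangent mapping, can be sketched as below. The path weights are passed in as precomputed numbers; the function name and the values are hypothetical, and computing \(wSP\) and \(wLP\) from the graph is outside the scope of this sketch.

```python
import math


def topoicsim_from_paths(candidates):
    """Combine path weights into a TopoICSim score.

    `candidates` holds one (wSP(t1, x), wSP(t2, x), wLP(x, r)) tuple
    per disjunctive common ancestor x.
    """
    ratio = min((wsp1 + wsp2) / wlp for wsp1, wsp2, wlp in candidates)
    # arctan maps [0, inf) onto [0, pi/2), so the score lies in (0, 1].
    return 1.0 - (2.0 / math.pi) * math.atan(ratio)


# Two hypothetical common ancestors; the second has the smaller ratio (1.0).
print(round(topoicsim_from_paths([(3.0, 3.0, 2.0), (1.0, 1.0, 2.0)]), 3))  # 0.5
```

A ratio of 0 (both terms coincide with the common ancestor) gives a score of 1, and the score falls towards 0 as the terms move further from their common ancestor relative to its distance from the root.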

Batch Computation

Every similarity measure above also has a batch version in go3. The batch functions are implemented in Rust and process their inputs in parallel.

Bibliography

[1] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), 448–453. 1995.

[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning (ICML), 296–304. 1998.

[3] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics (ROCLING), 19–33. 1997.

[4] A. Schlicker, F. S. Domingues, J. Rahnenführer, and T. Lengauer. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics, 7(1):302, 2006.

[5] B. Li, J. Z. Wang, F. A. Feltus, J. Zhou, and F. Luo. Effectively integrating information content and structural relationship to improve the GO-based similarity measure. Journal of Biomedical Informatics, 43(5):752–760, 2010.

[6] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C.-F. Chen. A new method to measure the semantic similarity of GO terms. Bioinformatics, 23(10):1274–1281, 2007.

[7] R. Ehsani and F. Drabløs. TopoICSim: a new semantic similarity measure based on Gene Ontology. BMC Bioinformatics, 17:296, 2016. doi:10.1186/s12859-016-1160-0.

batch_similarity(list1, list2, method, counter)

Compute pairwise semantic similarity in batch using a selected method.

Parameters:
  • list1 (list of str) – First list of GO term IDs.

  • list2 (list of str) – Second list of GO term IDs.

  • method (str) – Name of the similarity method.

  • counter (TermCounter) – Precomputed IC values.

Returns:

List of similarity scores.

Return type:

list of float

Raises:

ValueError – If input lists differ in length or method is unknown.
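The length check suggests that the two lists are paired element by element. Under that reading, a pure-Python reference for the behaviour might look like the sketch below; `batch_similarity_sketch` and `toy_score` are hypothetical names for illustration and are not part of go3.

```python
def batch_similarity_sketch(list1, list2, score):
    """Score list1[i] against list2[i] for every index i."""
    if len(list1) != len(list2):
        raise ValueError("input lists must have the same length")
    return [score(a, b) for a, b in zip(list1, list2)]


def toy_score(a, b):
    """Hypothetical stand-in scorer: 1.0 for identical IDs, else 0.0."""
    return 1.0 if a == b else 0.0


result = batch_similarity_sketch(
    ["GO:0006397", "GO:0008380"],
    ["GO:0006397", "GO:0006412"],
    toy_score,
)
print(result)  # [1.0, 0.0]
```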

compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)

Compute semantic similarity between genes in batches.

Parameters:
  • pairs (list of (str, str)) – List of gene pairs for which to compute the semantic similarity.

  • ontology (str) – Name of the subontology of GO to use: BP, MF or CC.

  • similarity (str) – Name of the similarity method.

  • groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

List of similarity scores.

Return type:

list of float

Raises:

ValueError – If similarity or groupwise is unknown.

compare_genes(gene1, gene2, ontology, similarity, groupwise, counter)

Compute semantic similarity between genes.

Parameters:
  • gene1 (str) – Gene symbol of the first gene.

  • gene2 (str) – Gene symbol of the second gene.

  • ontology (str) – Name of the subontology of GO to use: BP, MF or CC.

  • similarity (str) – Name of the similarity method.

  • groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float

Raises:

ValueError – If similarity or groupwise is unknown.
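The groupwise options reduce a matrix of term-to-term scores (rows: terms of the first gene, columns: terms of the second) to a single gene-level score. The sketch below illustrates one common definition of each; the function name and the matrix values are hypothetical, and the exact best-match-average variant used by go3 is not stated here.

```python
def combine_scores(matrix, groupwise):
    """Reduce a term-by-term similarity matrix to one score."""
    if not matrix or not matrix[0]:
        return 0.0  # assumption: a gene without annotations scores 0
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    if groupwise == "max":
        # Highest similarity between any pair of terms.
        return max(row_best)
    if groupwise == "bma":
        # Best-match average: mean of every term's best match.
        return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))
    raise ValueError(f"unknown groupwise method: {groupwise}")


# Hypothetical scores: gene 1 has two terms, gene 2 has three.
scores = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
print(combine_scores(scores, "max"))            # 0.9
print(round(combine_scores(scores, "bma"), 2))  # 0.76
```

The maximum rewards a single strongly matching term pair, while the best-match average also accounts for terms that have no good counterpart in the other gene.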

semantic_similarity(id1, id2, method, counter)

Compute semantic similarity between two GO terms using a selected method.

Parameters:
  • id1 (str) – First GO term ID.

  • id2 (str) – Second GO term ID.

  • method (str) – Name of the similarity method. Options: “resnik”, “lin”, “jc”, “simrel”, “iccoef”, “graphic”, “wang”, “topoicsim”.

  • counter (TermCounter) – Precomputed IC values.

Returns:

Similarity score.

Return type:

float

Raises:

ValueError – If the method is unknown.

term_ic(go_id, counter)

Compute the Information Content (IC) of a GO term.

Parameters:
  • go_id (str) – GO term identifier.

  • counter (TermCounter) – Precomputed term counter with IC values.

Returns:

The IC of the GO term.

Return type:

float
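Information content is conventionally defined as the negative logarithm of a term's annotation frequency. The sketch below illustrates that definition with invented counts; the function name, the natural-log base, and the way counts are gathered are assumptions of this example and may differ from what TermCounter does.

```python
import math


def information_content(term_count, total_count):
    """IC = -log(term_count / total_count), written as log(total / term)."""
    if term_count <= 0 or total_count <= 0:
        raise ValueError("counts must be positive")
    return math.log(total_count / term_count)


# Hypothetical annotation counts (each includes annotations to descendants).
counts = {"root": 100, "a": 40, "c": 5}
for term, count in counts.items():
    print(term, round(information_content(count, counts["root"]), 3))
# root 0.0
# a 0.916
# c 2.996
```

Rarely annotated terms receive a high IC, and the root, which covers every annotation, receives 0.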