Semantic Similarity Functions
Introduction
The go3 library provides several semantic similarity functions for comparing Gene Ontology (GO) terms. These measures rely on two main principles:
Information Content (IC) derived from GO annotations.
Graph-based topological relationships in the GO hierarchy.
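The text above does not spell out how IC is derived from annotations. A common definition, assumed here for illustration, is \(IC(t) = -\ln p(t)\), where \(p(t)\) is the fraction of annotations that fall on \(t\) or any of its descendants. The function name, the counts, and the choice of natural logarithm below are assumptions of this sketch, not go3's documented behaviour.

```python
import math


def information_content(term_count, total_count):
    """IC = -ln(p), with p the share of annotations at or below the term."""
    if term_count <= 0 or total_count <= 0:
        # IC is undefined for a term without annotations; this sketch
        # raises instead of guessing a fallback value.
        raise ValueError("counts must be positive")
    return -math.log(term_count / total_count)


# Hypothetical counts: a specific term with 10 of 1000 annotations.
print(information_content(10, 1000))
```

Under this definition a root term that covers every annotation has an IC of zero, and rarer terms score higher.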
Available Similarity Methods
| Method Name | String for method | Description |
|---|---|---|
| Resnik | resnik | Information content of the most informative common ancestor (MICA) |
| Lin | lin | Normalized Resnik similarity |
| Jiang-Conrath | | Inverse of Jiang-Conrath distance |
| SimRel | | Lin similarity with exponential relevance factor |
| Information Coefficient | | Normalized by minimum IC of the two terms |
| GraphIC | | IC divided by maximum graph depth |
| Wang | | Graph-based semantic similarity (Wang et al.) |
| TopoICSim | topoicsim | Topological and IC-based hybrid similarity |
Pass these strings as the method parameter of semantic_similarity and batch_similarity, or as the similarity parameter of compare_genes and compare_gene_pairs_batch:
sim = go3.semantic_similarity("GO:0006397", "GO:0008380", "lin", counter)
sim = go3.semantic_similarity("GO:0006397", "GO:0008380", "topoicsim", counter)
Similarity Measures
Resnik Similarity
The Resnik similarity [1] defines the similarity between two GO terms as the information content (IC) of their Most Informative Common Ancestor (MICA):
\[ \mathrm{sim}_{\mathrm{Resnik}}(t_1, t_2) = IC(\mathrm{MICA}(t_1, t_2)) \]
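As a minimal sketch of the MICA lookup, the example below uses a toy ontology. The term names, the parent map, and the IC values are all hypothetical, and the code is a plain-Python illustration rather than go3's implementation.

```python
# Toy ontology: child -> list of parents (hypothetical terms).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
}
# Hypothetical, precomputed IC values.
IC = {"root": 0.0, "A": 1.2, "B": 0.9, "C": 2.5, "D": 3.1}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def resnik(t1, t2):
    """IC of the most informative common ancestor (MICA)."""
    common = ancestors(t1) & ancestors(t2)
    return max(IC[t] for t in common)


print(resnik("C", "D"))  # 1.2, the IC of the shared ancestor "A"
```

Terms that share only the root score 0.0 here, because the root carries no information.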
Lin Similarity
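Lin’s similarity [2] normalizes Resnik’s similarity by the sum of the ICs of both terms:
\[ \mathrm{sim}_{\mathrm{Lin}}(t_1, t_2) = \frac{2 \, IC(\mathrm{MICA}(t_1, t_2))}{IC(t_1) + IC(t_2)} \]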
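The normalization can be sketched directly from three IC values: twice the IC of the MICA, divided by the sum of the two terms' ICs. The function name and the handling of a zero denominator are assumptions of this example.

```python
def lin(ic_t1, ic_t2, ic_mica):
    """Lin similarity from precomputed IC values."""
    denom = ic_t1 + ic_t2
    if denom == 0.0:
        # Both terms carry no information (for example, two root terms);
        # returning 0.0 is a choice made for this sketch.
        return 0.0
    return 2.0 * ic_mica / denom


print(lin(2.5, 3.1, 1.2))  # hypothetical IC values; roughly 0.43
```

Because the IC of the MICA cannot exceed the IC of either term, the result stays in the range 0 to 1 and reaches 1 when a term is compared with itself.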
Jiang-Conrath Similarity
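Jiang and Conrath [3] define a distance between two GO terms based on IC:
\[ d_{\mathrm{JC}}(t_1, t_2) = IC(t_1) + IC(t_2) - 2 \, IC(\mathrm{MICA}(t_1, t_2)) \]
Similarity is then calculated as:
\[ \mathrm{sim}_{\mathrm{JC}}(t_1, t_2) = \frac{1}{1 + d_{\mathrm{JC}}(t_1, t_2)} \]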
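A short sketch of the distance-to-similarity conversion follows. It assumes the common form \(1 / (1 + d)\) for the inverse; the function name is hypothetical and the exact transformation used by go3 may differ.

```python
def jiang_conrath(ic_t1, ic_t2, ic_mica):
    """Jiang-Conrath similarity from precomputed IC values."""
    # Distance: the information not shared by the two terms.
    distance = ic_t1 + ic_t2 - 2.0 * ic_mica
    # Assumed conversion: identical terms (distance 0) score 1.0.
    return 1.0 / (1.0 + distance)


print(jiang_conrath(2.5, 3.1, 1.2))  # hypothetical IC values; roughly 0.24
```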
SimRel Similarity
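The SimRel measure [4] combines Lin’s similarity with an exponential relevance factor that down-weights uninformative common ancestors:
\[ \mathrm{sim}_{\mathrm{Rel}}(t_1, t_2) = \mathrm{sim}_{\mathrm{Lin}}(t_1, t_2) \cdot \left(1 - e^{-IC(\mathrm{MICA}(t_1, t_2))}\right) \]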
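The relevance factor can be sketched as \(1 - e^{-IC(\mathrm{MICA})}\), which equals one minus the annotation probability of the MICA when IC is computed with the natural logarithm. That logarithm base and the function name are assumptions of this example.

```python
import math


def sim_rel(ic_t1, ic_t2, ic_mica):
    """SimRel: Lin similarity scaled by the relevance of the MICA."""
    denom = ic_t1 + ic_t2
    if denom == 0.0:
        # No information in either term; 0.0 is this sketch's convention.
        return 0.0
    lin = 2.0 * ic_mica / denom
    # Close to 0 for a generic MICA, close to 1 for a specific one.
    relevance = 1.0 - math.exp(-ic_mica)
    return lin * relevance


print(sim_rel(2.5, 3.1, 1.2))  # hypothetical IC values; roughly 0.30
```

Unlike Lin's measure, a term compared with itself scores below 1 when the term is generic, since the relevance factor depends only on the IC of the MICA.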
Information Coefficient
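Li et al. [5] propose a normalization using the minimum IC of the two terms:
\[ \mathrm{sim}_{\mathrm{IC}}(t_1, t_2) = \frac{IC(\mathrm{MICA}(t_1, t_2))}{\min\left(IC(t_1), IC(t_2)\right)} \]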
GraphIC Similarity
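The GraphIC measure scales the IC of the MICA by the maximum graph depth of the two terms:
\[ \mathrm{sim}_{\mathrm{GraphIC}}(t_1, t_2) = \frac{IC(\mathrm{MICA}(t_1, t_2))}{\max\left(\mathrm{depth}(t_1), \mathrm{depth}(t_2)\right)} \]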
Wang Similarity
The Wang similarity [6] considers the graph structure of GO by propagating weights from each term through its ancestors.
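Each term contributes fully to itself, \(S_t(t) = 1\), and each ancestor \(x\) receives \(S_t(x) = \max \{ w \cdot S_t(x') \}\) over its children \(x'\) in \(A(t)\), where \(w\) is the decay factor (usually \(w = 0.8\)). The total semantic value is \(SV(t) = \sum_{x \in A(t)} S_t(x)\). The similarity is computed as:
\[ \mathrm{sim}_{\mathrm{Wang}}(t_1, t_2) = \frac{\sum_{x \in A(t_1) \cap A(t_2)} \left( S_{t_1}(x) + S_{t_2}(x) \right)}{SV(t_1) + SV(t_2)} \]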
where
\(A(t)\) is the set of ancestors of term \(t\) (including itself),
\(S_t(x)\) is the semantic contribution of ancestor \(x\) to term \(t\),
\(SV(t)\) is the total semantic value of term \(t\).
The key idea is that ancestors closer to the term contribute more to its meaning than distant ancestors, capturing the hierarchical semantics of the ontology without relying on external annotation statistics.
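The weight propagation can be sketched on a toy ontology as follows. The parent map and term names are hypothetical, and a single decay factor of 0.8 is applied to every edge. This is a simplification for illustration, not go3's implementation.

```python
# Toy ontology: child -> list of parents (hypothetical terms).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A", "B"],
}
W = 0.8  # decay factor applied per edge


def s_values(term):
    """Semantic contribution S_term(x) of every ancestor x of `term`."""
    s = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in PARENTS[child]:
            candidate = W * s[child]
            # Keep the largest contribution over all paths to the ancestor.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                frontier.append(parent)
    return s


def wang(t1, t2):
    """Shared semantic contributions over the total semantic values."""
    s1, s2 = s_values(t1), s_values(t2)
    common = s1.keys() & s2.keys()
    shared = sum(s1[x] + s2[x] for x in common)
    return shared / (sum(s1.values()) + sum(s2.values()))


print(wang("C", "D"))  # roughly 0.51 on this toy graph
```

No IC table appears anywhere in the sketch, which reflects the point made above: the measure depends only on the graph.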
TopoICSim Similarity
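The TopoICSim similarity [7] is a hybrid measure that combines information content and the topology of the GO graph. Following [7], it is defined as:
\[ \mathrm{sim}_{\mathrm{TopoICSim}}(t_1, t_2) = 1 - \frac{2}{\pi} \arctan\left( \min_{x \in DCA(t_1, t_2)} \frac{wSP(t_1, x) + wSP(t_2, x)}{wLP(x, r)} \right) \]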
where
\(DCA(t_1, t_2)\) is the set of disjunctive common ancestors of \(t_1\) and \(t_2\),
\(wSP(t, x)\) is the weighted shortest path (sum of inverse ICs) from \(t\) to ancestor \(x\),
\(wLP(x, r)\) is the weighted longest path from \(x\) to a root \(r\) in the ontology.
This measure accounts for both the specificity of the common ancestors and the topological distance between the terms.
Batch Computation
All of these similarity measures are also available in batch form through batch_similarity and compare_gene_pairs_batch, which use Rust’s parallelism to process many pairs at once.
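The sketch below is a sequential, plain-Python reference for a batch call over two equal-length lists. It assumes that "pairwise" means the i-th term of the first list is compared with the i-th term of the second, which is suggested by the requirement that both lists have the same length. The function batch_similarity_reference and the toy scoring function are hypothetical stand-ins, not part of go3.

```python
def batch_similarity_reference(list1, list2, pair_similarity):
    """Score list1[i] against list2[i] for every index i."""
    if len(list1) != len(list2):
        raise ValueError("input lists must have the same length")
    return [pair_similarity(a, b) for a, b in zip(list1, list2)]


def toy_similarity(a, b):
    """Hypothetical stand-in for a term-to-term measure."""
    return 1.0 if a == b else 0.0


scores = batch_similarity_reference(
    ["GO:0006397", "GO:0008380"],
    ["GO:0006397", "GO:0006412"],
    toy_similarity,
)
print(scores)  # [1.0, 0.0]
```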
Bibliography
[1] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI), 448–453. 1995.
[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning (ICML), 296–304. 1998.
[3] J. J. Jiang and D. W. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics (ROCLING), 19–33. 1997.
[4] A. Schlicker, F. S. Domingues, J. Rahnenführer, and T. Lengauer. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics, 7(1):302, 2006.
[5] B. Li, J. Z. Wang, F. A. Feltus, J. Zhou, and F. Luo. Effectively integrating information content and structural relationship to improve the GO-based similarity measure. Journal of Biomedical Informatics, 43(5):752–760, 2010.
[6] J. Z. Wang, Z. Du, R. Payattakool, P. S. Yu, and C. F. Chen. A new method to measure the semantic similarity of GO terms. Bioinformatics, 23(10):1274–1281, 2007.
[7] R. Ehsani and F. Drabløs. TopoICSim: a new semantic similarity measure based on Gene Ontology. BMC Bioinformatics, 17:296, 2016. doi:10.1186/s12859-016-1160-0.
- batch_similarity(list1, list2, method, counter)
Compute pairwise semantic similarity in batch using a selected method.
- Parameters:
list1 (list of str) – First list of GO term IDs.
list2 (list of str) – Second list of GO term IDs.
method (str) – Name of the similarity method.
counter (TermCounter) – Precomputed IC values.
- Returns:
List of similarity scores.
- Return type:
list of float
- Raises:
ValueError – If input lists differ in length or method is unknown.
- compare_gene_pairs_batch(pairs, ontology, similarity, groupwise, counter)
Compute semantic similarity between genes in batches.
- Parameters:
pairs (list of (str, str)) – List of gene pairs for which to compute the semantic similarity.
ontology (str) – Name of the subontology of GO to use: BP, MF or CC.
similarity (str) – Name of the similarity method.
groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”.
counter (TermCounter) – Precomputed IC values.
- Returns:
List of similarity scores.
- Return type:
list of float
- Raises:
ValueError – If similarity or groupwise is unknown.
- compare_genes(gene1, gene2, ontology, similarity, groupwise, counter)
Compute semantic similarity between genes.
- Parameters:
gene1 (str) – Gene symbol of the first gene.
gene2 (str) – Gene symbol of the second gene.
ontology (str) – Name of the subontology of GO to use: BP, MF or CC.
similarity (str) – Name of the similarity method.
groupwise (str) – Combination method to generate the similarities between genes. Options: “bma”, “max”.
counter (TermCounter) – Precomputed IC values.
- Returns:
Similarity score.
- Return type:
float
- Raises:
ValueError – If similarity or groupwise is unknown.
- semantic_similarity(id1, id2, method, counter)
Compute semantic similarity between two GO terms using a selected method.
- Parameters:
id1 (str) – First GO term ID.
id2 (str) – Second GO term ID.
method (str) – Name of the similarity method. Options: “resnik”, “lin”, etc.
counter (TermCounter) – Precomputed IC values.
- Returns:
Similarity score.
- Return type:
float
- Raises:
ValueError – If the method is unknown.
- term_ic(go_id, counter)
Compute the Information Content (IC) of a GO term.
- Parameters:
go_id (str) – GO term identifier.
counter (TermCounter) – Precomputed term counter with IC values.
- Returns:
The IC of the GO term.
- Return type:
float
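The groupwise options "bma" and "max" named for compare_genes and compare_gene_pairs_batch are not defined further in this reference. The sketch below assumes the usual readings: "max" takes the best term-to-term score, and "bma" is the best-match average, here in the variant that pools the row and column maxima of the term-by-term similarity matrix. The function name and the matrix values are hypothetical.

```python
def combine(matrix, groupwise):
    """Combine a term-by-term similarity matrix into one gene-level score.

    Rows are the GO terms of the first gene, columns those of the second.
    """
    if not matrix or not matrix[0]:
        raise ValueError("similarity matrix must not be empty")
    # Best match in the other gene for each term of either gene.
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    if groupwise == "max":
        return max(row_best)
    if groupwise == "bma":
        return (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best))
    raise ValueError(f"unknown groupwise method: {groupwise!r}")


# Hypothetical scores: gene 1 has three terms, gene 2 has two.
scores = [
    [0.9, 0.2],
    [0.4, 0.6],
    [0.1, 0.3],
]
print(combine(scores, "max"))  # 0.9
print(combine(scores, "bma"))  # roughly 0.66
```

Under these readings "max" rewards a single strong match, while "bma" also penalizes terms that have no good counterpart in the other gene.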