Metadata-Version: 2.2
Name: mimosa-tool
Version: 1.0.0
Summary: Model-Independent Motif Similarity Assessment tool
Author-Email: Anton Tsukanov <tsukanov@bionet.nsc.ru>
License: MIT
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Project-URL: Homepage, https://github.com/ubercomrade/mimosa
Project-URL: Repository, https://github.com/ubercomrade/mimosa
Project-URL: Documentation, https://github.com/ubercomrade/mimosa#readme
Requires-Python: >=3.10
Requires-Dist: numpy<2.4,>=2.0
Requires-Dist: numba>=0.62.0
Requires-Dist: scipy>=1.14.1
Requires-Dist: pandas>=2.2.3
Requires-Dist: joblib>=1.5.3
Description-Content-Type: text/markdown

# MIMOSA

Model-Independent Motif Similarity Assessment (MIMOSA) is a tool designed to support comparisons across different motif model types.

## Introduction

Transcription factors (TFs) serve as fundamental regulators of gene expression levels. These proteins modulate the activity of the RNA polymerase complex by binding to specific DNA sequences located within regulatory regions, such as promoters and enhancers [1]. The specific DNA segment recognized by a TF is termed a transcription factor binding site (TFBS). TFBSs for a given TF are typically similar but not identical; therefore, they are described using *motifs* that capture the variability of the recognized sequences [2]. A variety of high-throughput experimental methods, including ChIP-seq, HT-SELEX, and DAP-seq, are currently used to identify TFBS motifs [3-5]. While motifs are most frequently represented as Position Weight Matrices (PWMs), a standard supported by widely used *de novo* motif discovery tools like MEME [6], STREME [7], and HOMER [8], the field has increasingly adopted alternative models to capture complex nucleotide dependencies. These include diverse variants of Markov models (BaMMs, InMoDe, DIMONT, etc.) [9-14], which account for higher-order dependencies that PWMs ignore, as well as models based on locally positioned dinucleotides (SiteGA) [15-16] and deep learning architectures (DeepBind, DeeperBind, DeepGRN, etc.) [17-21].

The identification of a motif is only the first step; establishing its biological context requires robust comparison methods. Comparing motifs is essential for determining whether a newly discovered pattern represents a novel specificity or a variation of a known factor, for clustering redundant motifs identified across different experiments, and for inferring functional relationships between TFs based on binding similarity. Several established tools address this need, including Tomtom [22], STAMP [23], MACRO-APE [24] and MoSBAT [25]. These tools utilize various metrics and algorithms to quantify similarity, ranging from column-wise matrix correlations to Jaccard index-based comparisons of recognized site sets. However, a significant limitation of the current software ecosystem is its heavy reliance on matrix-based representations (PFMs or PWMs). This constraint makes it challenging to directly compare alternative models, such as Markov models or dinucleotide models, without converting them into simpler matrix formats, a process that often results in information loss.

To address these limitations, we introduce MIMOSA, a comprehensive framework designed to facilitate the comparison of diverse motif models beyond standard frequency matrices. MIMOSA implements four distinct modes of comparison to accommodate various analytical needs. The first and most universal mode involves the direct comparison of TFBS recognition profiles generated by different motifs, conceptually similar to affinity-based approaches [25]. This allows for the assessment of similarity based on the functional output of the models (the scores assigned to sequences) rather than their internal parameters. The second mode leverages the same underlying approach but allows the user to explicitly define the model architecture; currently, MIMOSA supports three specific model types: PWM, BaMM, and SiteGA, with an extensible architecture designed to accommodate future model types. The third mode incorporates MoTaLi ([see details](https://github.com/parthian-sterlet/motali)). Finally, the fourth mode provides Tomtom-like functionality for scenarios where models can be represented as N-dimensional matrices. In this mode, if the models are in compatible matrix formats, they are compared using standard metrics such as Pearson Correlation Coefficient (PCC), Euclidean Distance (ED), and Cosine similarity. Crucially, if the models are of heterogeneous types (e.g., comparing a BaMM to a PWM), MIMOSA scans sequences to generate recognition profiles, which are then used to reconstruct compatible Position Frequency Matrices (PFMs) for comparison, ensuring that even fundamentally different model types can be quantitatively evaluated within a single framework.

### Methodology

#### Similarity Metrics

MIMOSA implements several metrics to quantify the resemblance between motif importance profiles or matrix columns.

**Continuous Jaccard (CJ)**
The Continuous Jaccard index extends the classical Jaccard similarity to continuous-valued vectors $v_1, v_2$. It is defined as the ratio of the sum of element-wise intersections to the sum of element-wise unions:
$$\text{CJ}(v_1, v_2) = \frac{\sum_i \min(v_1^i, v_2^i)}{\sum_i \max(v_1^i, v_2^i)}$$
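For non-negative profiles, this metric can be read as a threshold-aggregated Jaccard index: the numerator and denominator equal the intersection and union sizes of the thresholded profiles, integrated over all possible thresholds. It therefore provides a threshold-independent measure of profile similarity.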
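
As a minimal NumPy sketch of the formula above (not MIMOSA's internal implementation), the function name `continuous_jaccard` and the zero-denominator convention are assumptions made for illustration:

```python
import numpy as np


def continuous_jaccard(v1, v2):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    denom = np.maximum(v1, v2).sum()
    # Assumed convention: two all-zero profiles get similarity 0.
    return float(np.minimum(v1, v2).sum() / denom) if denom > 0 else 0.0


a = np.array([0.0, 1.0, 3.0, 2.0])
b = np.array([1.0, 1.0, 2.0, 0.0])
cj = continuous_jaccard(a, b)  # minima sum to 3, maxima sum to 7
```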

**Continuous Overlap (CO)**
The Continuous Overlap coefficient (or Szymkiewicz-Simpson coefficient) measures the subset relationship between two profiles, normalizing the intersection by the smaller of the two total affinities:
$$\text{CO}(v_1, v_2) = \frac{\sum_i \min(v_1^i, v_2^i)}{\min\left(\sum_i v_1^i, \sum_i v_2^i\right)}$$
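
The subset behaviour can be illustrated with a short NumPy sketch. The function name `continuous_overlap` and the handling of an all-zero profile are assumptions, not part of the MIMOSA API:

```python
import numpy as np


def continuous_overlap(v1, v2):
    """Sum of element-wise minima divided by the smaller total."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    denom = min(v1.sum(), v2.sum())
    # Assumed convention: an all-zero profile gives similarity 0.
    return float(np.minimum(v1, v2).sum() / denom) if denom > 0 else 0.0


# `small` never exceeds `large`, so it is fully "contained" in it.
small = np.array([0.0, 1.0, 1.0, 0.0])
large = np.array([1.0, 2.0, 3.0, 1.0])
co = continuous_overlap(small, large)
```

Here the minima sum to 2 and the smaller total is also 2, so the coefficient reaches its maximum even though the two profiles differ in magnitude.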

**Pearson Correlation Coefficient (PCC)**
For linear correlation between profiles or motif columns, the PCC is calculated as:
$$\text{PCC}(v_1, v_2) = \frac{\sum_i (v_1^i - \bar{v}_1)(v_2^i - \bar{v}_2)}{\sqrt{\sum_i (v_1^i - \bar{v}_1)^2 \sum_i (v_2^i - \bar{v}_2)^2}}$$
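
A direct transcription of this formula into NumPy might look as follows. The function name `pearson` and the choice to return 0 for a constant vector are assumptions for this sketch:

```python
import numpy as np


def pearson(v1, v2):
    """Pearson correlation written term by term as in the formula."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    d1 = v1 - v1.mean()
    d2 = v2 - v2.mean()
    denom = np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum())
    # Assumed convention: a constant vector has undefined PCC; return 0.
    return float((d1 * d2).sum() / denom) if denom > 0 else 0.0


x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])
r_linear = pearson(x, 2 * x + 1)   # perfectly linear, positive slope
r_inverse = pearson(x, -x)         # perfectly linear, negative slope
r_mixed = pearson(x, y)
```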

#### Null Hypothesis and Surrogate Generation

To estimate the statistical significance (p-values) of observed similarity scores, MIMOSA employs a **Surrogate Null Model**. Unlike simple permutations that destroy local dependencies, our tool generates synthetic "surrogate" profiles that preserve the marginal properties and biological plausibility (smoothness) of the original data.

1. **Convolutional Distortion**: For profile-based surrogates, the profile is distorted in four steps:
    * **Kernel Selection**: A base kernel (smooth, edge, or double-peak) is selected to represent typical profile features.
    * **Controlled Perturbation**: Noise and gradient bias are added to introduce variation while maintaining structural integrity.
    * **Smoothing**: Convolution ensures the surrogate remains biologically realistic.
    * **Convex Combination**: The final surrogate is a blend of the identity kernel and the distorted kernel, controlled by a user-defined distortion parameter.

2. **Permutation**: For matrix-based comparisons (`tomtom-like`), the tool performs random column-wise permutations.

This methodology ensures that the null distribution reflects realistic background similarity.
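
The convolutional distortion steps can be sketched roughly as below. This is an illustrative reconstruction, not MIMOSA's code: the Hann-shaped base kernel, the noise and gradient magnitudes, the function names `make_surrogate` and `empirical_p_value`, the add-one p-value correction, and the way the null scores are paired are all assumptions.

```python
import numpy as np


def make_surrogate(profile, distortion, kernel_size=5, rng=None):
    """Blend an identity kernel with a perturbed smooth kernel and convolve."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    rng = np.random.default_rng() if rng is None else rng
    profile = np.asarray(profile, dtype=float)

    # Kernel selection: a smooth bump (assumed shape).
    base = np.hanning(kernel_size + 2)[1:-1]
    # Controlled perturbation: small noise plus a linear gradient bias.
    noise = rng.normal(0.0, 0.1, size=kernel_size)
    gradient = rng.uniform(-0.2, 0.2) * np.linspace(-1.0, 1.0, kernel_size)
    distorted = np.clip(base + noise + gradient, 0.0, None)
    if distorted.sum() == 0:
        distorted = base.copy()
    distorted /= distorted.sum()

    # Convex combination of the identity kernel and the distorted kernel.
    identity = np.zeros(kernel_size)
    identity[kernel_size // 2] = 1.0
    kernel = (1.0 - distortion) * identity + distortion * distorted

    # Smoothing: convolution keeps the surrogate the same length.
    return np.convolve(profile, kernel, mode="same")


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed one."""
    null_scores = np.asarray(null_scores, dtype=float)
    hits = np.count_nonzero(null_scores >= observed)
    return (1 + hits) / (1 + null_scores.size)  # add-one correction (assumed)


def cj(v1, v2):
    return np.minimum(v1, v2).sum() / np.maximum(v1, v2).sum()


rng = np.random.default_rng(0)
profile, other = rng.random(60), rng.random(60)
observed = cj(profile, other)
# One plausible null arrangement: surrogates of one profile vs. the other.
null = [cj(make_surrogate(profile, 0.5, rng=rng), other) for _ in range(200)]
p_value = empirical_p_value(observed, null)
```

With `distortion=0.0` the blended kernel reduces to the identity kernel and the surrogate equals the input profile; larger values move it further toward the perturbed, smoothed version.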

## Installation

MIMOSA requires **Python 3.10 or higher**.

### From PyPI (Recommended)

The easiest way to install MIMOSA is via `pip` or `uv`. This will automatically download and install all required dependencies.

```bash
# Using uv (Fastest)
uv pip install mimosa-tool

# Using pip
pip install mimosa-tool
```

### From Source

If you want to contribute to development or build the latest version from the repository, you will need a C++ compiler with **C++17 support** (e.g., GCC, Clang, or MSVC).

```bash
# Clone the repository
git clone https://github.com/ubercomrade/mimosa.git
cd mimosa

# Install in editable mode
pip install -e .
```

### Dependencies

When installing via `pip`, the following dependencies are resolved automatically:

* `numpy` (>= 2.0, < 2.4)
* `numba` (>= 0.62.0)
* `scipy` (>= 1.14.1)
* `pandas` (>= 2.2.3)
* `joblib` (>= 1.5.3)

### Build Requirements (Source only)

To build the C++ extension from source, the following tools are used:

* `scikit-build-core` (>= 0.10)
* `nanobind` (>= 2.0)

## CLI Reference

The `mimosa` tool provides four main operation modes.

### `profile` mode

Compare motifs based on pre-calculated score profiles.

**Input**: Text files with numerical scores (comma, tab, or space-separated).
**Example Data**: [`examples/scores_1.fasta`](examples/scores_1.fasta)

```bash
# in the `examples` directory
mimosa profile scores_1.fasta scores_2.fasta \
  --metric cj \
  --permutations 1000 \
  --distortion 0.5 \
  --search-range 10
```
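
The role of `--search-range` can be pictured with the following sketch, which slides one profile against the other and keeps the best-scoring offset. It is a simplified illustration under stated assumptions: the helper `best_offset` is hypothetical, and how MIMOSA actually trims overlaps or handles strands is not covered here.

```python
import numpy as np


def continuous_jaccard(v1, v2):
    denom = np.maximum(v1, v2).sum()
    return float(np.minimum(v1, v2).sum() / denom) if denom > 0 else 0.0


def best_offset(p1, p2, search_range, metric):
    """Try every shift in [-search_range, search_range]; return (score, shift)."""
    best_score, best_shift = -np.inf, 0
    for shift in range(-search_range, search_range + 1):
        # Positive shift drops the start of p1, negative drops the start of p2.
        a, b = (p1[shift:], p2) if shift >= 0 else (p1, p2[-shift:])
        n = min(len(a), len(b))
        if n == 0:
            continue
        score = metric(a[:n], b[:n])
        if score > best_score:
            best_score, best_shift = score, shift
    return best_score, best_shift


rng = np.random.default_rng(0)
base = rng.random(50)
p1, p2 = base[0:40], base[3:43]  # p2 is p1 shifted by three positions
score, shift = best_offset(p1, p2, search_range=10, metric=continuous_jaccard)
```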

**All parameters for `profile` mode**:

| Flag | Value | Comment |
| :--- | :--- | :--- |
| `profile1` | Path | Path to the first profile file (FASTA-like format). |
| `profile2` | Path | Path to the second profile file (FASTA-like format). |
| `--metric` | `cj`, `co`, `corr` | Similarity metric: Continuous Jaccard, Continuous Overlap, or Pearson Correlation (default: `cj`). |
| `--permutations` | Integer | Number of permutations for p-value calculation (default: 0). |
| `--distortion` | Float | Distortion level (0.0-1.0) for surrogate generation (default: 0.4). |
| `--search-range` | Integer | Maximum offset range to explore when aligning profiles (default: 10). |
| `--min-kernel-size` | Integer | Minimum kernel size for surrogate convolution (default: 3). |
| `--max-kernel-size` | Integer | Maximum kernel size for surrogate convolution (default: 11). |
| `--seed` | Integer | Global random seed for reproducibility. |
| `--jobs` | Integer | Number of parallel jobs (-1 uses all cores) (default: -1). |
| `-v`, `--verbose` | Flag | Enable verbose logging. |

### `motif` mode

Compare motifs by scanning sequences with models and comparing the resulting profiles.

**Input**: Motif model files (PWM: `.meme`, `.pfm`; BaMM: `.ihbcp` + `.hbcp`; SiteGA: `.mat`).
**Example Models**: [`examples/foxa2.meme`](examples/foxa2.meme), [`examples/gata4.meme`](examples/gata4.meme)

```bash
# in the `examples` directory
mimosa motif foxa2.meme gata4.meme \
  --model1-type pwm \
  --model2-type pwm \
  --fasta foreground.fa \
  --metric co \
  --permutations 1000 \
  --distortion 0.3
```

**All parameters for `motif` mode**:

| Flag | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first motif model file. |
| `model2` | Path | Path to the second motif model file. |
| `--model1-type` | `pwm`, `bamm`, `sitega` | Format of the first model (Required). |
| `--model2-type` | `pwm`, `bamm`, `sitega` | Format of the second model (Required). |
| `--fasta` | Path | FASTA file with target sequences. If omitted, random sequences are generated. |
| `--promoters` | Path | FASTA file with promoter sequences for threshold calculation. |
| `--num-sequences` | Integer | Number of random sequences to generate (default: 1000). |
| `--seq-length` | Integer | Length of random sequences (default: 200). |
| `--metric` | `cj`, `co`, `corr` | Similarity metric (default: `cj`). |
| `--permutations` | Integer | Number of permutations (default: 0). |
| `--distortion` | Float | Distortion level (default: 0.4). |
| `--search-range` | Integer | Maximum alignment offset (default: 10). |
| `--seed` | Integer | Global random seed. |
| `--jobs` | Integer | Number of parallel jobs (default: -1). |

### `motali` mode

Compare motifs by calculating Precision-Recall Curve (PRC) AUC derived from scanning sequences.

**Example Models**: [`examples/sitega_gata2.mat`](examples/sitega_gata2.mat), [`examples/gata2.meme`](examples/gata2.meme)

```bash
# in the `examples` directory
mimosa motali sitega_gata2.mat gata2.meme \
  --model1-type sitega \
  --model2-type pwm \
  --fasta foreground.fa \
  --promoters background.fa \
  --num-sequences 5000 \
  --seq-length 150
```

**All parameters for `motali` mode**:

| Flag | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first motif model file. |
| `model2` | Path | Path to the second motif model file. |
| `--model1-type` | `pwm`, `sitega` | Format of the first model (Required). |
| `--model2-type` | `pwm`, `sitega` | Format of the second model (Required). |
| `--fasta` | Path | FASTA file with target sequences. |
| `--promoters` | Path | FASTA file with promoter sequences (Required for thresholds). |
| `--num-sequences` | Integer | Number of random sequences (default: 10000). |
| `--seq-length` | Integer | Length of random sequences (default: 200). |
| `--tmp-dir` | Path | Directory for temporary files (default: `/tmp`). |

### `tomtom-like` mode

Compare motifs by direct N-dimensional matrix comparison (column-wise).
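
As a rough sketch of what column-wise comparison means for two already aligned matrices of equal width, the hypothetical helper `column_scores` below computes one score per column for each of the three metrics. How MIMOSA aggregates column scores and searches over alignments is not shown and is left as an assumption.

```python
import numpy as np


def column_scores(m1, m2, metric="pcc"):
    """Per-column similarity for two matrices of shape (rows, width)."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    if m1.shape != m2.shape:
        raise ValueError("matrices must have the same shape")
    if metric == "pcc":
        d1 = m1 - m1.mean(axis=0)
        d2 = m2 - m2.mean(axis=0)
        denom = np.sqrt((d1 ** 2).sum(axis=0) * (d2 ** 2).sum(axis=0))
        # Constant columns give a zero denominator; score them as 0.
        return (d1 * d2).sum(axis=0) / np.where(denom > 0, denom, np.inf)
    if metric == "ed":
        return np.linalg.norm(m1 - m2, axis=0)
    if metric == "cosine":
        denom = np.linalg.norm(m1, axis=0) * np.linalg.norm(m2, axis=0)
        return (m1 * m2).sum(axis=0) / np.where(denom > 0, denom, np.inf)
    raise ValueError(f"unknown metric: {metric}")


rng = np.random.default_rng(0)
pfm = rng.dirichlet(np.ones(4), size=8).T      # shape (4, 8), columns sum to 1
shuffled = rng.permutation(pfm, axis=1)        # random column-wise permutation
self_pcc = column_scores(pfm, pfm, "pcc")
null_pcc = column_scores(shuffled, pfm, "pcc")
```

Shuffling columns, as in `shuffled`, keeps each column's composition intact while breaking positional correspondence, which is the idea behind a column-permutation null.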

**Example Models**: [`examples/pif4.pfm`](examples/pif4.pfm), [`examples/pif4.meme`](examples/pif4.meme)

```bash
# in the `examples` directory
mimosa tomtom-like pif4.pfm pif4.meme \
  --model1-type pwm \
  --model2-type pwm \
  --metric cosine \
  --permutations 1000 \
  --pfm-mode \
  --num-sequences 10000 \
  --seq-length 100
```
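
The idea behind `--pfm-mode` (deriving a frequency matrix from scan results) can be sketched as follows. This is only one plausible reading: taking the single best hit per sequence, the pseudocount, and the helper name `pfm_from_best_hits` are assumptions, and MIMOSA may select sites differently.

```python
import numpy as np


def pfm_from_best_hits(sequences, scores, motif_length, pseudocount=1.0):
    """Count nucleotides at the best-scoring window of each sequence."""
    counts = np.full((4, motif_length), pseudocount, dtype=float)
    for seq, sc in zip(sequences, scores):
        if len(sc) == 0:
            continue  # sequence shorter than the motif
        start = int(np.argmax(sc))
        site = seq[start : start + motif_length]
        for pos, base in enumerate(site):
            if base < 4:  # skip ambiguous symbols (encoded as >= 4)
                counts[base, pos] += 1.0
    return counts / counts.sum(axis=0, keepdims=True)


# Toy data: plant the site ACGT (0, 1, 2, 3) and mark it with the top score.
rng = np.random.default_rng(0)
site = np.array([0, 1, 2, 3], dtype=np.int8)
sequences, scores = [], []
for _ in range(50):
    seq = rng.integers(0, 4, size=30, dtype=np.int8)
    start = int(rng.integers(0, 27))
    seq[start : start + 4] = site
    sc = np.zeros(27, dtype=np.float32)
    sc[start] = 1.0
    sequences.append(seq)
    scores.append(sc)

pfm = pfm_from_best_hits(sequences, scores, motif_length=4)
```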

**All parameters for `tomtom-like` mode**:

| Flag | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first motif model file. |
| `model2` | Path | Path to the second motif model file. |
| `--model1-type` | `pwm`, `bamm`, `sitega` | Format of the first model (Required). |
| `--model2-type` | `pwm`, `bamm`, `sitega` | Format of the second model (Required). |
| `--metric` | `pcc`, `ed`, `cosine` | Column-wise metric: Pearson Correlation, Euclidean Distance, or Cosine Similarity (default: `pcc`). |
| `--permutations` | Integer | Number of Monte Carlo permutations for p-value (default: 0). |
| `--permute-rows` | Flag | Shuffle values within columns during permutation. |
| `--pfm-mode` | Flag | Derive PFM by scanning sequences (useful for comparing different model types). |
| `--num-sequences` | Integer | Sequences for PFM mode (default: 20000). |
| `--seq-length` | Integer | Sequence length for PFM mode (default: 100). |
| `--seed` | Integer | Global random seed. |
| `--jobs` | Integer | Number of parallel jobs (default: -1). |

## Library Usage

MIMOSA is designed as an extensible framework. You can implement your own motif models by inheriting from the base abstractions provided in [`mimosa/`](mimosa/).

### Implementing a Custom Model

To create a new model, inherit from the [`MotifModel`](mimosa/models.py) class and implement the required methods. Below is a simplified example of a dinucleotide-based motif model implemented in pure Python/NumPy.

```python
import numpy as np
from mimosa.models import MotifModel, RaggedData
from mimosa.pipeline import Pipeline
from mimosa.ragged import ragged_from_list

class SimpleDinucleotideMotif(MotifModel):
    """Example of a custom motif model with dinucleotide dependencies."""
    
    def __init__(self, matrix, name, length):
        # matrix shape: (16, length - 1) representing 16 possible dinucleotides
        super().__init__(matrix=matrix, name=name, length=length)
        
    def scan(self, sequences: RaggedData, strand=None) -> RaggedData:
        """
        Scan sequences with the custom model.
        Returns RaggedData containing scores for each position.
        """
        all_scores = []
        for i in range(sequences.num_sequences):
            seq = sequences.get_slice(i)
            if len(seq) < self.length:
                all_scores.append(np.array([], dtype=np.float32))
                continue
            
            n_pos = len(seq) - self.length + 1
            scores = np.zeros(n_pos, dtype=np.float32)
            
            # Simple sliding window scoring logic
            for j in range(n_pos):
                subseq = seq[j : j + self.length]
                pos_score = 0.0
                for k in range(self.length - 1):
                    # Calculate dinucleotide index (0-15) for ACGT
                    # Assuming 0=A, 1=C, 2=G, 3=T
                    if subseq[k] < 4 and subseq[k+1] < 4:
                        dinucl_idx = int(subseq[k] * 4 + subseq[k+1])
                        pos_score += self.matrix[dinucl_idx, k]
                scores[j] = pos_score
            all_scores.append(scores)
            
        return ragged_from_list(all_scores, dtype=np.float32)

    @classmethod
    def from_file(cls, path: str, **kwargs) -> "SimpleDinucleotideMotif":
        """Load the model from a file."""
        # Implementation of your file parsing logic
        matrix = np.load(path)
        name = kwargs.get('name', 'custom_motif')
        return cls(matrix, name, length=matrix.shape[1] + 1)

    @property
    def model_type(self) -> str:
        """Unique identifier for the model type."""
        return 'dinucleotide'

    def write(self, path: str):
        """Save the model to a file."""
        np.save(path, self.matrix)

# Register the subclass to enable factory methods and CLI support
MotifModel.register_subclass('dinucleotide', SimpleDinucleotideMotif)
```
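
The nested loops in `scan` are easy to follow but slow for long sequences. As an optional illustration, the per-sequence scoring could be vectorized with NumPy; the standalone function `dinucleotide_scores` below is a hypothetical helper that assumes the same encoding (A=0, C=1, G=2, T=3, anything else >= 4) and the same `(16, length - 1)` matrix layout as the example above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def dinucleotide_scores(seq, matrix):
    """Score every window of one encoded sequence against a (16, L-1) matrix."""
    seq = np.asarray(seq, dtype=np.intp)
    n_steps = matrix.shape[1]  # number of dinucleotide steps = motif length - 1
    if len(seq) < n_steps + 1:
        return np.array([], dtype=np.float32)

    # Dinucleotide index (0-15) at each step; invalid steps are masked out.
    valid = (seq[:-1] < 4) & (seq[1:] < 4)
    di = np.where(valid, seq[:-1] * 4 + seq[1:], 0)

    # One row per window start, one column per step within the motif.
    di_win = sliding_window_view(di, n_steps)
    valid_win = sliding_window_view(valid, n_steps)
    contrib = matrix[di_win, np.arange(n_steps)]
    return (contrib * valid_win).sum(axis=1).astype(np.float32)


rng = np.random.default_rng(0)
matrix = rng.random((16, 9))               # motif length 10
seq = rng.integers(0, 5, size=40)          # includes some ambiguous symbols (4)
scores = dinucleotide_scores(seq, matrix)  # 40 - 10 + 1 = 31 window scores
```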

### Key Methods to Override

To ensure compatibility with the internal comparison [`Pipeline`](mimosa/pipeline.py), you must override the following methods:

| Method | Description |
| :--- | :--- |
| `scan(sequences, strand)` | **Required.** Performs motif scanning on a set of sequences. Must accept `RaggedData` and return `RaggedData` containing position-wise scores. |
| `from_file(path, **kwargs)` | **Required.** Class method to initialize the model from a file path. Enables the use of `MotifModel.create_from_file(path, 'type')`. |
| `model_type` | **Required.** Property returning a unique string identifier for the model class. |
| `write(path)` | **Required.** Method to serialize the model to its native format. |

### Example: Running a Comparison

```python
# 1. Prepare sequences and models
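# Encode sequences: A=0, C=1, G=2, T=3
# Sequences must be at least as long as the motif (10 here) to yield any scores
rng = np.random.default_rng(0)
seq_list = [rng.integers(0, 4, size=50, dtype=np.int8) for _ in range(20)]
sequences = ragged_from_list(seq_list, dtype=np.int8)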

# Initialize custom models
# Matrix for 16 dinucleotides across 9 positions (motif length 10)
m1 = SimpleDinucleotideMotif(np.random.rand(16, 9), "Motif_A", 10)
m2 = SimpleDinucleotideMotif(np.random.rand(16, 9), "Motif_B", 10)

# 2. Execute comparison using the Pipeline
pipeline = Pipeline()
result = pipeline.execute_motif_comparison(
    model1=m1,
    model2=m2,
    sequences=sequences,
    promoters=sequences, # Used for threshold (FPR) calculation
    comparison_type='motif',
    metric='cj',
    n_permutations=1000
)

print(f"Similarity (CJ): {result['similarity']:.4f}")
print(f"P-value: {result['p_value']:.4e}")
```
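
The example above works on integer-encoded sequences. If your input is plain DNA strings, an encoder along these lines could be used before calling `ragged_from_list`. The helper `encode_dna` is hypothetical and not part of MIMOSA; mapping every non-ACGT character to 4 is an assumption chosen to match the `< 4` check in the custom `scan` method.

```python
import numpy as np

# Lookup table: every byte maps to 4 (ambiguous) unless it is A, C, G, or T.
_LOOKUP = np.full(256, 4, dtype=np.int8)
for code, base in enumerate("ACGT"):
    _LOOKUP[ord(base)] = code
    _LOOKUP[ord(base.lower())] = code


def encode_dna(sequence: str) -> np.ndarray:
    """Encode an ASCII DNA string as int8 codes (A=0, C=1, G=2, T=3, other=4)."""
    raw = np.frombuffer(sequence.encode("ascii"), dtype=np.uint8)
    return _LOOKUP[raw]


encoded = encode_dna("ACGTNacgt")
```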

### Examples

The [`examples/`](examples/) directory contains sample data and a [script](examples/run.sh) that demonstrates the tool's capabilities via the CLI.

To run a basic comparison:

```bash
mimosa motif examples/foxa2.meme examples/gata2.meme \
  --model1-type pwm --model2-type pwm \
  --fasta examples/foreground.fa --metric cj --permutations 100
```

## Bibliography

1. Lambert, S. A., Jolma, A., Campitelli, L. F., Das, P. K., Yin, Y., Albu, M., ... & Weirauch, M. T. (2018). The human transcription factors. _Cell_, _172_(4), 650-665.

2. Wasserman, W. W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. _Nature Reviews Genetics_, _5_(4), 276-287.

3. Park, P. J. (2009). ChIP–seq: advantages and challenges of a maturing technology. _Nature reviews genetics_, _10_(10), 669-680.

4. Jolma, A., Kivioja, T., Toivonen, J., Cheng, L., Wei, G., Enge, M., Taipale, M., Vaquerizas, J. M., Yan, J., Sillanpää, M. J., Bonke, M., Palin, K., Talukder, S., Hughes, T. R., Luscombe, N. M., Ukkonen, E., & Taipale, J. (2010). Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. _Genome research_, _20_(6), 861–873. https://doi.org/10.1101/gr.100552.109

5. O'Malley, R. C., Huang, S. C., Song, L., Lewsey, M. G., Bartlett, A., Nery, J. R., Galli, M., Gallavotti, A., & Ecker, J. R. (2016). Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. _Cell_, _165_(5), 1280–1292. https://doi.org/10.1016/j.cell.2016.04.038

6. Bailey, T. L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. _Proceedings. International Conference on Intelligent Systems for Molecular Biology_, _2_, 28–36.

7. Bailey T. L. (2021). STREME: accurate and versatile sequence motif discovery. _Bioinformatics (Oxford, England)_, _37_(18), 2834–2840. https://doi.org/10.1093/bioinformatics/btab203

8. Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H., & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. _Molecular cell_, _38_(4), 576–589. https://doi.org/10.1016/j.molcel.2010.05.004

9. Grau, J., Posch, S., Grosse, I., & Keilwagen, J. (2013). A general approach for discriminative de novo motif discovery from high-throughput data. _Nucleic acids research_, _41_(21), e197. https://doi.org/10.1093/nar/gkt831

10. Eggeling, R., Grosse, I., & Grau, J. (2017). InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites. _Bioinformatics (Oxford, England)_, _33_(4), 580–582. https://doi.org/10.1093/bioinformatics/btw689

11. Siebert, M., & Söding, J. (2016). Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. _Nucleic acids research_, _44_(13), 6055–6069. https://doi.org/10.1093/nar/gkw521

12. Ge, W., Meier, M., Roth, C., & Söding, J. (2021). Bayesian Markov models improve the prediction of binding motifs beyond first order. _NAR genomics and bioinformatics_, _3_(2), lqab026. https://doi.org/10.1093/nargab/lqab026

13. Toivonen, J., Das, P. K., Taipale, J., & Ukkonen, E. (2020). MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs. _Bioinformatics (Oxford, England)_, _36_(9), 2690–2696. https://doi.org/10.1093/bioinformatics/btaa045

14. Mathelier, A., & Wasserman, W. W. (2013). The next generation of transcription factor binding site prediction. _PLoS computational biology_, _9_(9), e1003214. https://doi.org/10.1371/journal.pcbi.1003214

15. Levitsky, V. G., Ignatieva, E. V., Ananko, E. A., Turnaev, I. I., Merkulova, T. I., Kolchanov, N. A., & Hodgman, T. C. (2007). Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions. _BMC bioinformatics_, _8_, 481. https://doi.org/10.1186/1471-2105-8-481

16. Tsukanov, A. V., Mironova, V. V., & Levitsky, V. G. (2022). Motif models proposing independent and interdependent impacts of nucleotides are related to high and low affinity transcription factor binding sites in Arabidopsis. _Frontiers in plant science_, _13_, 938545. https://doi.org/10.3389/fpls.2022.938545

17. Alipanahi, B., Delong, A., Weirauch, M. T., & Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. _Nature biotechnology_, _33_(8), 831–838. https://doi.org/10.1038/nbt.3300

18. Hassanzadeh, H. R., & Wang, M. D. (2016). DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins. _Proceedings. IEEE International Conference on Bioinformatics and Biomedicine_, _2016_, 178–183. https://doi.org/10.1109/bibm.2016.7822515

19. Chen, C., Hou, J., Shi, X., Yang, H., Birchler, J. A., & Cheng, J. (2021). DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. _BMC bioinformatics_, _22_(1), 38. https://doi.org/10.1186/s12859-020-03952-1

20. Wang, K., Zeng, X., Zhou, J., Liu, F., Luan, X., & Wang, X. (2024). BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. _Briefings in bioinformatics_, _25_(3), bbae195. https://doi.org/10.1093/bib/bbae195

21. Jing Zhang, F., Zhang, S. W., & Zhang, S. (2022). Prediction of Transcription Factor Binding Sites With an Attention Augmented Convolutional Neural Network. _IEEE/ACM transactions on computational biology and bioinformatics_, _19_(6), 3614–3623. https://doi.org/10.1109/TCBB.2021.3126623

22. Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L., & Noble, W. S. (2007). Quantifying similarity between motifs. _Genome biology_, _8_(2), R24. https://doi.org/10.1186/gb-2007-8-2-r24

23. Mahony, S., & Benos, P. V. (2007). STAMP: a web tool for exploring DNA-binding motif similarities. _Nucleic acids research_, _35_(Web Server issue), W253–W258. https://doi.org/10.1093/nar/gkm272

24. Vorontsov, I. E., Kulakovskiy, I. V., & Makeev, V. J. (2013). Jaccard index based similarity measure to compare transcription factor binding site models. _Algorithms for molecular biology : AMB_, _8_(1), 23. https://doi.org/10.1186/1748-7188-8-23

25. Lambert, S. A., Albu, M., Hughes, T. R., & Najafabadi, H. S. (2016). Motif comparison based on similarity of binding affinity profiles. _Bioinformatics (Oxford, England)_, _32_(22), 3504–3506. https://doi.org/10.1093/bioinformatics/btw489
