Metadata-Version: 2.2
Name: mimosa-tool
Version: 1.1.4
Summary: Model-Independent Motif Similarity Assessment tool
Author-Email: Anton Tsukanov <tsukanov@bionet.nsc.ru>
License: MIT
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Project-URL: Homepage, https://github.com/ubercomrade/mimosa
Project-URL: Repository, https://github.com/ubercomrade/mimosa
Project-URL: Documentation, https://github.com/ubercomrade/mimosa#readme
Requires-Python: >=3.10
Requires-Dist: numpy<2.4,>=2.0
Requires-Dist: numba>=0.62.0
Requires-Dist: scipy>=1.14.1
Requires-Dist: pandas>=2.2.3
Requires-Dist: joblib>=1.5.3
Description-Content-Type: text/markdown

# MIMOSA

Model-Independent Motif Similarity Assessment (MIMOSA) is a tool designed to support comparisons across different motif model types.

## Introduction

Transcription factors (TFs) serve as fundamental regulators of gene expression levels. These proteins modulate the activity of the RNA polymerase complex by binding to specific DNA sequences located within regulatory regions, such as promoters and enhancers [1]. The specific DNA segment recognized by a TF is termed a transcription factor binding site (TFBS). TFBSs for a given TF are typically similar but not identical; therefore, they are described using *motifs* that capture the variability of the recognized sequences [2]. A variety of high-throughput experimental methods, including ChIP-seq, HT-SELEX, and DAP-seq, are currently used to identify TFBS motifs [3-5]. While motifs are most frequently represented as Position Weight Matrices (PWMs), a standard supported by widely used *de novo* motif discovery tools like MEME [6], STREME [7], and HOMER [8], the field has increasingly adopted alternative models to capture complex nucleotide dependencies. These include diverse variants of Markov Models (BaMMs, InMoDe, DIMONT, etc.) [9-14], which account for higher-order dependencies that PWMs ignore, as well as models based on locally positioned dinucleotides (SiteGA) [15-16] and deep learning architectures (DeepBind, DeeperBind, DeepGRN, etc.) [17-21].

The identification of a motif is only the first step; establishing its biological context requires robust comparison methods. Comparing motifs is essential for determining whether a newly discovered pattern represents a novel specificity or a variation of a known factor, for clustering redundant motifs identified across different experiments, and for inferring functional relationships between TFs based on binding similarity. Several established tools address this need, including Tomtom [22], STAMP [23], MACRO-APE [24] and MoSBAT [25]. These tools utilize various metrics and algorithms to quantify similarity, ranging from column-wise matrix correlations to Jaccard index-based comparisons of recognized site sets. However, a significant limitation of the current software ecosystem is its heavy reliance on matrix-based representations (PFMs or PWMs). This constraint makes it challenging to directly compare alternative models, such as Markov models or dinucleotide models, without converting them into simpler matrix formats, a process that often results in information loss.

To address these limitations, we introduce MIMOSA, a comprehensive framework designed to facilitate the comparison of diverse motif models beyond standard frequency matrices. MIMOSA exposes three comparison modes. The `profile` mode is the universal workflow: it compares TFBS recognition profiles, either from precomputed score tracks or from profiles generated by scanning sequences with motifs, conceptually similar to affinity-based approaches [25]. The `motif` mode performs direct matrix or tensor alignment for models with compatible representations and falls back to sequence-driven PFM reconstruction for heterogeneous model pairs (for example, BaMM vs PWM) [22], [26]. The `motali` mode incorporates MoTaLi ([see details](https://github.com/parthian-sterlet/motali)).

### Methodology

#### Similarity Metrics

MIMOSA implements several metrics to quantify the resemblance between motif importance profiles or matrix columns.

**Continuous Jaccard (CJ)**
The Continuous Jaccard index extends the classical Jaccard similarity to continuous-valued vectors $v_1, v_2$. It is defined as the ratio of the sum of element-wise intersections to the sum of element-wise unions:
$$\text{CJ}(v_1, v_2) = \frac{\sum_i \min(v_1^i, v_2^i)}{\sum_i \max(v_1^i, v_2^i)}$$
For non-negative profiles, the numerator and the denominator equal the intersection and union sizes of the thresholded (binarized) profiles, integrated over all possible thresholds. This makes CJ a threshold-independent measure of profile similarity.
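
The formula can be sketched in a few lines of NumPy. This is a minimal illustration that assumes non-negative profiles of equal length; the function name `continuous_jaccard` and the zero-union convention are choices of this sketch, not part of the MIMOSA API.

```python
import numpy as np


def continuous_jaccard(v1, v2):
    """Ratio of summed element-wise minima to summed element-wise maxima."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    if v1.shape != v2.shape:
        raise ValueError("profiles must have the same shape")
    union = np.maximum(v1, v2).sum()
    # Two all-zero profiles have an empty union; returning 0.0 is a
    # convention of this sketch.
    if union == 0.0:
        return 0.0
    return float(np.minimum(v1, v2).sum() / union)


a = np.array([0.0, 0.2, 0.9, 0.1])
b = np.array([0.1, 0.3, 0.6, 0.0])
print(continuous_jaccard(a, b))  # 0.8 / 1.4
```

For binary (0/1) vectors the same function reduces to the classical Jaccard index.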

**Continuous Overlap (CO)**
The Continuous Overlap coefficient (or Szymkiewicz-Simpson coefficient) measures the subset relationship between two profiles, normalizing the intersection by the smaller of the two total affinities:
$$\text{CO}(v_1, v_2) = \frac{\sum_i \min(v_1^i, v_2^i)}{\min\left(\sum_i v_1^i, \sum_i v_2^i\right)}$$
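
A minimal NumPy sketch of this coefficient, assuming non-negative profiles of equal length. The name `continuous_overlap` is illustrative and not taken from the MIMOSA API. The example shows the subset behaviour: a narrow profile fully contained in a broad one scores 1.

```python
import numpy as np


def continuous_overlap(v1, v2):
    """Summed element-wise minima divided by the smaller profile total."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    if v1.shape != v2.shape:
        raise ValueError("profiles must have the same shape")
    smaller_total = min(v1.sum(), v2.sum())
    # An all-zero profile has no affinity to overlap with; returning 0.0
    # is a convention of this sketch.
    if smaller_total == 0.0:
        return 0.0
    return float(np.minimum(v1, v2).sum() / smaller_total)


narrow = np.array([0.0, 0.5, 0.0, 0.0])
broad = np.array([0.2, 0.8, 0.4, 0.1])
print(continuous_overlap(narrow, broad))  # narrow lies entirely under broad
```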

**Pearson Correlation Coefficient (PCC)**
For linear correlation between profiles or motif columns, the PCC is calculated as:
$$\text{PCC}(v_1, v_2) = \frac{\sum_i (v_1^i - \bar{v}_1)(v_2^i - \bar{v}_2)}{\sqrt{\sum_i (v_1^i - \bar{v}_1)^2 \sum_i (v_2^i - \bar{v}_2)^2}}$$
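
The PCC formula translates directly into NumPy. The sketch below is illustrative only: `pearson` is a hypothetical name, and returning 0.0 for a constant vector is a convention chosen here rather than documented MIMOSA behaviour.

```python
import numpy as np


def pearson(v1, v2):
    """Pearson correlation between two equally sized vectors."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    d1 = v1 - v1.mean()
    d2 = v2 - v2.mean()
    denom = np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum())
    # A constant vector has zero variance, so the PCC is undefined.
    if denom == 0.0:
        return 0.0
    return float((d1 * d2).sum() / denom)


x = np.array([0.1, 0.4, 0.9, 0.2])
y = np.array([0.0, 0.5, 0.7, 0.3])
print(pearson(x, y))
```

The result should agree with `np.corrcoef(x, y)[0, 1]` up to floating-point error.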

#### Motif Matrix/Tensor Comparison

The `motif` mode follows the matrix alignment idea of Tomtom [22].  
If model representations are directly compatible matrices/tensors (same model class), MIMOSA compares them directly.  
If models are heterogeneous, MIMOSA switches to sequence-driven PFM reconstruction and then applies the same alignment/scoring logic.

When `--pfm-mode` is enabled (or model types are different), MIMOSA uses the following protocol:

1. **Best-site extraction per sequence**  
   For each input sequence $x_i$, the model score is maximized over position and strand:
   $$
   (\hat{p}_i, \hat{\sigma}_i, \hat{s}_i) =
   \arg\max_{p,\sigma \in \{+,-\}} \text{Score}_m(x_i, p, \sigma)
   $$
   This yields one best site (length $L$) and one best score $\hat{s}_i$ per sequence.

2. **Top-scoring site filtering (25%)**  
   Sites are sorted by $\hat{s}_i$ and only the strongest quartile is retained:
   $$
   K = \max\left(1, \left\lfloor 0.25N \right\rfloor\right)
   $$
   where $N$ is the number of sequences.

3. **PFM reconstruction from selected sites**  
   Let $\mathcal{I}_{\text{top}}$ be indices of retained sites. Raw counts are:
   $$
   C_{b,j} = \sum_{i \in \mathcal{I}_{\text{top}}} \mathbf{1}[w_i[j] = b], \quad b \in \{A,C,G,T\}
   $$
   where $w_i[j]$ is nucleotide at position $j$ in site $i$. Smoothed frequencies are:
   $$
   F_{b,j} = \frac{C_{b,j} + \lambda}{\sum_{b' \in \{A,C,G,T\}} \left(C_{b',j} + \lambda\right)}
   $$
   (MIMOSA uses additive smoothing before normalization).

4. **Column-wise matrix comparison with alignment**  
   For an overlap of length $L_{\delta}$ at offset $\delta$, compare column vectors
   $u_t, v_t \in \mathbb{R}^{d}$:
   $$
   \text{PCC}(u_t,v_t) =
   \frac{\sum_k (u_{k,t}-\bar{u}_t)(v_{k,t}-\bar{v}_t)}
   {\sqrt{\sum_k (u_{k,t}-\bar{u}_t)^2}\sqrt{\sum_k (v_{k,t}-\bar{v}_t)^2}}
   $$
   $$
   \text{COS}(u_t,v_t) =
   \frac{\sum_k u_{k,t}v_{k,t}}
   {\sqrt{\sum_k u_{k,t}^2}\sqrt{\sum_k v_{k,t}^2}}
   $$
   $$
   \text{ED}(u_t,v_t) = \left\lVert u_t - v_t \right\rVert_2
   $$
   Alignment scores are averaged across overlapping columns:
   $$
   S_{\text{PCC/COS}}(\delta) = \frac{1}{L_{\delta}} \sum_{t=1}^{L_{\delta}} m(u_t,v_t), \quad
   m \in \{\text{PCC}, \text{COS}\}
   $$
   $$
   S_{\text{ED}}(\delta) = -\frac{1}{L_{\delta}} \sum_{t=1}^{L_{\delta}} \text{ED}(u_t,v_t)
   $$
   (negative sign makes higher values better for all metrics).

5. **Best offset and strand orientation**  
   MIMOSA evaluates direct (`++`) and reverse-complement (`+-`) orientations and returns:
   $$
   S^* = \max_{\omega \in \{++, +-\}} \max_{\delta:\,L_{\delta}\ge \frac{1}{2}\min(L_1,L_2)}
   S(\delta,\omega)
   $$
   i.e., the best score among admissible overlaps (at least half of the shorter motif length).
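
Steps 2 and 3 can be sketched in NumPy as follows, assuming step 1 has already produced one best site and one best score per sequence. The function name `reconstruct_pfm`, the integer encoding of sites, and the default `pseudocount=1.0` (the value of $\lambda$ is not stated above) are assumptions of this sketch, not MIMOSA's API.

```python
import numpy as np


def reconstruct_pfm(sites, scores, top_fraction=0.25, pseudocount=1.0):
    """Build a smoothed PFM of shape (4, L) from the top-scoring sites.

    sites  : int array (N, L), nucleotides encoded A=0, C=1, G=2, T=3
    scores : float array (N,), best model score per sequence
    """
    sites = np.asarray(sites)
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    # K = max(1, floor(0.25 * N)) strongest sites are retained.
    k = max(1, int(np.floor(top_fraction * n)))
    top_idx = np.argsort(scores)[::-1][:k]
    top_sites = sites[top_idx]

    # Raw counts C[b, j]; codes outside 0..3 (e.g. N) are not counted.
    counts = np.zeros((4, top_sites.shape[1]), dtype=float)
    for b in range(4):
        counts[b] = (top_sites == b).sum(axis=0)

    # Additive smoothing followed by column-wise normalization.
    smoothed = counts + pseudocount
    return smoothed / smoothed.sum(axis=0, keepdims=True)


sites = np.array([[0, 1, 2], [0, 1, 2], [3, 3, 3], [1, 1, 1],
                  [2, 0, 2], [3, 0, 1], [0, 0, 0], [2, 2, 2]])
scores = np.array([9.0, 8.5, 1.0, 2.0, 3.0, 0.5, 4.0, 2.5])
pfm = reconstruct_pfm(sites, scores)  # K = 2, both retained sites are ACG
print(pfm)
```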

This design preserves the Tomtom-style matrix comparison logic [22], while enabling comparisons for heterogeneous model classes through sequence-driven PFM reconstruction.
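
The offset and orientation search of steps 4 and 5 can be illustrated with a small, unoptimized sketch. It assumes 4-row matrices with rows ordered A, C, G, T and uses only the PCC metric; `best_alignment`, `column_pcc`, and `reverse_complement` are hypothetical helper names, not MIMOSA functions.

```python
import numpy as np


def column_pcc(u, v):
    """Pearson correlation between two matrix columns."""
    du, dv = u - u.mean(), v - v.mean()
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return 0.0 if denom == 0.0 else float((du * dv).sum() / denom)


def reverse_complement(pfm):
    """Reverse positions and swap A<->T, C<->G (rows ordered A, C, G, T)."""
    return pfm[::-1, ::-1]


def best_alignment(m1, m2):
    """Return (score, offset, orientation) maximizing the mean column PCC."""
    len1, len2 = m1.shape[1], m2.shape[1]
    # Admissible overlaps cover at least half of the shorter motif.
    min_overlap = int(np.ceil(0.5 * min(len1, len2)))
    best = (-np.inf, 0, "++")
    for orientation, other in (("++", m2), ("+-", reverse_complement(m2))):
        # offset is the start of `other` relative to the start of m1.
        for offset in range(-(len2 - 1), len1):
            start1, start2 = max(0, offset), max(0, -offset)
            overlap = min(len1 - start1, len2 - start2)
            if overlap < min_overlap:
                continue
            score = float(np.mean([
                column_pcc(m1[:, start1 + t], other[:, start2 + t])
                for t in range(overlap)
            ]))
            if score > best[0]:
                best = (score, offset, orientation)
    return best


m1 = np.array([[0.5, 0.1, 0.4, 0.05, 0.25, 0.6],
               [0.3, 0.2, 0.1, 0.15, 0.25, 0.2],
               [0.1, 0.3, 0.4, 0.70, 0.40, 0.1],
               [0.1, 0.4, 0.1, 0.10, 0.10, 0.1]])
m2 = m1[:, 2:6]  # sub-motif starting at column 2 of m1
print(best_alignment(m1, m2))
```

In this toy case the best alignment is expected at offset 2 in the direct orientation, and passing `reverse_complement(m2)` instead should recover the same offset with orientation `+-`.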

#### Null Hypothesis and Surrogate Generation

To estimate the statistical significance (p-values) of observed similarity scores, MIMOSA employs a **Surrogate Null Model**.

1. **Convolutional Distortion** (for `profile` mode): surrogate profiles are built as follows:
    * **Odd kernel-size sampling**: kernel size is sampled within [`min_kernel_size`, `max_kernel_size`] from odd values.
    * **Random kernel draw**: kernel coefficients are sampled from a normal distribution and smoothed with a short filter.
    * **Identity mixing**: the random kernel is mixed with an identity (delta) kernel using the distortion coefficient `alpha` (`--distortion`), where `alpha=0` keeps identity and `alpha=1` gives fully random distortion.
    * **Optional sign flip**: the final kernel can be negated with probability 0.5.
    * **Segment-wise convolution**: each ragged sequence segment is convolved independently, then converted back to frequency space.

2. **Permutation**: for matrix-based comparisons (`motif`), the tool performs random column-wise permutations.
   For $R$ permutations, the empirical p-value is computed as:
   $$
   p = \frac{1 + \sum_{r=1}^{R} \mathbf{1}[S_r \ge S_{\text{obs}}]}{R + 1}
   $$
   where $S_{\text{obs}}$ is the observed similarity score and $S_r$ are surrogate scores.
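
The empirical p-value formula is simple enough to check by hand. The helper below is a sketch with a hypothetical name (`empirical_p_value`); it assumes that higher scores mean greater similarity, as in the formula above.

```python
import numpy as np


def empirical_p_value(observed, surrogate_scores):
    """(1 + number of surrogates >= observed) / (R + 1)."""
    surrogate_scores = np.asarray(surrogate_scores, dtype=float)
    n_exceed = int((surrogate_scores >= observed).sum())
    return (1 + n_exceed) / (len(surrogate_scores) + 1)


# One of four surrogate scores reaches the observed value: (1 + 1) / (4 + 1)
print(empirical_p_value(0.6, [0.1, 0.3, 0.5, 0.7]))
```

Because of the added one in the numerator, the smallest attainable p-value is $1/(R+1)$, so `--permutations 1000` cannot report values below roughly 0.001.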

This methodology is intended to make the null distribution reflect realistic background similarity rather than the similarity of unrelated random signals.
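
The convolutional distortion steps listed above can be sketched as follows. This is an illustration under stated assumptions, not MIMOSA's implementation: the smoothing taps `[0.25, 0.5, 0.25]`, the unit-norm scaling of the random kernel, and the function names are all choices of this sketch, and the final conversion back to frequency space is omitted.

```python
import numpy as np


def make_distortion_kernel(rng, min_kernel_size=3, max_kernel_size=11, alpha=0.4):
    """Random kernel mixed with an identity (delta) kernel."""
    odd_sizes = [k for k in range(min_kernel_size, max_kernel_size + 1) if k % 2 == 1]
    if not odd_sizes:
        raise ValueError("kernel size range must include an odd value")
    size = int(rng.choice(odd_sizes))

    # Random coefficients, smoothed with a short filter (assumed taps).
    random_kernel = np.convolve(rng.normal(size=size), [0.25, 0.5, 0.25], mode="same")
    norm = np.linalg.norm(random_kernel)
    if norm > 0.0:
        random_kernel = random_kernel / norm  # scaling is an assumption

    # alpha=0 keeps the identity kernel, alpha=1 is fully random.
    identity = np.zeros(size)
    identity[size // 2] = 1.0
    kernel = (1.0 - alpha) * identity + alpha * random_kernel

    # Optional sign flip with probability 0.5.
    if rng.random() < 0.5:
        kernel = -kernel
    return kernel


def distort_profile(profile, kernel):
    """Centered convolution that always returns len(profile) values."""
    full = np.convolve(profile, kernel, mode="full")
    start = (len(kernel) - 1) // 2
    return full[start:start + len(profile)]


rng = np.random.default_rng(42)
segments = [np.linspace(0.0, 1.0, 12), np.array([0.2, 0.9, 0.1, 0.4, 0.3])]
kernel = make_distortion_kernel(rng, alpha=0.4)
# Each ragged segment is convolved independently.
surrogates = [distort_profile(seg, kernel) for seg in segments]
print([s.shape for s in surrogates])
```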

## Installation

MIMOSA requires **Python 3.10 or higher**.

### From PyPI (Recommended)

The easiest way to install MIMOSA is via `pip` or `uv`. This will automatically download and install all required dependencies.

```bash
# Using uv (Fastest)
uv pip install mimosa-tool

# Using pip
pip install mimosa-tool
```

### From Source

If you want to contribute to development or build the latest version from the repository, you will need a C++ compiler with **C++17 support** (e.g., GCC, Clang, or MSVC).

```bash
# Clone the repository
git clone https://github.com/ubercomrade/mimosa.git
cd mimosa

# Install in editable mode
pip install -e .
```

### Dependencies

When installing via `pip`, the following dependencies are resolved automatically:

* `numpy` (>= 2.0, < 2.4)
* `numba` (>= 0.62.0)
* `scipy` (>= 1.14.1)
* `pandas` (>= 2.2.3)
* `joblib` (>= 1.5.3)

### Build Requirements (Source only)

To build the C++ extension from source, the following tools are used:

* `scikit-build-core` (>= 0.10)
* `nanobind` (>= 2.0)

## CLI Reference

The `mimosa` tool provides three operation modes.

### `profile` mode

`profile` is the universal workflow. It compares score profiles and accepts either:

- precomputed FASTA-like score files via `--model*-type scores`
- motif models (`pwm`, `bamm`, `sitega`) that are first scanned on sequences to obtain profiles

**Example data**: [`examples/scores_1.fasta`](examples/scores_1.fasta), [`examples/pif4.meme`](examples/pif4.meme)

```bash
# Compare two precomputed score profiles
mimosa profile scores_1.fasta scores_2.fasta \
  --model1-type scores \
  --model2-type scores \
  --metric cj \
  --permutations 1000

# Compare two motifs through sequence-derived profiles
mimosa profile foxa2.meme gata4.meme \
  --model1-type pwm \
  --model2-type pwm \
  --fasta foreground.fa \
  --metric co \
  --permutations 1000
```

**Parameters for `profile` mode**:

| Flag | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first input file. |
| `model2` | Path | Path to the second input file. |
| `--model1-type` | `scores`, `pwm`, `bamm`, `sitega` | Format of the first input (required). |
| `--model2-type` | `scores`, `pwm`, `bamm`, `sitega` | Format of the second input (required). |
| `--fasta` | Path | FASTA file used to scan motif inputs. If omitted when scanning is needed, random sequences are generated. |
| `--num-sequences` | Integer | Number of generated sequences for scanning mode (default: `1000`). |
| `--seq-length` | Integer | Length of generated sequences for scanning mode (default: `200`). |
| `--metric` | `cj`, `co`, `corr` | Similarity metric for profile comparison (default: `cj`). |
| `--permutations` | Integer | Number of permutations for p-value calculation (default: `0`). |
| `--distortion` | Float | Distortion level for surrogate profile generation (default: `0.4`). |
| `--search-range` | Integer | Maximum offset range explored during alignment (default: `10`). |
| `--min-kernel-size` | Integer | Minimum surrogate convolution kernel size; the range must include an odd value (default: `3`). |
| `--max-kernel-size` | Integer | Maximum surrogate convolution kernel size; the range must include an odd value (default: `11`). |
| `--seed` | Integer | Global random seed. |
| `--jobs` | Integer | Number of parallel jobs (`-1` uses all cores). |
| `-v`, `--verbose` | Flag | Enable verbose logging. |

### `motif` mode

`motif` performs direct matrix or tensor comparison. This workflow was previously named `tomtom-like`.

**Example models**: [`examples/pif4.pfm`](examples/pif4.pfm), [`examples/pif4.meme`](examples/pif4.meme)

```bash
mimosa motif pif4.pfm pif4.meme \
  --model1-type pwm \
  --model2-type pwm \
  --metric cosine \
  --permutations 1000
```

When `--pfm-mode` is enabled, or when the model types differ, MIMOSA reconstructs PFMs from sequence hits before comparison.

**Parameters for `motif` mode**:

| Flag | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first motif model file. |
| `model2` | Path | Path to the second motif model file. |
| `--model1-type` | `pwm`, `bamm`, `sitega` | Format of the first model (required). |
| `--model2-type` | `pwm`, `bamm`, `sitega` | Format of the second model (required). |
| `--fasta` | Path | Optional FASTA file for PFM reconstruction. If omitted when reconstruction is needed, random sequences are generated. |
| `--num-sequences` | Integer | Number of generated sequences for PFM reconstruction (default: `20000`). |
| `--seq-length` | Integer | Length of generated sequences for PFM reconstruction (default: `100`). |
| `--metric` | `pcc`, `ed`, `cosine` | Column-wise comparison metric (default: `pcc`). |
| `--permutations` | Integer | Number of Monte Carlo permutations (default: `0`). |
| `--permute-rows` | Flag | Shuffle matrix rows in addition to positions during permutations. |
| `--pfm-mode` | Flag | Force sequence-driven PFM reconstruction before comparison. |
| `--seed` | Integer | Global random seed. |
| `--jobs` | Integer | Number of parallel jobs (`-1` uses all cores). |
| `-v`, `--verbose` | Flag | Enable verbose logging. |

### `motali` mode

`motali` keeps the MoTaLi-based comparison workflow.

**Example models**: [`examples/sitega_gata2.mat`](examples/sitega_gata2.mat), [`examples/gata2.meme`](examples/gata2.meme)

```bash
mimosa motali sitega_gata2.mat gata2.meme \
  --model1-type sitega \
  --model2-type pwm \
  --fasta foreground.fa \
  --promoters background.fa
```

**Parameters for `motali` mode**:

| Flag | Value | Comment |
| :--- | :--- | :--- |
| `model1` | Path | Path to the first motif model file. |
| `model2` | Path | Path to the second motif model file. |
| `--model1-type` | `pwm`, `sitega` | Format of the first model (required). |
| `--model2-type` | `pwm`, `sitega` | Format of the second model (required). |
| `--fasta` | Path | FASTA file with target sequences. If omitted, random sequences are generated. |
| `--promoters` | Path | FASTA file with promoter sequences for threshold calculation. |
| `--num-sequences` | Integer | Number of generated sequences (default: `10000`). |
| `--seq-length` | Integer | Length of generated sequences (default: `200`). |
| `--tmp-dir` | Path | Directory for temporary files (default: `.`). |
| `--err` | Float | Expected recognition rate cutoff (default: `0.002`). |
| `--shift` | Integer | Maximum motif-center shift (default: `50`). |
| `-v`, `--verbose` | Flag | Enable verbose logging. |

## Library Usage

MIMOSA exposes a functional API. The core building blocks are:

- `GenericModel` (`mimosa.models`) as an immutable model container.
- `read_model(...)`, `scan_model(...)`, `get_sites(...)`, `get_pfm(...)` (`mimosa.models`) for model I/O and scanning.
- `create_comparator_config(...)` and `compare(...)` (`mimosa.comparison`) for direct strategy execution.
- `compare_motifs(...)`, `create_config(...)`, `run_comparison(...)` (`mimosa`) as high-level entry points.

### Implementing a Custom Model Type

Custom models are added through the model strategy registry (`mimosa.models.registry`), not by subclassing a base model class.

```python
import os
import joblib
import numpy as np

from mimosa.models import GenericModel
from mimosa.models import registry as model_registry
from mimosa.ragged import RaggedData, ragged_from_list


def scan_dinuc_scores(sequences: RaggedData, matrix: np.ndarray, strand: str) -> RaggedData:
    """Scan sequences with a dinucleotide matrix of shape (16, motif_length-1)."""
    motif_len = matrix.shape[1] + 1
    rc_table = np.array([3, 2, 1, 0, 4], dtype=np.int8)
    result = []

    for i in range(sequences.num_sequences):
        seq = sequences.get_slice(i)
        if strand == "-":
            seq = rc_table[seq[::-1]]

        if len(seq) < motif_len:
            result.append(np.array([], dtype=np.float32))
            continue

        n_pos = len(seq) - motif_len + 1
        scores = np.zeros(n_pos, dtype=np.float32)

        for pos in range(n_pos):
            window = seq[pos : pos + motif_len]
            score = 0.0
            for k in range(motif_len - 1):
                a = int(window[k])
                b = int(window[k + 1])
                if a < 4 and b < 4:
                    dinuc_idx = a * 4 + b
                    score += matrix[dinuc_idx, k]
            scores[pos] = score

        result.append(scores)

    return ragged_from_list(result, dtype=np.float32)


@model_registry.register("dinuc")
class DinucStrategy:
    """Example custom strategy for a dinucleotide model."""

    @staticmethod
    def scan(model: GenericModel, sequences: RaggedData, strand: str) -> RaggedData:
        representation = model.representation.astype(np.float32)
        if strand == "+":
            return scan_dinuc_scores(sequences, representation, "+")
        if strand == "-":
            return scan_dinuc_scores(sequences, representation, "-")
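        if strand == "best":
            sf = scan_dinuc_scores(sequences, representation, "+")
            sr = scan_dinuc_scores(sequences, representation, "-")
            # Element-wise maximum over both strands. This assumes that both
            # score arrays use the same positional indexing and offsets.
            return RaggedData(np.maximum(sf.data, sr.data), sf.offsets)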
        raise ValueError(f"Invalid strand mode: {strand}")

    @staticmethod
    def write(model: GenericModel, path: str) -> None:
        joblib.dump(model, path)

    @staticmethod
    def score_bounds(model: GenericModel) -> tuple[float, float]:
        # Approximation: valid for many practical cases, but not a strict bound
        # for all dependency-aware models.
        rep = model.representation
        min_score = rep.min(axis=0).sum()
        max_score = rep.max(axis=0).sum()
        return float(min_score), float(max_score)

    @staticmethod
    def load(path: str, kwargs: dict) -> GenericModel:
        if path.endswith(".pkl"):
            return joblib.load(path)
        matrix = np.load(path)  # expected shape: (16, motif_length-1)
        name = kwargs.get("name", os.path.splitext(os.path.basename(path))[0])
        length = int(matrix.shape[-1] + 1)
        return GenericModel(
            type_key="dinuc",
            name=name,
            length=length,
            representation=matrix.astype(np.float32),
            config={"kmer": 2},
        )
```

Important: this module must be imported before calling `read_model(..., "dinuc")`
or any comparison that relies on this model type. Registration happens at import time.
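
To see why the import order matters, consider a generic decorator-based registry. The sketch below is not MIMOSA's actual `mimosa.models.registry` code; `_REGISTRY`, `register`, and `get_strategy` are hypothetical names used to illustrate the pattern.

```python
_REGISTRY = {}


def register(type_key):
    """Class decorator that stores the strategy under `type_key`."""
    def decorator(cls):
        if type_key in _REGISTRY:
            raise KeyError(f"model type {type_key!r} is already registered")
        _REGISTRY[type_key] = cls
        return cls
    return decorator


def get_strategy(type_key):
    """Look up a strategy; fails if its module was never imported."""
    try:
        return _REGISTRY[type_key]
    except KeyError:
        raise KeyError(
            f"unknown model type {type_key!r}; was its module imported?"
        ) from None


# The decorator body runs when this class statement is executed,
# i.e. when the module that contains it is imported.
@register("toy")
class ToyStrategy:
    @staticmethod
    def load(path, kwargs):
        return {"type_key": "toy", "path": path}


print(get_strategy("toy").load("model.npy", {}))
```

Until the module defining the decorated class has been executed, the lookup table has no entry for its key, which is the failure mode the note above warns about.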

```python
from mimosa import compare_motifs
from mimosa.io import read_fasta
from mimosa.models import read_model

# Ensure DinucStrategy registration code above has already run in this process.
model1 = read_model("my_custom.npy", "dinuc")
model2 = read_model("examples/pif4.meme", "pwm")
sequences = read_fasta("examples/foreground.fa")

result = compare_motifs(
    model1=model1,
    model2=model2,
    strategy="profile",
    sequences=sequences,
    metric="co",
    n_permutations=100,
    seed=42,
)
print(result)
```

### Strategy Contract

A model strategy registered in `mimosa.models.registry` must provide:

| Method | Description |
| :--- | :--- |
| `scan(model, sequences, strand)` | Required. Returns `RaggedData` with positional scores. |
| `write(model, path)` | Required. Serializes model data. |
| `score_bounds(model)` | Required for threshold table generation. |
| `load(path, kwargs)` | Required. Builds and returns a `GenericModel`. |
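
Before registering a strategy, it can be useful to check that it provides all four methods from the table. The helper below is a hypothetical convenience written for this README, not a MIMOSA function.

```python
REQUIRED_METHODS = ("scan", "write", "score_bounds", "load")


def missing_methods(strategy):
    """Return the names of contract methods the strategy does not provide."""
    return [
        name for name in REQUIRED_METHODS
        if not callable(getattr(strategy, name, None))
    ]


class IncompleteStrategy:
    """Toy strategy that implements only part of the contract."""

    @staticmethod
    def scan(model, sequences, strand):
        raise NotImplementedError

    @staticmethod
    def load(path, kwargs):
        raise NotImplementedError


print(missing_methods(IncompleteStrategy))  # ['write', 'score_bounds']
```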

### Recommended: Unified Config API

```python
from mimosa import compare_motifs
from mimosa.io import read_fasta
from mimosa.models import read_model

model1 = read_model("examples/pif4.meme", "pwm")
model2 = read_model("examples/gata2.ihbcp", "bamm")
sequences = read_fasta("examples/foreground.fa")

result = compare_motifs(
    model1=model1,
    model2=model2,
    strategy="profile",  # "profile", "motif", or "motali"
    sequences=sequences,
    metric="co",
    n_permutations=100,
    seed=42,
)

print(result)
```

### Example: Direct API Comparison

```python
from mimosa.comparison import compare, create_comparator_config
from mimosa.io import read_fasta
from mimosa.models import read_model

# Load models in supported formats (pwm, bamm, sitega, scores, or custom registered type)
model1 = read_model("examples/pif4.meme", "pwm")
model2 = read_model("examples/gata2.meme", "pwm")

# Sequences are integer-encoded (A=0, C=1, G=2, T=3, N=4)
sequences = read_fasta("examples/foreground.fa")

config = create_comparator_config(
    metric="cj",
    n_permutations=100,
    seed=42,
    search_range=10,
)

result = compare(
    model1=model1,
    model2=model2,
    strategy="profile",  # "profile", "motif", or "motali"
    config=config,
    sequences=sequences,
)

print(result)
```
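
The integer encoding mentioned in the comment above (A=0, C=1, G=2, T=3, N=4) makes reverse complementation a table lookup. The helpers below are a standalone illustration; they do not reproduce `read_fasta`, whose exact return type is not shown here.

```python
import numpy as np

ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
# Complement by code: A<->T, C<->G, N stays N.
RC_TABLE = np.array([3, 2, 1, 0, 4], dtype=np.int8)


def encode(seq):
    """Map a DNA string to integer codes; unknown symbols become 4 (N)."""
    return np.array([ENCODE.get(ch, 4) for ch in seq.upper()], dtype=np.int8)


def reverse_complement(encoded):
    """Reverse the sequence, then complement each code via lookup."""
    return RC_TABLE[encoded[::-1]]


def decode(encoded):
    """Map integer codes back to a DNA string."""
    return "".join("ACGTN"[int(i)] for i in encoded)


print(decode(reverse_complement(encode("AACGN"))))  # NCGTT
```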

### Examples

The [`examples/`](examples/) directory contains sample data and scripts (`examples/run.sh`, `examples/run.ps1`) for CLI workflows.

## Bibliography

1. Lambert, S. A., Jolma, A., Campitelli, L. F., Das, P. K., Yin, Y., Albu, M., ... & Weirauch, M. T. (2018). The human transcription factors. _Cell_, _172_(4), 650-665.

2. Wasserman, W. W., & Sandelin, A. (2004). Applied bioinformatics for the identification of regulatory elements. _Nature Reviews Genetics_, _5_(4), 276-287.

3. Park, P. J. (2009). ChIP–seq: advantages and challenges of a maturing technology. _Nature reviews genetics_, _10_(10), 669-680.

4. Jolma, A., Kivioja, T., Toivonen, J., Cheng, L., Wei, G., Enge, M., Taipale, M., Vaquerizas, J. M., Yan, J., Sillanpää, M. J., Bonke, M., Palin, K., Talukder, S., Hughes, T. R., Luscombe, N. M., Ukkonen, E., & Taipale, J. (2010). Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. _Genome research_, _20_(6), 861–873. https://doi.org/10.1101/gr.100552.109

5. O'Malley, R. C., Huang, S. C., Song, L., Lewsey, M. G., Bartlett, A., Nery, J. R., Galli, M., Gallavotti, A., & Ecker, J. R. (2016). Cistrome and Epicistrome Features Shape the Regulatory DNA Landscape. _Cell_, _165_(5), 1280–1292. https://doi.org/10.1016/j.cell.2016.04.038

6. Bailey, T. L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. _Proceedings. International Conference on Intelligent Systems for Molecular Biology_, _2_, 28–36.

7. Bailey, T. L. (2021). STREME: accurate and versatile sequence motif discovery. _Bioinformatics (Oxford, England)_, _37_(18), 2834–2840. https://doi.org/10.1093/bioinformatics/btab203

8. Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H., & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. _Molecular cell_, _38_(4), 576–589. https://doi.org/10.1016/j.molcel.2010.05.004

9. Grau, J., Posch, S., Grosse, I., & Keilwagen, J. (2013). A general approach for discriminative de novo motif discovery from high-throughput data. _Nucleic acids research_, _41_(21), e197. https://doi.org/10.1093/nar/gkt831

10. Eggeling, R., Grosse, I., & Grau, J. (2017). InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites. _Bioinformatics (Oxford, England)_, _33_(4), 580–582. https://doi.org/10.1093/bioinformatics/btw689

11. Siebert, M., & Söding, J. (2016). Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences. _Nucleic acids research_, _44_(13), 6055–6069. https://doi.org/10.1093/nar/gkw521

12. Ge, W., Meier, M., Roth, C., & Söding, J. (2021). Bayesian Markov models improve the prediction of binding motifs beyond first order. _NAR genomics and bioinformatics_, _3_(2), lqab026. https://doi.org/10.1093/nargab/lqab026

13. Toivonen, J., Das, P. K., Taipale, J., & Ukkonen, E. (2020). MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs. _Bioinformatics (Oxford, England)_, _36_(9), 2690–2696. https://doi.org/10.1093/bioinformatics/btaa045

14. Mathelier, A., & Wasserman, W. W. (2013). The next generation of transcription factor binding site prediction. _PLoS computational biology_, _9_(9), e1003214. https://doi.org/10.1371/journal.pcbi.1003214

15. Levitsky, V. G., Ignatieva, E. V., Ananko, E. A., Turnaev, I. I., Merkulova, T. I., Kolchanov, N. A., & Hodgman, T. C. (2007). Effective transcription factor binding site prediction using a combination of optimization, a genetic algorithm and discriminant analysis to capture distant interactions. _BMC bioinformatics_, _8_, 481. https://doi.org/10.1186/1471-2105-8-481

16. Tsukanov, A. V., Mironova, V. V., & Levitsky, V. G. (2022). Motif models proposing independent and interdependent impacts of nucleotides are related to high and low affinity transcription factor binding sites in Arabidopsis. _Frontiers in plant science_, _13_, 938545. https://doi.org/10.3389/fpls.2022.938545

17. Alipanahi, B., Delong, A., Weirauch, M. T., & Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. _Nature biotechnology_, _33_(8), 831–838. https://doi.org/10.1038/nbt.3300

18. Hassanzadeh, H. R., & Wang, M. D. (2016). DeeperBind: Enhancing Prediction of Sequence Specificities of DNA Binding Proteins. _Proceedings. IEEE International Conference on Bioinformatics and Biomedicine_, _2016_, 178–183. https://doi.org/10.1109/bibm.2016.7822515

19. Chen, C., Hou, J., Shi, X., Yang, H., Birchler, J. A., & Cheng, J. (2021). DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. _BMC bioinformatics_, _22_(1), 38. https://doi.org/10.1186/s12859-020-03952-1

20. Wang, K., Zeng, X., Zhou, J., Liu, F., Luan, X., & Wang, X. (2024). BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. _Briefings in bioinformatics_, _25_(3), bbae195. https://doi.org/10.1093/bib/bbae195

21. Jing Zhang, F., Zhang, S. W., & Zhang, S. (2022). Prediction of Transcription Factor Binding Sites With an Attention Augmented Convolutional Neural Network. _IEEE/ACM transactions on computational biology and bioinformatics_, _19_(6), 3614–3623. https://doi.org/10.1109/TCBB.2021.3126623

22. Gupta, S., Stamatoyannopoulos, J. A., Bailey, T. L., & Noble, W. S. (2007). Quantifying similarity between motifs. _Genome biology_, _8_(2), R24. https://doi.org/10.1186/gb-2007-8-2-r24 (PMC: https://pmc.ncbi.nlm.nih.gov/articles/PMC1852410/)

23. Mahony, S., & Benos, P. V. (2007). STAMP: a web tool for exploring DNA-binding motif similarities. _Nucleic acids research_, _35_(Web Server issue), W253–W258. https://doi.org/10.1093/nar/gkm272

24. Vorontsov, I. E., Kulakovskiy, I. V., & Makeev, V. J. (2013). Jaccard index based similarity measure to compare transcription factor binding site models. _Algorithms for molecular biology : AMB_, _8_(1), 23. https://doi.org/10.1186/1748-7188-8-23

25. Lambert, S. A., Albu, M., Hughes, T. R., & Najafabadi, H. S. (2016). Motif comparison based on similarity of binding affinity profiles. _Bioinformatics (Oxford, England)_, _32_(22), 3504–3506. https://doi.org/10.1093/bioinformatics/btw489

26. van Dongen, S., & Enright, A. J. (2012). Metric distances derived from cosine similarity and Pearson and Spearman correlations. _arXiv preprint_, arXiv:1208.3145. https://doi.org/10.48550/arXiv.1208.3145
