Metadata-Version: 2.4
Name: massif
Version: 0.3.1
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: License :: OSI Approved :: CEA CNRS Inria Logiciel Libre License, version 2.1 (CeCILL-2.1)
License-File: LICENSE
Summary: Fast analysis of massive-scale data produced with MassiveFold
Author: Nessim Raouraoua
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

<div align="center">
    <img src="imgs/Massif.svg" alt="header" width="250">
    <h1></h1>
    <p>
        <strong>Fast analysis of massive-scale data produced with MassiveFold</strong>
    </p>
</div>

[![PyPI version](https://img.shields.io/pypi/v/massif.svg)](https://pypi.org/project/massif/)
[![Supported Python versions](https://img.shields.io/pypi/pyversions/massif.svg)](https://pypi.org/project/massif/)

## Introduction

Massif is a high-throughput analysis suite built to process the large structural ensembles generated by [MassiveFold](https://www.nature.com/articles/s43588-024-00714-4). It helps MassiveFold users review many predictions at once, evaluate interfaces and distances, and identify models that warrant follow-up. Instead of working through raw model folders manually, Massif gathers the metrics needed for filtering, ranking, and selecting structures in one place.

## Getting Started

Massif can be installed via `pip` as both a CLI tool and a Python library (requires a Rust toolchain).

Install the latest Massif release from PyPI with:
```bash
python -m pip install massif
```

Or, from the repository root, build the current development version, which may include new but untested features:
```bash
python -m pip install .
```

### Python-installed CLI
After `pip install`, the `massif` command is available in your environment and uses the same CLI
syntax as the Rust binary:
```bash
massif --help
massif fit <OUTPUT_DIR> <REFERENCE_PDB> <CHAIN_IDS> <STRUCTURE_DIR> <OUTPUT_CSV>
```

### Python Package

Example usage:
```python
import massif

files = massif.structure_files("path/to/structures")
distances = massif.distances(
    "path/to/structures",
    "path/to/reference.pdb",
    distance_mode="TM-score",
)
```

Notes:
- `massif.distances` writes a CSV report in the current working directory.
- Functions print progress output to stdout while running.
- `pip install` also exposes a `massif` console script that runs the Rust CLI.



## Building from source
### Prerequisites
- Rust toolchain >= 1.74 (install via [`rustup`](https://rustup.rs/))
- A directory containing the structures you want to process (PDB or mmCIF files); filenames are sorted numerically on the first `_`-separated index
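
The numeric ordering mentioned above can be pictured with a short Python sketch. This is not Massif's implementation: the helper `numeric_index`, the example filenames, and the choice to use the first `_`-separated token that is an integer are all assumptions made for illustration.

```python
from pathlib import Path

def numeric_index(path):
    """Return the first integer among the `_`-separated tokens of the file stem."""
    for token in Path(path).stem.split("_"):
        if token.isdigit():
            return int(token)
    # Assumption: files without a numeric token are placed last.
    return float("inf")

# Hypothetical filenames, used only to show the difference with a plain
# lexicographic sort (which would put "10" before "2").
names = ["ranked_10_model.pdb", "ranked_2_model.pdb", "ranked_1_model.cif"]
ordered = sorted(names, key=numeric_index)
print(ordered)
```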

### Build
```bash
cargo build --release
```

### Command Help
```bash
cargo run -- --help
```

## Usage
Massif expects its arguments in the following order:
```bash
massif <COMMAND> [COMMAND OPTIONS] <STRUCTURE_DIR> <OUTPUT_CSV> [OPTIONS]
```
- `STRUCTURE_DIR`: directory containing the input PDB/CIF files
- `OUTPUT_CSV`: base report name; data is currently written to `<OUTPUT_CSV>_alternative.*`
- `--disable-parallel`: force single-threaded execution (Rayon is enabled by default)

The `COMMAND` argument selects one of the following subcommands:

### `fit`
Align every structure against a reference chain, save aligned coordinates, and compute a distance to the reference (TM-score by default; see `METRIC`).
```bash
massif fit <OUTPUT_DIR> <REFERENCE_PDB> <CHAIN_IDS> [METRIC] [DISTANCE_CHAINS] <STRUCTURE_DIR> <OUTPUT_CSV>
```
- `OUTPUT_DIR`: folder where aligned structures are written
- `REFERENCE_PDB`: path to the reference structure used for alignment and distance computation
- `CHAIN_IDS`: concatenated chain identifiers (for example `AB` or `C`) that define the fitting anchor in both reference and target structures
- `METRIC` (optional): `TM-score` (default) or `rmsd-cur`
- `DISTANCE_CHAINS` (optional): chain group (for example `AB`) used for the post-fit distance computation; applies to both `rmsd-cur` and `TM-score`
- Output columns: `TM-score to <reference>` plus `Models`

### `contacts`
Characterise interface contacts and clashes across the ensemble.
```bash
massif contacts <OUTPUT_DIR> <STRUCTURE_DIR> <OUTPUT_CSV>
```
- Reports the number of atomic clashes per model and prints the automatic exclusion threshold (mean + 2×SD)
- Adds interface score placeholders (future integration of pTM/ipTM based scoring)
- Aligned structures are not emitted; `OUTPUT_DIR` is reserved for future extensions
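
To make the mean + 2×SD rule concrete, here is a minimal Python sketch. It is not Massif's code: the function name `clash_threshold`, the clash counts, and the use of the population standard deviation are assumptions, and Massif may use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def clash_threshold(clash_counts):
    """Exclusion threshold as mean + 2 * SD (population SD is an assumption)."""
    return mean(clash_counts) + 2 * pstdev(clash_counts)

# Made-up clash counts for ten hypothetical models; the last one is an outlier.
counts = {f"model_{i}": n for i, n in enumerate([3, 4, 5, 4, 3, 5, 4, 4, 3, 40])}

threshold = clash_threshold(list(counts.values()))
# Assumption: models strictly above the threshold are the ones to exclude.
flagged = [name for name, n in counts.items() if n > threshold]
print(round(threshold, 2), flagged)
```

Note that with very few models a single outlier inflates the standard deviation enough that it may not exceed the threshold, so the rule is most informative on large ensembles.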

### `iplddt`
Compute the mean pLDDT over residues at a user-defined interface.
```bash
massif iplddt <AGGREGATE_1> <AGGREGATE_2> <THRESHOLD> <STRUCTURE_DIR> <OUTPUT_CSV>
```
- `AGGREGATE_1` / `AGGREGATE_2`: chain groups (for example `AB` vs `C`)
- `THRESHOLD`: distance cutoff (Å) between atoms to treat residues as contacting
- Returns an `i-plddt` column per model; failures are reported as `-1`
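
The sketch below shows one way an interface pLDDT can be computed from the inputs described above. It is an illustration under stated assumptions, not Massif's implementation: the residue representation, the function `interface_plddt`, single-character chain identifiers, the inclusive cutoff, and returning `-1` when no contact is found are all hypothetical choices.

```python
from math import dist

def interface_plddt(residues, group_1, group_2, threshold):
    """Mean pLDDT over residues with any atom within `threshold` of the other group.

    residues: list of (chain_id, plddt, [(x, y, z), ...]) tuples.
    """
    side_1 = [r for r in residues if r[0] in group_1]
    side_2 = [r for r in residues if r[0] in group_2]
    interface = set()
    for i, (_, _, atoms_1) in enumerate(side_1):
        for j, (_, _, atoms_2) in enumerate(side_2):
            if any(dist(a, b) <= threshold for a in atoms_1 for b in atoms_2):
                interface.add((1, i))
                interface.add((2, j))
    if not interface:
        return -1.0  # assumption: no contact is treated like a failure
    values = [(side_1 if side == 1 else side_2)[k][1] for side, k in interface]
    return sum(values) / len(values)

# Toy data: one residue of chain A sits 3 Å from one residue of chain C.
residues = [
    ("A", 90.0, [(0.0, 0.0, 0.0)]),
    ("A", 50.0, [(20.0, 0.0, 0.0)]),
    ("C", 70.0, [(3.0, 0.0, 0.0)]),
    ("C", 30.0, [(40.0, 0.0, 0.0)]),
]
score = interface_plddt(residues, "AB", "C", 5.0)
print(score)
```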

### `cluster`
Align every structure on a reference, reduce a selected chain group to one 3D point, and assign complete-linkage clusters in the reduced space.
```bash
massif cluster <REFERENCE_PDB> <ANCHOR_CHAINS> <REDUCTION_CHAINS> <CUTOFF> <STRUCTURE_DIR> <OUTPUT_CSV> [--aligned-output-dir <OUTPUT_DIR>]
```
- `REFERENCE_PDB`: path to the reference structure used for alignment
- `ANCHOR_CHAINS`: concatenated chain identifiers used as the alignment anchor (for example `AB` or `C`)
- `REDUCTION_CHAINS`: concatenated chain identifiers whose aligned atoms are averaged into one point per model
- `CUTOFF`: complete-linkage cutoff (Å) applied to the reduced 3D points
- `--aligned-output-dir`: optional directory where the aligned reference and aligned models are written
- Output columns: `point_x`, `point_y`, `point_z`, `cluster_id`, and `Models`
- When `--aligned-output-dir` is not provided, Massif reuses cached reduced coordinates from the existing structured CSV when possible
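
The reduction and clustering steps can be sketched in plain Python as follows. This assumes the structures are already aligned and is not Massif's implementation: the helpers `centroid` and `complete_linkage`, the toy coordinates, the inclusive cutoff, and the cluster numbering are assumptions for illustration.

```python
from math import dist

def centroid(atoms):
    """Average a list of (x, y, z) atoms into one 3D point."""
    n = len(atoms)
    return tuple(sum(a[k] for a in atoms) / n for k in range(3))

def complete_linkage(points, cutoff):
    """Merge clusters while the largest pairwise distance stays within `cutoff`."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if d > cutoff:
            break
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)

    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

# Toy aligned atoms of the reduction chains for three hypothetical models.
models = [
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    [(10.0, 0.0, 0.0), (12.0, 0.0, 0.0)],
]
points = [centroid(atoms) for atoms in models]
labels = complete_linkage(points, cutoff=3.0)
print(points, labels)
```

With complete linkage, every pair of models in a cluster has reduced points no farther apart than the cutoff, which is why the third model stays on its own here.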

### `distances`
Measure minimal distances between every pair of chains and optionally retain a subset.
```bash
massif distances <FILENAME> <CHAIN_PAIRS> <STRUCTURE_DIR> <OUTPUT_CSV>
```
- `FILENAME`: reserved for future use (currently ignored)
- `CHAIN_PAIRS`: comma-separated list (for example `AB,AC,BC`); each pair becomes a CSV column
- Records minimal heavy-atom distances in Å
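
As an illustration of what each column holds, the following Python sketch computes the minimal inter-chain distance for a `CHAIN_PAIRS` string. It is a hypothetical helper, not Massif's code: `min_chain_distances` and the coordinates are invented, chain identifiers are assumed to be single characters, and the atom lists are assumed to contain heavy atoms only.

```python
from math import dist

def min_chain_distances(atoms_by_chain, chain_pairs):
    """Return {pair: minimal atom-atom distance} for pairs such as "AB,AC,BC"."""
    result = {}
    for pair in chain_pairs.split(","):
        first, second = pair[0], pair[1]
        result[pair] = min(
            dist(a, b)
            for a in atoms_by_chain[first]
            for b in atoms_by_chain[second]
        )
    return result

# Toy heavy-atom coordinates (Å) for one hypothetical model.
atoms_by_chain = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "B": [(4.0, 0.0, 0.0)],
    "C": [(1.0, 5.0, 0.0)],
}
row = min_chain_distances(atoms_by_chain, "AB,AC,BC")
print(row)
```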

### `scoring`
Placeholder for future scoring pipelines.
```bash
massif scoring <STRUCTURE_DIR> <OUTPUT_CSV>
```
- Currently returns a vector of `1.0` for each model and does not write extra columns

## Output Layout
- `<OUTPUT_CSV>_alternative.csv`: structured report with stable column ordering that merges new results with previous runs
- Aligned structures are written to the provided `OUTPUT_DIR` for `fit` and to `--aligned-output-dir` for `cluster`
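
Once a report exists, it can be post-processed with standard tools. The sketch below ranks models by one column using Python's `csv` module. The exact layout of the report is an assumption: a comma delimiter, a `Models` column, and an `i-plddt` column where `-1` marks a failure. The sample rows are invented, and `rank_by_column` is a hypothetical helper.

```python
import csv
import io

def rank_by_column(csv_file, column):
    """Return (model, value) pairs sorted from highest to lowest value."""
    scored = []
    for row in csv.DictReader(csv_file):
        raw = row.get(column)
        if raw in (None, ""):
            continue  # column not filled in for this model
        value = float(raw)
        if value == -1.0:
            continue  # assumption: -1 marks a failed computation
        scored.append((row["Models"], value))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented sample content; with a real report, pass an open file instead,
# for example open("report_alternative.csv", newline="").
sample = io.StringIO(
    "Models,i-plddt\n"
    "model_a.pdb,71.5\n"
    "model_b.pdb,-1\n"
    "model_c.pdb,88.0\n"
)
ranking = rank_by_column(sample, "i-plddt")
print(ranking)
```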

