Metadata-Version: 2.4
Name: mmore
Version: 1.2
Summary: mmore: Scalable multimodal document extraction pipeline for custom RAG integration.
Author-email: Alexandre Sallinen <alexandre.sallinen@epfl.ch>, Paul Teiletche <paul.teiletche@epfl.ch>, Marc-Antoine Allard <marc-antoine.allard@epfl.ch>, Stefan Krsteski <stefan.krsteski@epfl.ch>, David Kalajdzic <david.kalajdzic@epfl.ch>, Michael Zhang <michael.zhang@epfl.ch>, Matthias Meyer <matthias.meyer@sdsc.ethz.ch>, Fabrice Nemo <fabrice.nemo@epfl.ch>, Charlotte Meyer <charlotte.meyer@epfl.ch>, Grieder Lea <lea.grieder@epfl.ch>, Matthew Meyer <matthew.meyer@epfl.ch>, Achille Triomphe <achille.triomphe@epfl.ch>, Victor Zablocki <victor.zablocki@epfl.ch>, Adam Chahed Ouazzani <adam.chahedouazzani@epfl.ch>, Omar Ziyad Azgaoui <omar.azgaoui@epfl.ch>
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy<2,>=1.26; python_version < "3.12"
Requires-Dist: numpy>=2.0; python_version >= "3.12"
Requires-Dist: pandas>=2.1
Requires-Dist: Pillow
Requires-Dist: pydantic>=2.6
Requires-Dist: click>=8.1.7
Requires-Dist: dacite>=1.8
Requires-Dist: validators>=0.28
Requires-Dist: python-dotenv>=1.0
Requires-Dist: typing_extensions<5.0,>=4.15.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: setuptools<81
Provides-Extra: process
Requires-Dist: transformers>=4.44; extra == "process"
Requires-Dist: PyMuPDF; extra == "process"
Requires-Dist: marker-pdf>=1.6; extra == "process"
Requires-Dist: surya-ocr>=0.8.3; extra == "process"
Requires-Dist: moviepy>=2.0; extra == "process"
Requires-Dist: mammoth>=1.8; extra == "process"
Requires-Dist: markdownify>=0.12; extra == "process"
Requires-Dist: markdown>=3.5; extra == "process"
Requires-Dist: python-docx; extra == "process"
Requires-Dist: python-pptx; extra == "process"
Requires-Dist: openpyxl>=3.1; extra == "process"
Requires-Dist: requests>=2.31; extra == "process"
Requires-Dist: trafilatura>=1.12; extra == "process"
Requires-Dist: clean-text; extra == "process"
Requires-Dist: Unidecode; extra == "process"
Requires-Dist: chonkie<1,>=0.2.1; extra == "process"
Requires-Dist: langdetect>=1.0.9; extra == "process"
Requires-Dist: argostranslate; extra == "process"
Requires-Dist: langid; extra == "process"
Requires-Dist: dask[distributed]>=2025.2.0; extra == "process"
Requires-Dist: docx2pdf; extra == "process"
Requires-Dist: lxml_html_clean; extra == "process"
Requires-Dist: beautifulsoup4>=4.12; extra == "process"
Requires-Dist: xlrd>=2.0.1; extra == "process"
Requires-Dist: py7zr>=0.22; extra == "process"
Requires-Dist: rarfile>=4.1; extra == "process"
Requires-Dist: fasteners>=0.19; extra == "process"
Requires-Dist: google-auth>=2.28; extra == "process"
Requires-Dist: google-api-python-client>=2.120; extra == "process"
Requires-Dist: datatrove>=0.3; python_version < "3.12" and extra == "process"
Requires-Dist: datatrove>=0.7; python_version >= "3.12" and extra == "process"
Requires-Dist: colpali-engine>=0.3; extra == "process"
Requires-Dist: bokeh; extra == "process"
Provides-Extra: index
Requires-Dist: pymilvus[milvus-lite]==2.6.6; extra == "index"
Requires-Dist: pymilvus-model>=0.3.2; extra == "index"
Requires-Dist: milvus-model>=0.2.12; extra == "index"
Requires-Dist: langchain-milvus>=0.1.8; extra == "index"
Requires-Dist: sentence-transformers; extra == "index"
Requires-Dist: transformers>=4.44; extra == "index"
Requires-Dist: scipy>=1.8; extra == "index"
Provides-Extra: rag
Requires-Dist: mmore[index]; extra == "rag"
Requires-Dist: langchain>=0.3; extra == "rag"
Requires-Dist: langchain-anthropic>=0.3; extra == "rag"
Requires-Dist: langchain-aws>=0.2; extra == "rag"
Requires-Dist: langchain-cohere>=0.3; extra == "rag"
Requires-Dist: langchain-community>=0.3; extra == "rag"
Requires-Dist: langchain-huggingface>=0.1; extra == "rag"
Requires-Dist: langchain-mistralai>=0.2; extra == "rag"
Requires-Dist: langchain-openai>=0.3; extra == "rag"
Requires-Dist: cohere>=5.0; extra == "rag"
Requires-Dist: ragas>=0.2; extra == "rag"
Requires-Dist: datasets>=4.0; extra == "rag"
Requires-Dist: accelerate>=0.30; extra == "rag"
Requires-Dist: nltk>=3.9; extra == "rag"
Provides-Extra: api
Requires-Dist: fastapi[standard]>=0.110; extra == "api"
Requires-Dist: uvicorn>=0.29; extra == "api"
Requires-Dist: starlette>=0.36; extra == "api"
Requires-Dist: httpx>=0.27; extra == "api"
Requires-Dist: requests>=2.31; extra == "api"
Requires-Dist: pymongo>=4.6; extra == "api"
Requires-Dist: motor>=3.5; extra == "api"
Provides-Extra: all
Requires-Dist: mmore[api,process,rag]; extra == "all"
Provides-Extra: cpu
Requires-Dist: torch>=2.7.0; extra == "cpu"
Requires-Dist: torchvision; extra == "cpu"
Provides-Extra: cu126
Requires-Dist: torch>=2.7.0; extra == "cu126"
Requires-Dist: torchvision; extra == "cu126"
Provides-Extra: websearch
Requires-Dist: tavily-python>=0.3.0; extra == "websearch"
Requires-Dist: ddgs>=6.0; extra == "websearch"
Provides-Extra: dev
Requires-Dist: pytest>=8.3.4; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: pyright; extra == "dev"
Dynamic: license-file

<h1 align="center">

![image](https://raw.githubusercontent.com/swiss-ai/mmore/master/mmore_logo.jpg)

</h1>

<p align="center">
  <img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="License">
  <img src="https://img.shields.io/github/v/release/swiss-ai/mmore" alt="Release">
  <a href="https://openreview.net/forum?id=6j1HjfIdKn">
    <img src="https://img.shields.io/badge/paper-OpenReview-9cf" alt="Paper">
  </a>
</p>

#### Massive Multimodal Open RAG & Extraction

MMORE is an open-source, end-to-end pipeline to ingest, process, index, and retrieve knowledge from heterogeneous files: PDFs, Office docs, spreadsheets, emails, images, audio, video, and web pages. It standardizes content into a unified multimodal format, supports distributed CPU/GPU processing, and provides hybrid dense+sparse retrieval with an integrated RAG service (CLI, APIs). 

👉 Read the paper for more details (OpenReview): [MMORE: Massive Multimodal Open RAG & Extraction](https://openreview.net/forum?id=6j1HjfIdKn)

## :bulb: Quickstart

### Installation

#### (Step 0 – Install system dependencies)

MMORE requires several system dependencies. On Debian/Ubuntu, this snippet installs them:

```bash
sudo apt update
sudo apt install -y ffmpeg libsm6 libxext6 libnss3 \
  libxi6 libxrandr2 libxcomposite1 libxcursor1 libxdamage1 \
  libxfixes3 libxrender1 libasound2 libatk1.0-0 libgtk-3-0 libreoffice \
  libpango-1.0-0 libpangoft2-1.0-0 weasyprint
```

:warning: **On Ubuntu 24.04, replace `libasound2` with `libasound2t64`. You may also need to add the Ubuntu 20.04 (focal) repository to access some of the packages (e.g. create `/etc/apt/sources.list.d/mmore.list` with the contents `deb http://cz.archive.ubuntu.com/ubuntu focal main universe`).**

On macOS, use instead:

```bash
brew update
brew install ffmpeg gtk+3 pango cairo \
  gobject-introspection libffi pkg-config libx11 libxi \
  libxrandr libxcomposite libxcursor libxdamage libxext \
  libxrender atk libreoffice weasyprint
```

If `weasyprint` fails to find GTK or Cairo, also run:

```bash
brew install cairo pango gdk-pixbuf libffi
uv pip install weasyprint
```

#### Step 1 – Install MMORE

Dependencies are split by pipeline stage. Install only what you need:

| Extra | What it includes |
|---|---|
| `process` | mmore's processing pipeline |
| `index` | mmore's indexing pipeline |
| `rag` | mmore's RAG pipeline (includes `index`) |
| `api` | FastAPI servers |
| `all` | Everything above |
| `websearch` | Web search pipeline (DuckDuckGo + optional Tavily) |
| `cpu` | PyTorch (CPU) + torchvision, for a CPU-only setup |
| `cu126` | PyTorch (CUDA 12.6) + torchvision, for a GPU setup |

**Full install (CPU):**

```bash
uv pip install "mmore[all,cpu]"
```

**Full install (GPU — CUDA 12.6):**

```bash
uv pip install "mmore[all,cu126]"
```

**Partial install example (processing only):**

```bash
uv pip install "mmore[process,cpu]"
```
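
After installing, it can be handy to check which extras actually made it into the environment. The sketch below probes one top-level module per extra with `importlib`. The mapping from extras to import names (`marker`, `pymilvus`, `langchain`, `fastapi`, `ddgs`, `torch`) is an assumption chosen for illustration, not an official MMORE manifest, and `installed_extras` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed mapping: one representative import name per extra.
# These names are illustrative guesses, not an official mmore manifest.
EXTRA_PROBES = {
    "process": "marker",
    "index": "pymilvus",
    "rag": "langchain",
    "api": "fastapi",
    "websearch": "ddgs",
    "cpu/cu126": "torch",
}


def is_importable(module_name):
    """Return True if the top-level module can be located without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def installed_extras(probes=EXTRA_PROBES):
    """Map each extra to True/False depending on whether its probe module is found."""
    return {extra: is_importable(module) for extra, module in probes.items()}


if __name__ == "__main__":
    for extra, found in installed_extras().items():
        print(f"{extra:<10} {'found' if found else 'missing'}")
```

A `missing` entry only means the probe module was not found; it does not prove that every dependency of that extra is absent or present.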

> :warning: This package pulls in many large dependencies, so we recommend installing it with `uv` rather than plain `pip`. [Check our tutorial on uv](https://github.com/swiss-ai/mmore/blob/master/docs/uv.md).

> :warning: **Check the instructions for contributors directly at [`docs/for_devs.md`](./docs/for_devs.md)**

### Minimal Example

You can run individual parts of the pipeline with our predefined CLI commands. The examples below call the CLI through `python -m mmore`, which works even when the installation did not put the `mmore` command on your `PATH`.

```bash
# Run processing
python -m mmore process --config-file examples/process/config.yaml
python -m mmore postprocess --config-file examples/postprocessor/config.yaml --input-data examples/process/outputs/merged/merged_results.jsonl

# Run indexer
python -m mmore index --config-file examples/index/config.yaml --documents-path examples/postprocessor/outputs/merged/results.jsonl

# Run RAG
python -m mmore rag --config-file examples/rag/config.yaml
```
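
If you want to chain the stages from a script, the same commands can be driven with `subprocess`. This is a minimal sketch that reuses the example config paths from the commands above. `build_commands` and `run_pipeline` are hypothetical helpers, not part of MMORE, and the `rag` step is left out of the sketch.

```python
import subprocess
import sys

# Arguments copied from the CLI example; the paths are the example configs.
STAGES = [
    ["process", "--config-file", "examples/process/config.yaml"],
    ["postprocess", "--config-file", "examples/postprocessor/config.yaml",
     "--input-data", "examples/process/outputs/merged/merged_results.jsonl"],
    ["index", "--config-file", "examples/index/config.yaml",
     "--documents-path", "examples/postprocessor/outputs/merged/results.jsonl"],
]


def build_commands(stages=STAGES, python=sys.executable):
    """Prefix each stage with '<python> -m mmore'."""
    return [[python, "-m", "mmore", *args] for args in stages]


def run_pipeline(commands, dry_run=True):
    """Print each command and, unless dry_run is set, run it and stop on failure."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run by default so nothing is executed by accident.
    run_pipeline(build_commands(), dry_run=True)
```

With `dry_run=False`, `check=True` makes a failing stage raise `subprocess.CalledProcessError`, so later stages do not run on incomplete outputs.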

You can also use the package from Python code, as shown here:

```python
from mmore.process.processors.pdf_processor import PDFProcessor
from mmore.process.processors.base import ProcessorConfig
from mmore.type import MultimodalSample

pdf_file_paths = ["/path/to/examples/sample_data/pdf/calendar.pdf"]  # use the full path, not a relative path
out_file = "/path/to/examples/process/outputs/example.jsonl"

pdf_processor_config = ProcessorConfig(custom_config={"output_path": "examples/process/outputs"})
pdf_processor = PDFProcessor(config=pdf_processor_config)
result_pdf = pdf_processor.process_batch(pdf_file_paths, False, 1)  # args: file_paths, fast mode (True/False), num_workers

MultimodalSample.to_jsonl(out_file, result_pdf)
```
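
To inspect what was written, the output can be read back with the standard library alone. The sketch below assumes only that the file is JSON Lines with one JSON object per line. It makes no assumption about field names and simply prints the keys it finds. `read_jsonl` is a hypothetical helper, not an MMORE function.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Same placeholder path as in the example above.
    out_file = Path("/path/to/examples/process/outputs/example.jsonl")
    if out_file.exists():
        for index, sample in enumerate(read_jsonl(out_file)):
            # Assumes each line holds a JSON object (a dict in Python).
            print(index, sorted(sample.keys()))
```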

---

### Usage

To launch the MMORE pipeline, follow the specialised instructions in the docs.

![The MMORE pipelines architecture](https://github.com/user-attachments/assets/0cd61466-1680-43ed-9d55-7bd483a04a09)


1. **:page_facing_up: Input Documents**
   Upload your multimodal documents (PDFs, videos, spreadsheets, and m(m)ore) into the pipeline.

2. [**:mag: Process**](https://github.com/swiss-ai/mmore/blob/master/docs/process.md)
   Extracts and standardizes text, metadata, and multimedia content from diverse file formats. Easily extensible! You can add your own processors to handle new file types.
   *Supports fast processing for specific types.*

3. [**:file_folder: Index**](https://github.com/swiss-ai/mmore/blob/master/docs/index.md)
   Organizes extracted data into a **hybrid retrieval-ready Vector Store DB**, combining dense and sparse indexing through [Milvus](https://milvus.io/). Your vector DB can also be remotely hosted and then you only have to provide a standard API. There is also an [HTTP Index API](https://github.com/swiss-ai/mmore/blob/master/docs/index_api.md) for adding new files on the fly with HTTP requests.

4. [**:robot: RAG**](https://github.com/swiss-ai/mmore/blob/master/docs/rag.md)
   Use the indexed documents inside a **Retrieval-Augmented Generation (RAG) system** that provides a [LangChain](https://www.langchain.com/) interface. Plug in any LLM with a compatible interface or add new ones through an easy-to-use interface.
   *Supports API hosting or local inference.*

5. [**:globe_with_meridians: Web Search**](https://github.com/swiss-ai/mmore/blob/master/docs/websearch.md)
   Augments RAG answers with live web search results using an iterative sub-query loop.
   DuckDuckGo is the default provider (free, no API key needed). Tavily is available as an optional higher-quality provider.
   ```bash
   # Install web search dependencies
   pip install "mmore[rag,websearch]"

   # Optional: use Tavily instead of DuckDuckGo
   export TAVILY_API_KEY=your_key_here
   ```

6. **:tada: Evaluation**
   *Coming soon*
   An easy way to evaluate the performance of your RAG system using Ragas.

See [the `/docs` directory](https://github.com/swiss-ai/mmore/blob/master/docs) for additional details on each module and hands-on tutorials on parts of the pipeline.


#### :construction: Supported File Types

| **Category**       | **File Types**                     | **Supported Device** | **Fast Mode**      |
|--------------------|------------------------------------|----------------------|--------------------|
| **Text Documents** | DOCX, MD, PPTX, XLSX, TXT, EML     | CPU                  | :x:                |
| **PDFs**           | PDF                                | GPU/CPU              | :white_check_mark: |
| **Media Files**    | MP4, MOV, AVI, MKV, MP3, WAV, AAC  | GPU/CPU              | :white_check_mark: |
| **Web Content**    | HTML                               | CPU                  | :x:                |
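
Before launching a run, you may want to see how your files fall into these categories. The sketch below groups paths by extension using only the file types listed in the table. The short category keys and the helpers `categorize` and `group_files` are hypothetical names for illustration and do not reflect how MMORE dispatches files internally.

```python
from collections import defaultdict
from pathlib import Path

# Extensions taken from the table above; the category keys are shortened labels.
LISTED_TYPES = {
    "text": {".docx", ".md", ".pptx", ".xlsx", ".txt", ".eml"},
    "pdf": {".pdf"},
    "media": {".mp4", ".mov", ".avi", ".mkv", ".mp3", ".wav", ".aac"},
    "web": {".html"},
}


def categorize(path):
    """Return the table category for a path, or None if its extension is not listed."""
    suffix = Path(path).suffix.lower()
    for category, extensions in LISTED_TYPES.items():
        if suffix in extensions:
            return category
    return None


def group_files(paths):
    """Group paths by category; extensions not in the table go under 'unlisted'."""
    groups = defaultdict(list)
    for path in paths:
        groups[categorize(path) or "unlisted"].append(str(path))
    return dict(groups)


if __name__ == "__main__":
    print(group_files(["slides.pptx", "lecture.mp4", "paper.PDF", "notes.xyz"]))
```

Files that land under `unlisted` are simply absent from the table; check the processing docs before assuming they cannot be handled.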

## License

This project is licensed under the Apache 2.0 License; see the [LICENSE :mortar_board:](LICENSE) file for details.

## Cite MMORE

If you use MMORE in your research, please cite the paper:
```bibtex
@inproceedings{sallinenm,
  title={M (M) ORE: Massive Multimodal Open RAG \& Extraction},
  author={Sallinen, Alexandre and Krsteski, Stefan and Teiletche, Paul and Allard, Marc-Antoine and Lecoeur, Baptiste and Zhang, Michael and Nemo, Fabrice and Kalajdzic, David and Meyer, Matthias and Hartley, Mary-Anne},
  booktitle={Championing Open-source DEvelopment in ML Workshop@ ICML25}
}
```

<p align="center">
  <a href="https://www.star-history.com/#swiss-ai/mmore&Date">
     <picture>
     <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=swiss-ai/mmore&type=Date&theme=dark" />
     <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=swiss-ai/mmore&type=Date" />
     <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=swiss-ai/mmore&type=Date" />
   </picture>
  </a>
</p>
