Metadata-Version: 2.4
Name: wsistream
Version: 0.1.3
Summary: Modular online patch streaming from whole-slide images for computational pathology
Author: Ramon Kaspar
License: MIT
Project-URL: Homepage, https://github.com/RamonKaspar/wsistream
Project-URL: Documentation, https://ramonkaspar.github.io/wsistream
Project-URL: Repository, https://github.com/RamonKaspar/wsistream
Project-URL: Issues, https://github.com/RamonKaspar/wsistream/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: Pillow>=9.0
Requires-Dist: scikit-image>=0.20
Requires-Dist: opencv-python-headless>=4.7
Requires-Dist: requests>=2.28
Requires-Dist: tqdm>=4.60
Provides-Extra: openslide
Requires-Dist: openslide-python>=1.2; extra == "openslide"
Requires-Dist: openslide-bin>=4.0; extra == "openslide"
Provides-Extra: tiffslide
Requires-Dist: tiffslide>=2.0; extra == "tiffslide"
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Provides-Extra: all
Requires-Dist: openslide-python>=1.2; extra == "all"
Requires-Dist: openslide-bin>=4.0; extra == "all"
Requires-Dist: tiffslide>=2.0; extra == "all"
Requires-Dist: albumentations>=1.3; extra == "all"
Requires-Dist: matplotlib>=3.7; extra == "all"
Requires-Dist: torch>=2.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: mkdocs-material>=9.0; extra == "dev"
Requires-Dist: openslide-python>=1.2; extra == "dev"
Requires-Dist: openslide-bin>=4.0; extra == "dev"
Requires-Dist: tiffslide>=2.0; extra == "dev"
Requires-Dist: albumentations>=1.3; extra == "dev"
Requires-Dist: matplotlib>=3.7; extra == "dev"
Requires-Dist: torch>=2.0; extra == "dev"
Dynamic: license-file

<p align="center">
    <img src="https://github.com/RamonKaspar/wsistream/blob/main/docs/assets/logo.svg" alt="wsistream" width="450">
</p>

<p align="center">
    <em>Modular online patch streaming from whole-slide images for computational pathology.</em>
</p>

<p align="center">
    <a href="https://pypi.org/project/wsistream/"><img alt="PyPI" src="https://img.shields.io/pypi/v/wsistream"></a>
    <a href="https://pypi.org/project/wsistream/"><img alt="Python" src="https://img.shields.io/pypi/pyversions/wsistream"></a>
    <a href="https://github.com/RamonKaspar/wsistream/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/RamonKaspar/wsistream"></a>
    <a href="https://ramonkaspar.github.io/wsistream"><img alt="Docs" src="https://img.shields.io/badge/docs-GitHub%20Pages-blue"></a>
</p>

Stream patches directly from whole-slide images (WSIs) during training, with no disk pre-extraction and no storage overhead. Every component is pluggable: backends, tissue detectors, samplers, filters, transforms, and dataset adapters.

## Install

```bash
pip install "wsistream[openslide]"   # with OpenSlide
pip install "wsistream[tiffslide]"   # with TiffSlide (pure Python)
pip install "wsistream[torch]"       # add PyTorch integration (WsiStreamDataset, DDP)
pip install "wsistream[all]"         # everything (OpenSlide + TiffSlide + PyTorch + albumentations + matplotlib)
```
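
To see which optional dependencies are importable in the current environment, a small standard-library check along these lines may help. This is only a sketch: `available_extras` is a made-up helper, not part of wsistream, and the module names are simply the import names of the packages behind each extra.

```python
from importlib.util import find_spec

# Import names of the optional dependencies (not wsistream API).
OPTIONAL_MODULES = {
    "openslide": "openslide",  # pulled in by wsistream[openslide]
    "tiffslide": "tiffslide",  # pulled in by wsistream[tiffslide]
    "torch": "torch",          # pulled in by wsistream[torch]
}


def available_extras() -> dict[str, bool]:
    """Report which optional modules can be located, without importing them."""
    return {
        extra: find_spec(module) is not None
        for extra, module in OPTIONAL_MODULES.items()
    }


for extra, found in available_extras().items():
    print(f"{extra:<10} {'installed' if found else 'missing'}")
```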

For development:

```bash
git clone https://github.com/RamonKaspar/wsistream.git
cd wsistream
pip install -e ".[dev]"
```

## Documentation

Full documentation: [ramonkaspar.github.io/wsistream](https://ramonkaspar.github.io/wsistream)

To build locally:

```bash
pip install mkdocs-material
mkdocs serve          # local preview at http://127.0.0.1:8000
```

## How it works

Each slide goes through a fixed pipeline:

1. **Open slide**: via an explicit backend (`OpenSlideBackend` or `TiffSlideBackend`)
2. **Detect tissue**: run a `TissueDetector` on a low-res thumbnail to get a binary mask
3. **Sample coordinates**: a `PatchSampler` proposes (x, y) locations within tissue regions
4. **Extract patch**: read the pixel data from the slide at each coordinate
5. **Filter patch**: a `PatchFilter` accepts or rejects the tile based on its pixels
6. **Transform patch**: apply augmentations (`HEDColorAugmentation`, `RandomFlipRotate`, etc.)
7. **Yield result**: `PatchResult` with image, coordinates, tissue fraction, and metadata
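
The steps above can be sketched end to end with NumPy stand-ins. Everything in this block is hypothetical illustration: `make_fake_slide`, `tissue_mask`, and `stream_patches` are invented names, an in-memory array replaces a real backend, and a brightness threshold replaces a real tissue detector. It does not reflect how wsistream implements any of these steps.

```python
import numpy as np


def make_fake_slide(size: int = 512) -> np.ndarray:
    """Step 1 stand-in: a white RGB 'slide' with one darker square of 'tissue'."""
    slide = np.full((size, size, 3), 255, dtype=np.uint8)
    slide[size // 4 : 3 * size // 4, size // 4 : 3 * size // 4] = (180, 100, 160)
    return slide


def tissue_mask(rgb: np.ndarray, threshold: float = 220.0) -> np.ndarray:
    """Crude detector: a pixel counts as tissue if it is darker than the background."""
    return rgb.mean(axis=-1) < threshold


def stream_patches(slide, patch_size=64, scale=16, num_patches=5,
                   min_tissue=0.6, max_tries=1000, seed=0):
    """Yield up to num_patches dicts, giving up after max_tries proposals."""
    rng = np.random.default_rng(seed)
    height, width = slide.shape[:2]
    mask = tissue_mask(slide[::scale, ::scale])          # step 2: mask on a thumbnail
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return                                           # no tissue, nothing to yield
    yielded = 0
    for _ in range(max_tries):
        if yielded == num_patches:
            break
        i = rng.integers(len(ys))                        # step 3: pick a tissue pixel
        x = min(int(xs[i]) * scale, width - patch_size)  # map to slide coordinates
        y = min(int(ys[i]) * scale, height - patch_size)
        patch = slide[y : y + patch_size, x : x + patch_size]  # step 4: read pixels
        fraction = float(tissue_mask(patch).mean())      # step 5: filter on pixels
        if fraction < min_tissue:
            continue
        if rng.random() < 0.5:                           # step 6: random horizontal flip
            patch = patch[:, ::-1]
        yielded += 1
        yield {"image": patch, "x": x, "y": y, "tissue_fraction": fraction}  # step 7


for result in stream_patches(make_fake_slide(), num_patches=3):
    print(result["x"], result["y"], result["image"].shape, result["tissue_fraction"])
```

Patches sampled near the edge of the dark square overlap the white background, so some proposals are rejected by the filter in step 5. This is why sampling and filtering are separate stages.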

## Quick start

```python
from wsistream.pipeline import PatchPipeline
from wsistream.backends import OpenSlideBackend
from wsistream.tissue import CLAMTissueDetector
from wsistream.sampling import RandomSampler
from wsistream.filters import HSVPatchFilter
from wsistream.transforms import ComposeTransforms, HEDColorAugmentation, RandomFlipRotate, ResizeTransform
from wsistream.datasets import TCGAAdapter

pipeline = PatchPipeline(
    slide_paths="/data/tcga",  # directory or list of files
    backend=OpenSlideBackend(),
    tissue_detector=CLAMTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
    patch_filter=HSVPatchFilter(min_pixel_fraction=0.6),
    transforms=ComposeTransforms(transforms=[
        HEDColorAugmentation(sigma=0.05),
        RandomFlipRotate(),
        ResizeTransform(target_size=224),
    ]),
    dataset_adapter=TCGAAdapter(),
    pool_size=8,
    patches_per_slide=100,
    cycle=True,
)

for result in pipeline:
    print(result.image.shape)                # (224, 224, 3) uint8
    print(result.coordinate.mpp)             # ~0.5
    print(result.tissue_fraction)            # 0.87
    print(result.slide_metadata.patient_id)  # TCGA-3L-AA1B
```

## Pool-based slide interleaving

The pipeline keeps `pool_size` slides open at the same time and visits them round-robin. Each slide yields `patches_per_slide` patches in total before it is closed and the next one is opened. With `cycle=True`, closed slides are re-queued, so the stream is infinite. `patches_per_visit` (default 1) sets how many patches are read from a slide on each visit before moving on to the next one; larger values can significantly improve I/O throughput on network filesystems.
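
The scheduling described above can be modeled in a few lines of plain Python. This is a sketch of the idea only, under the assumption that slides are visited in strict round-robin order; `interleave` is a hypothetical function, not wsistream's scheduler, and it yields `(slide, patch_index)` pairs instead of real patches.

```python
from collections import deque
from itertools import islice


def interleave(slides, pool_size, patches_per_slide, patches_per_visit=1, cycle=False):
    """Yield (slide, patch_index) pairs in pool-based round-robin order."""
    if min(pool_size, patches_per_slide, patches_per_visit) < 1:
        raise ValueError("pool_size, patches_per_slide and patches_per_visit must be >= 1")
    queue = deque(slides)  # slides waiting to be opened
    pool = deque()         # [slide, patches_taken] for slides that are open
    while queue or pool:
        # Open slides until the pool is full or nothing is waiting.
        while queue and len(pool) < pool_size:
            pool.append([queue.popleft(), 0])
        slide, taken = pool.popleft()
        # One visit: take up to patches_per_visit patches from this slide.
        for _ in range(min(patches_per_visit, patches_per_slide - taken)):
            yield slide, taken
            taken += 1
        if taken < patches_per_slide:
            pool.append([slide, taken])  # still open, back of the rotation
        elif cycle:
            queue.append(slide)          # closed, re-queued for another pass


order = [s for s, _ in interleave(["A", "B", "C"], pool_size=2, patches_per_slide=2)]
print(order)  # ['A', 'B', 'A', 'B', 'C', 'C']

endless = interleave(["A", "B"], pool_size=1, patches_per_slide=1, cycle=True)
print([s for s, _ in islice(endless, 5)])  # ['A', 'B', 'A', 'B', 'A']
```

With `patches_per_visit=2`, the first call would produce `A, A, B, B, C, C` instead: the same patches per slide, but with fewer switches between open slides.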

## PyTorch integration

`wsistream.torch` provides `WsiStreamDataset` (an `IterableDataset`), `MonitoredLoader` for throughput tracking, and `partition_slides_by_rank` for DDP. Worker-level slide partitioning is handled automatically.

```python
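import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from wsistream.backends import OpenSlideBackend
from wsistream.sampling import RandomSampler
from wsistream.tissue import OtsuTissueDetector
from wsistream.torch import WsiStreamDataset, partition_slides_by_rank

# Placeholders for values a training script normally provides.
distributed = dist.is_available() and dist.is_initialized()
rank = dist.get_rank() if distributed else 0
world_size = dist.get_world_size() if distributed else 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
total_steps = 1_000  # example value

# Keep only this rank's share of the slides.
my_slides = partition_slides_by_rank("/data/tcga", rank=rank, world_size=world_size)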

dataset = WsiStreamDataset(
    slide_paths=my_slides,
    backend=OpenSlideBackend(),
    tissue_detector=OtsuTissueDetector(),
    sampler=RandomSampler(patch_size=256, num_patches=-1, target_mpp=0.5),
)

loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)
loader_iter = iter(loader)

for step in range(total_steps):
    batch = next(loader_iter)
    images = batch["image"].to(device, non_blocking=True)  # (B, 3, H, W) float32
```
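
To illustrate what rank-level and worker-level partitioning mean, here is a generic strided split in plain Python. It is an assumption for illustration only: this README does not document how `partition_slides_by_rank` or the automatic worker split actually divide slides, and `shard` and `slides_for_worker` are hypothetical names.

```python
def shard(items, index, count):
    """Return every count-th item, starting at position index."""
    if not 0 <= index < count:
        raise ValueError(f"index {index} out of range for {count} shards")
    return list(items)[index::count]


def slides_for_worker(slide_paths, rank, world_size, worker_id, num_workers):
    """Split by DDP rank first, then by DataLoader worker within that rank."""
    per_rank = shard(sorted(slide_paths), rank, world_size)  # sort for a stable order
    return shard(per_rank, worker_id, num_workers)


slides = [f"slide_{i}.svs" for i in range(8)]
for rank in range(2):
    for worker_id in range(2):
        print(rank, worker_id, slides_for_worker(slides, rank, 2, worker_id, 2))
```

Sorting before splitting gives every process the same view of the slide list, so the shards are disjoint and together cover all slides. With more processes than slides, some shards end up empty.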
