Metadata-Version: 2.4
Name: parseimagenet
Version: 1.5.0
Summary: Extract ImageNet image paths by category keywords
Author-email: Reed Turgeon <turgeon.dev@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/MrT3313/Parse-ImageNet
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Scientific/Engineering :: Image Processing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: jupyter; extra == "dev"
Requires-Dist: ipykernel; extra == "dev"
Requires-Dist: pytest>=7.0; extra == "dev"
Dynamic: license-file

# ParseImageNet

Extract image file paths from ImageNet by matching category keywords. Useful for creating custom subsets of ImageNet for training or evaluation.

[![PyPI Version](https://img.shields.io/pypi/v/parseimagenet)](https://pypi.org/project/parseimagenet/)
[![Python Version](https://img.shields.io/pypi/pyversions/parseimagenet)](https://pypi.org/project/parseimagenet/)
[![License](https://img.shields.io/github/license/MrT3313/Parse-ImageNet)](https://github.com/MrT3313/Parse-ImageNet/blob/main/LICENSE)
[![Downloads](https://img.shields.io/pypi/dm/parseimagenet)](https://pypi.org/project/parseimagenet/)

## [Kaggle Competition Dataset](https://www.kaggle.com/competitions/imagenet-object-localization-challenge/data)

## Prerequisites

- Python 3.8+
- ImageNet dataset (or a subset) with the standard ILSVRC directory structure:
  ```
  ImageNet-Subset/
  ├── LOC_synset_mapping.txt
  ├── LOC_val_solution.csv
  └── ILSVRC/
      ├── ImageSets/
      │   └── CLS-LOC/
      │       ├── train_cls.txt
      │       └── val.txt
      └── Data/
          └── CLS-LOC/
              ├── train/
              │   ├── n01440764/
              │   │   ├── n01440764_10026.JPEG
              │   │   └── ...
              │   └── ...
              └── val/
                  ├── ILSVRC2012_val_00000001.JPEG
                  └── ...
  ```
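
Before calling the library, it can help to confirm that the dataset root matches the tree above. The following is a minimal sketch using only the standard library; `EXPECTED_ENTRIES` and `find_missing_entries` are hypothetical names introduced here for illustration and are not part of `parseimagenet`.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_ENTRIES = [
    "LOC_synset_mapping.txt",
    "LOC_val_solution.csv",
    "ILSVRC/ImageSets/CLS-LOC/train_cls.txt",
    "ILSVRC/ImageSets/CLS-LOC/val.txt",
    "ILSVRC/Data/CLS-LOC/train",
    "ILSVRC/Data/CLS-LOC/val",
]


def find_missing_entries(base_path):
    """Return the expected relative paths that do not exist under base_path."""
    base_path = Path(base_path)
    return [entry for entry in EXPECTED_ENTRIES if not (base_path / entry).exists()]
```

An empty list from `find_missing_entries('/path/to/your/ImageNet-Subset')` suggests the layout is in place; any returned entries name the files or folders that still need to be added.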

## Installation

```bash
pip install parseimagenet
```

For local development:

```bash
git clone https://github.com/MrT3313/Parse-ImageNet.git
# Editable install from the directory created by the clone
pip install -e ./Parse-ImageNet
# or, from another location: pip install -e /path/to/Parse-ImageNet
```

## Usage

> [!NOTE]
>
> [Example Notebook](/DOCS/ExampleNotebook.ipynb)

### Params

| Parameter    | Type              | Default   | Alternatives                                                          | Description                                            |
|--------------|-------------------|-----------|-----------------------------------------------------------------------|--------------------------------------------------------|
| `base_path`  | `Path`            | -         | Any valid directory path                                              | Root path to the ImageNet dataset                      |
| `preset`     | `str` or `None`   | `None`    | `"birds"`, `"dogs"`, ... via `get_available_presets()`                | Predefined keyword list. `None` selects all categories |
| `keywords`   | `list` or `None`  | `None`    | Any list of strings                                                   | Custom keyword list. Overrides `preset` when provided  |
| `num_images` | `int`             | `200`     | Any positive integer                                                  | Max images to return (capped by availability)          |
| `source`     | `str`             | `"train"` | `"val"`                                                               | Data split to sample from                              |
| `silent`     | `bool`            | `True`    | `False`                                                               | Suppresses print output when enabled                   |

### Base Example

```python
from pathlib import Path
from parseimagenet import get_image_paths_by_keywords

# Set the path to your ImageNet directory
base_path = Path('/path/to/your/ImageNet-Subset')
# ex: /Users/mrt/Documents/MrT/code/computer-vision/image-bank/ImageNet-Subset

# Default: no preset, selects from all categories
image_paths = get_image_paths_by_keywords(base_path=base_path)

# image_paths is a list of Path objects
print(f"Found {len(image_paths)} images")
print(image_paths[:5])
```
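
Since the function returns a list of `Path` objects, one way to turn the result into a standalone subset is to copy those files into a new folder. This is a sketch of that idea, not a feature of the library; `copy_images` is a hypothetical helper, and it assumes file names are unique across the selected images.

```python
import shutil
from pathlib import Path


def copy_images(image_paths, destination):
    """Copy each image into destination and return the new paths."""
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in image_paths:
        # Files land in one flat folder, so names are assumed not to collide.
        target = destination / Path(src).name
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

For example, `copy_images(image_paths, 'my-subset')` would place the selected images in a `my-subset` folder next to the current working directory.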

### Using Presets

> [!NOTE]
>
> Presets are predefined keyword lists for common categories:

```python
from parseimagenet import get_image_paths_by_keywords # main function
from parseimagenet import get_available_presets, KEYWORD_PRESETS # helpers

# See available presets
print(get_available_presets())  # ['birds', 'dogs', 'wild_canids', 'snakes']

# Access preset keywords directly
print(KEYWORD_PRESETS["birds"])

# Use a specific preset (base_path is the ImageNet root from the Base Example)
image_paths = get_image_paths_by_keywords(
    base_path=base_path,
    preset="birds",
    num_images=200
)
```

### Using Keywords

> [!NOTE]
> 
> Custom keywords override the preset:

> [!IMPORTANT]
>
> All applicable category keywords are listed in the `LOC_synset_mapping.txt` file.

```python
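from parseimagenet import get_image_paths_by_keywords

# base_path is the ImageNet root defined in the Base Example
image_paths = get_image_paths_by_keywords(
    base_path=base_path,
    keywords=['dog', 'puppy', 'hound'],
    num_images=100
)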
```
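
To preview which categories a keyword list might touch, `LOC_synset_mapping.txt` can be searched directly. The sketch below assumes each line has the form `<synset id> <comma-separated names>` (for example `n01440764 tench, Tinca tinca`) and uses case-insensitive substring matching. Both the matching rule and the `find_matching_synsets` helper are assumptions for illustration; the library's own matching may differ.

```python
from pathlib import Path


def find_matching_synsets(mapping_file, keywords):
    """Return {synset_id: description} for lines whose description contains any keyword."""
    lowered = [kw.lower() for kw in keywords]
    matches = {}
    for line in Path(mapping_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed format: the synset ID, one space, then the category names.
        synset_id, _, description = line.partition(" ")
        if any(kw in description.lower() for kw in lowered):
            matches[synset_id] = description
    return matches
```

Substring matching is deliberately loose: a keyword such as `'dog'` would also match a category like `hot dog`, so it is worth reviewing the returned descriptions before settling on a keyword list.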

### Using Sources

By default, images are sourced from the training set. Use `source="val"` to pull from the validation set instead:

> [!IMPORTANT]
> 
> Sampling from the test split is not supported because the [Kaggle Competition Dataset](https://www.kaggle.com/competitions/imagenet-object-localization-challenge/data) does not provide ground-truth labels for the test data.

```python
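from parseimagenet import get_image_paths_by_keywords

# base_path is the ImageNet root defined in the Base Example
image_paths = get_image_paths_by_keywords(
    base_path=base_path,
    preset="birds",
    num_images=100,
    source="val"
)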
```
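
Unlike the training images, validation images sit in a single flat folder, so their categories have to come from `LOC_val_solution.csv`. The sketch below shows one way to read that file. It assumes the Kaggle layout with `ImageId` and `PredictionString` columns, where `PredictionString` starts with the synset ID followed by bounding-box coordinates. `load_val_labels` is a hypothetical helper and not necessarily how the library resolves validation labels.

```python
import csv
from pathlib import Path


def load_val_labels(solution_csv):
    """Map each validation image ID to the synset ID of its first annotated box."""
    labels = {}
    with Path(solution_csv).open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumed format: repeated groups of "<synset id> <xmin> <ymin> <xmax> <ymax>".
            parts = row["PredictionString"].split()
            if parts:
                labels[row["ImageId"]] = parts[0]
    return labels
```

The returned image IDs omit the file extension, so a lookup for `ILSVRC2012_val_00000001.JPEG` would use the file's stem as the key.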

### Command Line

```bash
# Use default preset (birds)
python -m parseimagenet.ParseImageNetSubset --base_path /path/to/ImageNet-Subset

# Use a specific preset
python -m parseimagenet.ParseImageNetSubset --base_path /path/to/ImageNet-Subset --preset birds --num_images 100

# Use custom keywords (overrides preset)
python -m parseimagenet.ParseImageNetSubset --base_path /path/to/ImageNet-Subset --keywords "dog, puppy" --num_images 100

# Use validation data instead of training data
python -m parseimagenet.ParseImageNetSubset --base_path /path/to/ImageNet-Subset --preset birds --source val --num_images 100
```
