Metadata-Version: 2.4
Name: geocif
Version: 0.4.399
Summary: Models to visualize and forecast crop conditions and yields
Author-email: Ritvik Sahajpal <ritvik@umd.edu>
License: MIT
Project-URL: Homepage, https://ritviksahajpal.github.io/yield_forecasting/
Keywords: geocif
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: boruta>=0.4.3
Requires-Dist: catboost>=1.2.8
Requires-Dist: fiona
Requires-Dist: gdal==3.10.2; sys_platform == "win32"
Requires-Dist: gdal==3.11.0; sys_platform != "win32"
Requires-Dist: pyeogpr>=2.4.7
Requires-Dist: pyproj
Requires-Dist: rasterio
Requires-Dist: rtree
Requires-Dist: shap>=0.48.0
Requires-Dist: shapely
Requires-Dist: optuna
Requires-Dist: xarray>=2026.2.0
Requires-Dist: pooch>=1.8.0
Requires-Dist: arrow>=1.4.0
Requires-Dist: icclim>=7.0.4
Requires-Dist: geoprepare>=0.6.129
Requires-Dist: logzero>=1.7.0
Requires-Dist: geopandas>=1.1.2
Requires-Dist: tabpfn>=6.4.1
Requires-Dist: tabicl>=2.0.2
Requires-Dist: statsmodels>=0.14.6
Requires-Dist: palettable>=3.3.3
Requires-Dist: seaborn>=0.13.2
Requires-Dist: scikit-misc>=0.5.2
Requires-Dist: setuptools<81
Requires-Dist: choix>=0.3.4
Requires-Dist: scienceplots>=2.0.0
Provides-Extra: dashboard
Requires-Dist: panel>=1.4.0; extra == "dashboard"
Requires-Dist: hvplot>=0.10.0; extra == "dashboard"
Dynamic: license-file

# geocif

[![image](https://img.shields.io/pypi/v/geocif.svg)](https://pypi.python.org/pypi/geocif)

**Models to visualize and forecast crop conditions and yields**

Generate Climatic Impact-Drivers (CIDs) from Earth Observation (EO) data, build ML yield forecasting models, and produce agrometeorological (agmet) condition-monitoring plots.

[Climatic Impact-Drivers for Crop Yield Assessment at NASA Harvest](https://www.loom.com/share/5c2dc62356c6406193cd9d9725c2a6a9)

-   Free software: MIT license
-   Documentation: https://ritviksahajpal.github.io/yield_forecasting/


## Setup

### Requirements

- Python 3.11+
- [uv](https://docs.astral.sh/uv/getting-started/installation/)

### Install

```bash
cd geocif                   # project root (where pyproject.toml lives)
uv sync                     # creates .venv and installs all dependencies
```

On **Windows**, uv automatically pulls pre-built geospatial wheels (GDAL, rasterio, fiona, shapely, pyproj, rtree) from the URLs listed under `[tool.uv.sources]` in `pyproject.toml`. On **Linux/macOS**, those entries are skipped by their platform markers, and the packages are installed from PyPI.

To activate the environment:

```bash
# Windows
.venv\Scripts\activate

# Linux/macOS
source .venv/bin/activate
```

### Fresh reinstall

```bash
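# POSIX shell (Linux/macOS, Git Bash); on Windows, delete the .venv folder manually, then run uv sync
rm -rf .venv && uv sync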
```

## Config files

| File | Purpose | Used by |
|------|---------|---------|
| [`geobase.txt`](#geobasetxt) | Paths, shapefile column mappings | both |
| [`countries.txt`](#countriestxt) | Per-country config (boundary files, admin levels, seasons, crops) | both |
| [`crops.txt`](#cropstxt) | Crop masks, calendar categories (EWCM, AMIS) | both |
| [`geoextract.txt`](#geoextracttxt) | Extraction-only settings (method, threshold, parallelism) | geoprepare |
| [`geocif.txt`](#geociftxt) | Indices/ML/agmet settings, country overrides, runtime selections | geocif |

## Usage

**Order matters:** Config files are loaded left-to-right. When the same key appears in multiple files, the last file wins. The tool-specific file (`geoextract.txt` or `geocif.txt`) must be last so its `[DEFAULT]` values (countries, method, etc.) override the shared defaults in `countries.txt`.

```python
config_dir = "/path/to/config"  # full path to your config directory

cfg_geoprepare = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geoextract.txt"]
cfg_geocif = [f"{config_dir}/geobase.txt", f"{config_dir}/countries.txt", f"{config_dir}/crops.txt", f"{config_dir}/geocif.txt"]
```
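
The precedence rule can be illustrated with Python's standard `configparser`, which also lets later sources override earlier ones. This is only a sketch: it assumes the geocif loader behaves like `configparser`, and the inline strings are shortened stand-ins for `countries.txt` and `geocif.txt`, not the real files.

```python
import configparser

# Stand-ins for two config files, read in the same order as cfg_geocif.
countries_txt = """
[DEFAULT]
crops = ['maize']
admin_level = admin_1

[kenya]
seasons = [1, 2]
"""

geocif_txt = """
[DEFAULT]
crops = ['winter_wheat']
"""

parser = configparser.ConfigParser()
for text in (countries_txt, geocif_txt):  # left-to-right; later sources win
    parser.read_string(text)

print(parser["kenya"]["crops"])        # default overridden by the last file
print(parser["kenya"]["admin_level"])  # default from the first file, untouched
```

In `configparser`, a key set explicitly inside a country section still takes precedence over any `[DEFAULT]` value, regardless of file order.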

### geoprepare (download, extract, merge)

```python
from geoprepare import geodownload
geodownload.run([f"{config_dir}/geobase.txt"])

from geoprepare import geoextract
geoextract.run(cfg_geoprepare)

from geoprepare import geomerge
geomerge.run(cfg_geoprepare)
```

### geocif (indices, ML, agmet, analysis, experiments)

```python
from geocif import indices_runner
indices_runner.run(cfg_geocif)

from geocif import geocif_runner
geocif_runner.run(cfg_geocif)

from geocif.agmet import geoagmet
geoagmet.run(cfg_geocif)

from geocif import analysis
analysis.run(cfg_geocif)

from geocif import experiments
experiments.run(cfg_geocif, n_trials=30)

from geocif import yield_outlook
yield_outlook.run(cfg_geocif)  # uses config defaults (10 years, mean)
# yield_outlook.run(cfg_geocif, current_year=2026, n_years=10, aggregation="median")
```

### ML models

geocif supports the following model types (configured via `models` in `[DEFAULT]`):

| Model | Key | Type |
|-------|-----|------|
| CatBoost | `catboost` | Gradient boosting |
| XGBoost | `xgboost` | Gradient boosting |
| TabPFN | `tabpfn` | Prior-fitted network |
| TabICL | `tabicl` | In-context learning |
| NGBoost | `ngboost` | Natural gradient boosting |
| YDF | `ydf` | Yggdrasil decision forests |
| Oblique RF | `oblique` | Oblique random forest |
| Cubist | `cubist` | Rule-based regression |
| MERF | `merf` | Mixed effects random forest |
| Linear | `linear` | LassoCV / LogisticRegressionCV |
| GAM | `gam` | Generalized additive model |
| GeoSpaNN | `geospaNN` | Geospatial neural network |
| Median | `median` | Median baseline |
| Analog | `analog` | Analogous year baseline |

### Feature selection methods

Configured via `feature_selection` in `[ML]`:

`none`, `SelectKBest`, `BorutaPy`, `Leshy`, `gOMP`, `RFECV`, `RFE`, `lasso`, `mrmr`, `SHAP`, `stabl`, `PowerShap`, `BorutaShap`, `Genetic`, `feature_engine`, `multi`

### Spatial neighbor features

Optional GraphSAGE-style preprocessing that computes yield-correlation-weighted averages of neighboring regions' features. Enabled via `[ML]`:

```ini
use_spatial_neighbors = True
spatial_neighbor_method = knn   ; knn or full
spatial_neighbor_k = 5          ; number of nearest neighbors
```

For each admin region, the neighbor graph is built from training data using haversine distances and Pearson yield correlations as edge weights. Neighbor-aggregated features are added as `nbr_*` columns and flow through standard feature selection.

### Experiments output

The experiments runner writes to a dedicated DB and analysis folder under `dir_output`:

```
{dir_output}/
└── ml/
    ├── db/
    │   └── experiments_{MMMM_DD_YYYY_HH}H.db
    │
    └── analysis/
        └── {MMMM_DD_YYYY}/
            ├── experiments/                            # Experiment 0 (model comparison)
            │   ├── experiment_metrics.csv
            │   ├── heatmap_models.png
            │   ├── boxplot_models.png
            │   ├── regional_mape_models_{country}.png
            │   ├── error_distribution_models.png
            │   └── metric_comparison.png
            │
            └── optimization/                           # Optuna hyperparameter search
                ├── optuna_trials.csv
                ├── best_params.csv
                ├── convergence.png
                ├── optimization_history.png
                ├── param_importances.png
                └── parallel_coordinate.png
```

### Outlook output

The yield outlook runner produces a diverging choropleth map showing current forecast yield as a percentage of the historical mean/median prediction per region, plus a combined CSV.

```
{dir_output}/
└── ml/
    └── analysis/
        └── {MMMM_DD_YYYY}/
            └── outlook/
                ├── yield_outlook_{country}_{crop}_{model}_{stage}_{year}.png
                └── yield_outlook_{year}.csv
```
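
The quantity shown on the map can be sketched as follows. The function name and the sample numbers are hypothetical, and the actual runner may compute the baseline differently; the sketch only illustrates "current forecast as a percentage of the historical mean or median prediction".

```python
from statistics import mean, median

def outlook_percent(current, historical, aggregation="mean"):
    """Current forecast as a percentage of the aggregated historical predictions."""
    aggregate = {"mean": mean, "median": median}[aggregation]
    baseline = aggregate(historical)
    return 100.0 * current / baseline

# Hypothetical predictions (t/ha) for one region over five past years.
historical = [2.0, 2.5, 3.0, 2.5, 2.5]
pct_mean = outlook_percent(2.75, historical)
pct_median = outlook_percent(2.75, historical, aggregation="median")
print(pct_mean, pct_median)
```

Values above 100 indicate a forecast above the historical baseline; values below 100 indicate a forecast below it.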

## Config file documentation

### geobase.txt

Shared paths and dataset settings. All directory paths are derived from `dir_base`.

```ini
[PATHS]
dir_base = /gpfs/data1/cmongp1/GEO

dir_inputs = ${dir_base}/inputs
dir_logs = ${dir_base}/logs
dir_download = ${dir_inputs}/download
dir_intermed = ${dir_inputs}/intermed
dir_metadata = ${dir_inputs}/metadata
dir_condition = ${dir_inputs}/crop_condition
dir_crop_inputs = ${dir_condition}/crop_t20

dir_boundary_files = ${dir_metadata}/boundary_files
dir_crop_calendars = ${dir_metadata}/crop_calendars
dir_crop_masks = ${dir_metadata}/crop_masks
dir_images = ${dir_metadata}/images
dir_production_statistics = ${dir_metadata}/production_statistics

dir_output = ${dir_base}/outputs

[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI', 'ESI', 'NSIDC', 'AEF']
```
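
The `${...}` references match the syntax of `ExtendedInterpolation` in Python's `configparser`. The sketch below assumes that the file is read this way (the source does not state which parser is used) and uses a shortened, hypothetical copy of `geobase.txt`. List-valued options such as `datasets` arrive as strings and can be converted with `ast.literal_eval`.

```python
import ast
import configparser

geobase_txt = """
[PATHS]
dir_base = /data/GEO
dir_inputs = ${dir_base}/inputs
dir_metadata = ${dir_inputs}/metadata
dir_crop_masks = ${dir_metadata}/crop_masks

[DATASETS]
datasets = ['CHIRPS', 'CPC', 'NDVI']
"""

parser = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
parser.read_string(geobase_txt)

# Nested references are resolved when the value is accessed.
print(parser["PATHS"]["dir_crop_masks"])

# Option values are strings; literal_eval safely turns the list literal into a list.
datasets = ast.literal_eval(parser["DATASETS"]["datasets"])
print(datasets)
```

Changing `dir_base` alone is then enough to relocate every derived directory.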

### countries.txt

Single source of truth for per-country config. Shared by both geoprepare and geocif.

```ini
[DEFAULT]
boundary_file = gaul1_asap_v04.shp
admin_level = admin_1
seasons = [1]
crops = ['maize']
category = AMIS
use_cropland_mask = False
calendar_file = crop_calendar.csv

; AMIS countries (inherit from DEFAULT, override crops if needed)
[argentina]
crops = ['soybean', 'winter_wheat', 'maize']

; EWCM countries (full per-country config)
[kenya]
category = EWCM
admin_level = admin_1
seasons = [1, 2]
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']

[malawi]
category = EWCM
admin_level = admin_2
use_cropland_mask = True
boundary_file = adm_shapefile.gpkg
calendar_file = EWCM_2025-04-21.xlsx
crops = ['maize']
```
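
How a country section falls back to `[DEFAULT]` can be checked with a few lines of `configparser`. The snippet is a sketch on a trimmed copy of the example above and assumes the same fallback semantics as the real loader.

```python
import configparser

countries_txt = """
[DEFAULT]
boundary_file = gaul1_asap_v04.shp
admin_level = admin_1
category = AMIS

[argentina]
crops = ['soybean', 'winter_wheat', 'maize']

[malawi]
category = EWCM
admin_level = admin_2
boundary_file = adm_shapefile.gpkg
"""

parser = configparser.ConfigParser()
parser.read_string(countries_txt)

for country in parser.sections():  # DEFAULT is not listed as a section
    section = parser[country]
    print(country, section["category"], section["admin_level"], section["boundary_file"])
```

Here `argentina` inherits `category`, `admin_level`, and `boundary_file` from `[DEFAULT]`, while `malawi` overrides all three.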

### crops.txt

Crop mask filenames and calendar category definitions.

```ini
; Crop masks
[maize]
mask = Percent_Maize.tif

[winter_wheat]
mask = Percent_Winter_Wheat.tif

[sorghum]
mask = cropland_v9.tif

; Calendar categories
[EWCM]
use_cropland_mask = True
calendar_file = EWCM_2026-01-05.xlsx
crops = ['maize', 'sorghum', 'millet', 'rice', 'winter_wheat', 'teff']
eo_model = ['aef', 'nsidc_surface', 'nsidc_rootzone', 'ndvi', 'cpc_tmax', 'cpc_tmin', 'chirps', 'chirps_gefs', 'esi_4wk']

[AMIS]
calendar_file = AMISCM_2026-01-05.xlsx
```

### geoextract.txt

Extraction-only settings for geoprepare. Loaded last so its `[DEFAULT]` overrides shared defaults.

```ini
[DEFAULT]
method = JRC
redo = False
threshold = True
floor = 20
ceil = 90
countries = ["malawi"]
forecast_seasons = [2022]

[PROJECT]
parallel_extract = True
parallel_merge = False
```

### geocif.txt

Indices, ML, and agmet settings for geocif. Country overrides go here when geocif needs different values than countries.txt (e.g., a subset of crops).

```ini
[AGMET]
eo_plot = ['ndvi', 'chirts_era5_tmax', 'chirts_era5_tmin', 'chirps', 'esi_4wk', 'nsidc_surface', 'nsidc_rootzone']
logo_harvest = harvest.png
logo_geoglam = geoglam.png

; Country overrides (only where geocif differs from countries.txt)
[ethiopia]
crops = ['winter_wheat']

[bangladesh]
crops = ['rice']
admin_level = admin_2
boundary_file = bangladesh.shp

; ML model definitions
[catboost]
ML_model = True

[analog]
ML_model = False

[ML]
model_type = REGRESSION
target = Yield (tn per ha)
feature_selection = gOMP
cluster_strategy = single
check_yield_trend = False
use_spatial_neighbors = True
spatial_neighbor_method = knn
spatial_neighbor_k = 5
lag_yield_as_feature = True
lag_years = 3
median_yield_as_feature = False
median_years = 5
include_lat_lon_as_feature = False
panel_model = True
cat_features = ["Harvest Year", "Region_ID", "Region"]
outlook_n_years = 10        ; Number of historical years for yield outlook comparison
outlook_aggregation = mean  ; mean or median

[LOGGING]
log_level = INFO

[DEFAULT]
data_source = harvest
method = monthly_r
project_name = geocif
countries = ["kenya"]
crops = ['maize']
admin_level = admin_1
models = ['catboost']
seasons = [1]
threshold = True
floor = 20
```
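
The `[ML]` block carries trailing `;` comments on some lines. If you read these files yourself with Python's `configparser` (an assumption; geocif's own loader may differ), inline comments are only stripped when `inline_comment_prefixes` is set. The sketch below also shows the typed getters for boolean and integer options.

```python
import configparser

ml_txt = """
[ML]
use_spatial_neighbors = True
spatial_neighbor_k = 5          ; number of nearest neighbors
outlook_n_years = 10        ; Number of historical years for yield outlook comparison
outlook_aggregation = mean  ; mean or median
"""

# Without inline_comment_prefixes, the "; ..." text would stay part of the value.
parser = configparser.ConfigParser(inline_comment_prefixes=(";",))
parser.read_string(ml_txt)
ml = parser["ML"]

use_neighbors = ml.getboolean("use_spatial_neighbors")  # "True" -> True
n_years = ml.getint("outlook_n_years")                  # "10" -> 10
aggregation = ml["outlook_aggregation"]                 # plain string
print(use_neighbors, n_years, aggregation)
```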

### FLDAS forecast overlay

When FLDAS columns are present in the merged data (e.g. `fldas_tair_tavg_lead0` through `_lead5`), agmet plots automatically overlay forecast dots on matching panels:

| FLDAS variable | Target panel |
|---|---|
| `fldas_tair_tavg` | Temperature |
| `fldas_totalprecip_tavg` | Daily precipitation |
| `fldas_soilmoist_tavg` | Soil moisture (surface) |

Each lead time (0–5) appears as a diamond marker with decreasing opacity (lead 0 = most opaque). Markers beyond the harvest date are suppressed. No config changes are needed; detection is automatic.
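
Column detection of this kind can be sketched with a regular expression. The helper names, the linear opacity ramp, and the minimum opacity of 0.3 are hypothetical choices for illustration, not geocif's actual code.

```python
import re

# Matches e.g. "fldas_tair_tavg_lead3" and captures the variable and the lead.
LEAD_COLUMN = re.compile(r"^(fldas_[a-z]+_tavg)_lead([0-5])$")

PANELS = {
    "fldas_tair_tavg": "Temperature",
    "fldas_totalprecip_tavg": "Daily precipitation",
    "fldas_soilmoist_tavg": "Soil moisture (surface)",
}

def find_fldas_overlays(columns):
    """Map each target panel to its sorted list of (lead, column) pairs."""
    overlays = {}
    for col in columns:
        match = LEAD_COLUMN.match(col)
        if match and match.group(1) in PANELS:
            panel = PANELS[match.group(1)]
            overlays.setdefault(panel, []).append((int(match.group(2)), col))
    return {panel: sorted(pairs) for panel, pairs in overlays.items()}

def lead_alpha(lead, max_lead=5, min_alpha=0.3):
    """Opacity falling linearly from 1.0 at lead 0 to min_alpha at max_lead."""
    return 1.0 - (1.0 - min_alpha) * lead / max_lead

columns = ["ndvi", "fldas_tair_tavg_lead1", "fldas_tair_tavg_lead0", "fldas_soilmoist_tavg_lead3"]
overlays = find_fldas_overlays(columns)
print(overlays)
```

When no FLDAS columns are present, `find_fldas_overlays` returns an empty dict and nothing is overlaid.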

## Release

To publish a new version to PyPI:

1. Bump `__version__` in `geocif/__init__.py` and `version` in `pyproject.toml`
2. Build and upload:
   ```bash
   uv build
   uvx twine upload dist/geocif-<version>*
   ```
3. Commit:
   ```bash
   git add geocif/__init__.py pyproject.toml
   git commit -m "Bump to <version>"
   ```

## Credits

This project was supported by the NASA Harvest Consortium under NASA Applied Sciences Grant No. 80NSSC17K0625 and by the NASA Acres Consortium under NASA Grant No. 80NSSC23M0034.
