Metadata-Version: 2.4
Name: projectPCA
Version: 0.3.1
Summary: Project (ancient) human genomes onto pre-computed standard PCA
Author-email: Harald Ringbauer <harald_ringbauer@eva.mpg.de>
Maintainer-email: Harald Ringbauer <harald_ringbauer@eva.mpg.de>
License-Expression: GPL-3.0-or-later
Keywords: PCA,ancient DNA,eigenstrat,low coverage
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: matplotlib
Provides-Extra: plink
Requires-Dist: bed-reader; extra == "plink"
Provides-Extra: interactive
Requires-Dist: plotly; extra == "interactive"
Dynamic: license-file

# projectPCA
Project genomes onto pre-computed principal components that are widely used in ancient DNA research. This enables fast analysis without re-computing the principal components. The software accepts ancient DNA data in `eigenstrat` or `PLINK` format as input. No modern samples are required, as the package includes the pre-computed PCA weights and PC coordinates for relevant modern samples (based on publicly available Human Origins array data).

## Installation
The package `projectPCA` is available as a Python package via `pip`. To install it, run:

```
python3 -m pip install projectPCA
```
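
The package metadata also declares two optional extras, `plink` (which pulls in `bed-reader`) and `interactive` (which pulls in `plotly`). They can be requested with `python3 -m pip install "projectPCA[plink,interactive]"`. The extra names suggest they back PLINK input and interactive `.html` figures, but that mapping is an inference. The following minimal sketch checks which of the two optional modules are importable in the current environment; `available_extras` is a hypothetical helper, not part of `projectPCA`.

```python
from importlib.util import find_spec

# Extra name -> import name of the module it installs.
# The distribution `bed-reader` is imported as `bed_reader`.
OPTIONAL_MODULES = {"plink": "bed_reader", "interactive": "plotly"}


def available_extras():
    """Map each extra to True if its module can be found in this environment."""
    return {extra: find_spec(module) is not None
            for extra, module in OPTIONAL_MODULES.items()}


print(available_extras())
```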

## List of available PCAs
As of early 2026, two pre-computed PCAs are officially bundled with `projectPCA`. The code in brackets is the value to pass to the `pca` keyword to select that PCA.

- **HO Westeurasia (HO)**
Standard Western Eurasian PCA, which is widely used in aDNA studies. PC1 corresponds to West-East, and PC2 to North-South.

- **HO Eurasian (EUAS)**
Standard whole-Eurasian PCA, widely used in aDNA studies. Well suited to resolving West versus East Eurasian ancestry (on PC1). PC2 generally corresponds to North-South.


## Usage

### Project single samples
To project onto a PCA, the key function is `project_eigenstrat`. To import it and run a single sample, use:

```
from projectPCA.run import project_eigenstrat

project_eigenstrat(es_path="/mnt/archgen/Autorun_eager/eager_outputs/TF/SUA/SUA002/genotyping/pileupcaller.double",
                   pca="HO", es_type="default")
```
This function also returns a dataframe with the PCA coordinates. Note that `es_path` is the common prefix of the eigenstrat files, that is, the path of the `.geno` file without the `.geno` suffix.
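
Because the prefix itself can contain a dot (as in `pileupcaller.double`), it is easy to build the file names incorrectly. The sketch below checks that the three usual eigenstrat files (`.geno`, `.snp`, `.ind`) exist for a given prefix before projecting. `missing_eigenstrat_files` is a hypothetical helper, and whether `projectPCA` needs all three files for every `es_type` is an assumption.

```python
from pathlib import Path

EIGENSTRAT_SUFFIXES = (".geno", ".snp", ".ind")


def missing_eigenstrat_files(es_path):
    """Return the expected eigenstrat files for prefix `es_path` that do not exist."""
    prefix = str(es_path)
    # Append the suffix as a plain string: Path.with_suffix() would replace the
    # ".double" part of a prefix such as "pileupcaller.double".
    candidates = [prefix + suffix for suffix in EIGENSTRAT_SUFFIXES]
    return [name for name in candidates if not Path(name).is_file()]
```

An empty list means all three files were found; otherwise the list names the files to look for.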

The keyword `pca` denotes which PCA type to project onto (see above). 

To save the figure, set the keyword `fig_path` to a non-empty path; the program then saves the resulting figure there.
If the path ends in `.html`, the figure is saved as an interactive plot, where you can hover over the individuals to see their labels (both ancient and modern reference samples). Otherwise, the figure is plotted and saved with `matplotlib`, in the format implied by the extension you provide (for example `.png` or `.pdf`).

```
project_eigenstrat(es_path="/mnt/archgen/Autorun_eager/eager_outputs/TF/SUA/SUA002/genotyping/pileupcaller.double",
                   pca="EUAS", es_type="unpacked_fast", plot_bgrd_c=False, fig_path='./figs/SUA002_EUAS.html')
```
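
The README does not say whether a missing output folder such as `./figs` is created automatically. A minimal sketch that creates it beforehand and reports which kind of figure to expect from the extension rule described above; `prepare_fig_path` is a hypothetical helper, and the exact, case-sensitive match on `.html` is an assumption.

```python
from pathlib import Path


def prepare_fig_path(fig_path):
    """Create the parent folder of `fig_path` and report the expected plot type."""
    path = Path(fig_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Rule from the text: ".html" gives an interactive plot, anything else matplotlib.
    kind = "interactive" if path.suffix == ".html" else "matplotlib"
    return str(path), kind
```

The returned path string can then be passed on as `fig_path`.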

### Project multiple samples
It is also possible to project multiple samples with the keyword `iids`. If the list is empty (`iids=[]`, the default), all samples in the file are projected and plotted. If you specify a list of individual IDs, only the individuals with these IDs are projected.
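
One way to build such a list is to read the IDs from the eigenstrat `.ind` file, whose lines hold an individual ID, a sex code, and a population label. The sketch below is a minimal illustration: `read_ind_ids` and `select_iids` are hypothetical helpers, and it assumes that `iids` is matched against the first `.ind` column.

```python
def read_ind_ids(ind_path):
    """Return the individual IDs (first column) of an eigenstrat .ind file."""
    ids = []
    with open(ind_path) as handle:
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                ids.append(fields[0])
    return ids


def select_iids(all_ids, prefix):
    """Keep the IDs that start with `prefix`, preserving file order."""
    return [iid for iid in all_ids if iid.startswith(prefix)]


# Intended use (commented out, as it needs projectPCA and real data):
# iids = select_iids(read_ind_ids(es_path + ".ind"), "SUA")
# project_eigenstrat(es_path=es_path, pca="HO", es_type="default", iids=iids)
```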


### Project PLINK files
To project PLINK files, set the keyword `es_type="plink"` and provide the path of the PLINK files without the file suffix:

```
project_eigenstrat(es_path="/mnt/archgen/users/hringbauer/git/EPIDEMIC/output/plink/bd_ptn_335",
                   pca="EUAS", es_type="plink", iids=[],
                   plot_bgrd_c=False, verbose=True, flip=True, 
                   fig_path='/mnt/archgen/users/hringbauer/git/projectPCA/figs/ptn335PLINK_EUAS.html')
```


@Harald Ringbauer, 2026
