Metadata-Version: 2.4
Name: pubmedparser2
Version: 2.1.3
Summary: Download and parse pubmed publication data
License: MIT
License-File: LICENSE
Keywords: publication,network,MEDLINE,PubMed,references
Author: David Connell
Author-email: davidconnell12@gmail.com
Maintainer: David Connell
Maintainer-email: davidconnell12@gmail.com
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: appdirs (>=1.4.4,<2.0.0)
Requires-Dist: psutil (>=7.0.0,<8.0.0)
Project-URL: Repository, https://github.com/net-synergy/pubmedparser
Description-Content-Type: text/markdown

![PyPI - Version](https://img.shields.io/pypi/v/pubmedparser2)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pubmedparser2)

Read XML files and pull out selected values. Values to collect are
determined by paths found in a [structure file](#structure-file). The
structure file also includes a key which associates the values with a
parent element and names, which determine which file to place the
elements in.

Files can be passed as either gzipped or uncompressed XML files or from
standard in.

For more info on Pubmed's XML files see:
[pubmed\_190101\_.dtd.](https://dtd.nlm.nih.gov/ncbi/pubmed/doc/out/190101/index.html)

Usage:

``` python
import pubmedparser
import pubmedparser.ftp

# Download data
files = pubmedparser.ftp.download(range(1, 6))

# Read XML files using a YAML file to describe what data to collect.
data_dir = "file_example"
structure_file = "example/structure.yml"
results = pubmedparser.read_xml(files, structure_file, data_dir)
```

See [the example
file](https://github.com/net-synergy/pubmedparser/blob/master/example/creating_graphs.py)
for more options.

In python, the structure file can be replaced with a dictionary of
dictionaries as well.

Or, as a CLI:

``` bash
xml_read --cache-dir=cache --structure-file=structure.yml \
    data/*.xml.gz
```

## Installing with pip

``` bash
pip install pubmedparser2
```

## Building python package

Requires `zlib`.

Clone the repository and cd into the directory. Then use [poetry](https://python-poetry.org/docs) to build and install the package.

``` bash
make python
```

# Structure file

The structure file is a YAML file containing key-value pairs for
different tags and paths. There are two required keys: `root` and `key`.
`Root` provide the top-level tag, in the case of the pubmed files this
will be `PubmedArticleSet`.

``` bash
root: "/PubmedArticleSet"
```

The `/` is not strictly required as the program will ignore them, but
they are used to conform to the
[xpath](https://en.wikipedia.org/wiki/XPath) syntax (although this
program does not handle all cases for `xpath`).

Only tags below the root tag will be considered and the parsing will
terminate once the program has left the root of the tree.

`Key` is a reference tag. In the pubmed case, all data is with respect
to a publication, so the key should identify the publication the values
are linked to. The `PMID` tag is a suitable candidate.

``` bash
key: "/PubmedArticle/MedlineCitation/PMID"
```

After `root`, all paths are taken as relative to the root node.

The other name-pairs in the file determine what other items to collect.
These can either be a simple name and path, like the key, such as:

``` bash
Language: "/PubmedArticle/MedlineCitation/Article/Language"
Keywords: "/PubmedArticle/MedlineCitation/KeywordList/Keyword"
```

Or they can use a hierarchical representation to get multiple values
below a child. This is mainly used to handle lists of items where there
is an indefinite number of items below the list.

``` bash
Author: {
  root: "/PubmedArticle/MedlineCitation/Article/AuthorList",
  key: "/Author/auto_index",
  LastName: "/Author/LastName",
  ForeName: "/Author/ForeName",
  Affiliation: "/Author/AffiliationInfo/Affiliation",
  Orcid: "/Author/Identifier/[@Source='ORCID']"
}
```

Here, all paths are relative to the sub-structures `root` path, which is
in turn relative to the parent structure's `root`. This sub-structure
uses the same rules as the parent structure, so it needs both a `root`
and `key` name-value pair. The results of searching each path are
written to separate files. Each file gets a column for the parent and
child key. So in this case, each element of the author is linked by an
author key and that is related to the publication they authored through
the parent key.

The main parser is called recursively to parse this structure so it's
worth thinking about what the root should be under the context that the
parser will be called with that root. This means if, instead of stopping
at `/AuthorList`, `/Author` was added to the end of the root, the parser
would be called for each individual author, instead of once per author
list, leading to all author's getting the index 0.

There are a number of additional syntax constructs to note in the above
example. The key uses the special name `auto_index`, since there is no
author ID in the XML data, an index is used to count the authors in the
order they appear. This resets for each publication and starts at 0.
Treating the `auto_index` as the tail of a path allows you to control
when the indexing occurs—the index is incremented whenever it hits a
`/Author` tag.

In addition to the `auto_index` key, there is a second special index
name, `condensed`.

``` bash
Reference: {
  root: "/PubmedArticle/PubmedData/ReferenceList/Reference/ArticleIdList"
  key: "/condensed"
  PMID: "/ArticleId/[@IdType='pubmed']"
  DOI: "/ArticleId/[@IdType='doi']"
}
```

In the case of `condensed`, instead of writing the results to separate
files, they will printed as columns in the same file, and therefore do
not need an additional key for the sub-structure. If any of the elements
are missing, they will be left blank, for example, if the parser does
not find a pubmed ID for a given reference, the row will look like
`"%s\t\t%s"` where the first string will contain the parent key (the
`PMID` of the publication citing this reference) and the second string
will contain the reference's `DOI`.

The `/[@attribute='value']` syntax at the end of a path tells the parser
to only collect an element if it has an attribute and the attribute's
value matches the supplied value. Similarly the `/@attribute` syntax,
tells the parser to collect the value of the attribute `attribute` along
with the element's value. Then both values will be written to the output
file. Currently only a single attribute can be specified.

Lastly, there is a special syntax for writing condensed sub-structures:

``` bash
Date: "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}"
```

The `{child,child,child}` syntax allows you to select multiple children
at the same level to be printed to a single file. This is useful when
multiple children make up a single piece of information (i.e. the
publication date).

A similar example structure file can be found in the example directory
of this project at:
[file:./example/structure.yml](./example/structure.yml).

# Structure dictionary

The structure of the xml data to read can also be described as a python
dictionary of dictionaries.

The form is similar to the file:

``` python
structure = {
    "root": "//PubmedArticleSet",
    "key": "/PubmedArticle/MedlineCitation/PMID",
    "DOI": "/PubmedArticle/PubmedData/ArticleIdList/ArticleId/[@IdType='doi']",
    "Date": "/PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/{Year,Month,Day}",
    "Journal": "/PubmedArticle/MedlineCitation/Article/Journal/{Title,ISOAbbreviation}",
    "Language": "/PubmedArticle/MedlineCitation/Article/Language",
    "Author": {
        "root": "/PubmedArticle/MedlineCitation/Article/AuthorList",
        "key": "/Author/auto_index",
        "LastName": "/Author/LastName",
        "ForName": "/Author/ForeName",
        "Affiliation": "/Author/AffiliationInfo/Affiliation",
        "Orcid": "/Author/Identifier/[@Source='ORCID']",
    },
    "Grant": {
        "root": "/PubmedArticle/MedlineCitation/Article/GrantList",
        "key": "/Grant/auto_index",
        "ID": "/Grant/GrantID",
        "Agency": "/Grant/Agency",
    },
    "Chemical": "/PubmedArticle/MedlineCitation/ChemicalList/Chemical/NameOfSubstance/@UI",
    "Qualifier": "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/QualifierName/@UI",
    "Descriptor": "/PubmedArticle/MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName/@UI",
    "Keywords": "/PubmedArticle/MedlineCitation/KeywordList/Keyword",
    "Reference": {
        "root": (
            "/PubmedArticle/PubmedData/ReferenceList/Reference/ArticleIdList"
        ),
        "key": "/condensed",
        "PMID": "/ArticleId/[@IdType='pubmed']",
        "DOI": "/ArticleId/[@IdType='doi']",
    },
}
```

This can then be passed to `pubmedparser.read_xml` in place of the
structure file.

