Metadata-Version: 2.1
Name: pydftracer
Version: 1.0.3
Summary: I/O profiler for deep learning python apps. Specifically for dlio_benchmark.
Home-page: https://github.com/hariharan-devarajan/dftracer
Author: Hariharan Devarajan (Hari)
Project-URL: Bug Reports, https://github.com/hariharan-devarajan/dftracer/issues
Project-URL: Source, https://github.com/hariharan-devarajan/dftracer
Keywords: profiler,deep learning,I/O,benchmark,NPZ,pytorch benchmark,tensorflow benchmark
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3 :: Only
Requires: pybind11
Requires: setuptools
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pybind11
Provides-Extra: dfanalyzer
Requires-Dist: seaborn >=0.13.2 ; extra == 'dfanalyzer'
Requires-Dist: bokeh >=2.4.2 ; extra == 'dfanalyzer'
Requires-Dist: pybind11 ; extra == 'dfanalyzer'
Requires-Dist: zindex-py ==0.0.2 ; extra == 'dfanalyzer'
Requires-Dist: pandas >=2.0.3 ; extra == 'dfanalyzer'
Requires-Dist: dask >=2023.5.0 ; extra == 'dfanalyzer'
Requires-Dist: distributed ; extra == 'dfanalyzer'
Requires-Dist: numpy >=1.24.3 ; extra == 'dfanalyzer'
Requires-Dist: pyarrow >=12.0.1 ; extra == 'dfanalyzer'
Requires-Dist: rich >=13.6.0 ; extra == 'dfanalyzer'
Requires-Dist: python-intervals >=1.10.0.post1 ; extra == 'dfanalyzer'
Requires-Dist: matplotlib >=3.7.3 ; extra == 'dfanalyzer'
Provides-Extra: test
Requires-Dist: pytest >=6.0 ; extra == 'test'

[![DFTracer Build and Test](https://github.com/hariharan-devarajan/dftracer/actions/workflows/ci.yml/badge.svg)](https://github.com/hariharan-devarajan/dftracer/actions/workflows/ci.yml)
[![Coverage Status](https://coveralls.io/repos/github/hariharan-devarajan/dftracer/badge.svg?branch=feature/apis)](https://coveralls.io/github/hariharan-devarajan/dftracer?branch=dev)
[![Documentation Status](https://readthedocs.org/projects/dftracer/badge/?version=latest)](https://dftracer.readthedocs.io/en/latest/?badge=latest)

# DFTracer v1.0.3
A multi-level profiler for capturing application functions and low-level system I/O calls from deep learning workloads.

Requirements for the profiler:
1. Python >= 3.7
2. pybind11

Requirements for the analyzer (installed by the `dfanalyzer` extra):
1. bokeh>=2.4.2
2. pybind11
3. [zindex_py](https://github.com/hariharan-devarajan/zindex.git)
4. pandas>=2.0.3
5. dask>=2023.5.0
6. distributed
7. numpy>=1.24.3
8. pyarrow>=12.0.1
9. rich>=13.6.0
10. python-intervals>=1.10.0.post1
11. matplotlib>=3.7.3
12. seaborn>=0.13.2

## Installation

DFTracer can be installed with pip, like most Python packages.
This works in both native Python environments and Conda environments.

### From PyPI

```bash
pip install pydftracer
```

### From GitHub

```bash
DFT_VERSION=dev
pip install git+https://github.com/hariharan-devarajan/dftracer.git@${DFT_VERSION}
```

### From source

```bash
git clone git@github.com:hariharan-devarajan/dftracer.git
cd dftracer
# Optional: check out a release tag. Skip this step to install the dev branch.
# For the latest stable version, use the master branch.
git checkout tags/<Release> -b <Release>
pip install .
```

For more build instructions, see the [build documentation](https://dftracer.readthedocs.io/en/latest/build.html).
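
To confirm that the install succeeded, the installed distribution can be queried from Python. The following is a minimal sketch, not part of DFTracer itself: `installed_version` is a hypothetical helper name, and it relies on the standard-library `importlib.metadata` module, which is available from Python 3.8 onward.

```python
from importlib import metadata


def installed_version(dist_name="pydftracer"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print(f"pydftracer {version}" if version else "pydftracer is not installed")
```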

## Usage

```python
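import os
from multiprocessing import get_context
from time import sleep

import numpy as np

from dftracer.logger import dftracer, dft_fn

# Working directory; the example writes its files under <cwd>/data.
cwd = os.getcwd()
os.makedirs(f"{cwd}/data", exist_ok=True)

# logfile and data_dir are None, so they are read from the environment.
log_inst = dftracer.initialize_log(logfile=None, data_dir=None, process_id=-1)
dft_fn = dft_fn("COMPUTE")

# Pool initializer, run once in each worker process (illustrative placeholder).
def init():
    print(f"Initializing worker with pid {os.getpid()}")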

# Example of using function decorators
@dft_fn.log
def log_events(index):
    sleep(1)

# Example of function spawning and implicit I/O calls
def posix_calls(val):
    index, is_spawn = val
    path = f"{cwd}/data/demofile{index}.txt"
    f = open(path, "w+")
    f.write("Now the file has more content!")
    f.close()
    if is_spawn:
        print(f"Calling spawn on {index} with pid {os.getpid()}")
        log_inst.finalize()  # This needs to be called to correctly finalize DFTracer.
    else:
        print(f"Not calling spawn on {index} with pid {os.getpid()}")

# NPZ calls internally issue POSIX calls.
def npz_calls(index):
    # print(f"{cwd}/data/demofile2.npz")
    path = f"{cwd}/data/demofile{index}.npz"
    if os.path.exists(path):
        os.remove(path)
    records = np.random.randint(255, size=(8, 8, 1024), dtype=np.uint8)
    record_labels = [0] * 1024
    np.savez(path, x=records, y=record_labels)

def main():
    log_events(0)
    npz_calls(1)
    with get_context('spawn').Pool(1, initializer=init) as pool:
        pool.map(posix_calls, ((2, True),))
    log_inst.finalize()

if __name__ == "__main__":
    main()
```

In this example, `dftracer.initialize_log` is called without `logfile` or `data_dir`, so the `DFTRACER_LOG_FILE` and `DFTRACER_DATA_DIR` environment variables must be set.
By default, the DFTracer mode is set to `FUNCTION`.
An example configuration for running it:

```bash
# The profiler appends the app name, process id, and .pfw for each app and process.
# The name of the final log file is ~/log_file-<APP_NAME>-<PID>.pfw
export DFTRACER_LOG_FILE=~/log_file
# Colon-separated list of paths to include in profiling
export DFTRACER_DATA_DIR=/dev/shm/:/p/gpfs1/$USER/dataset:$PWD/data
# Enable the profiler
export DFTRACER_ENABLE=1
```
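
The same settings can also be assembled in Python, for example when launching the traced script as a child process. This is a minimal sketch under stated assumptions: `dftracer_env` and `expected_log_name` are hypothetical helpers (not DFTracer APIs), `app.py` is a placeholder script name, and the log file name simply follows the `<prefix>-<APP_NAME>-<PID>.pfw` pattern described in the comments above.

```python
import os
from pathlib import Path


def dftracer_env(log_prefix, data_dirs, base_env=None):
    """Return a copy of the environment with the DFTracer variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Expand "~" explicitly, since no shell is involved here.
    env["DFTRACER_LOG_FILE"] = str(Path(log_prefix).expanduser())
    # DFTRACER_DATA_DIR is a colon-separated list of paths.
    env["DFTRACER_DATA_DIR"] = ":".join(str(d) for d in data_dirs)
    env["DFTRACER_ENABLE"] = "1"
    return env


def expected_log_name(log_prefix, app_name, pid):
    """Build a log file name following the <prefix>-<APP_NAME>-<PID>.pfw pattern."""
    return f"{Path(log_prefix).expanduser()}-{app_name}-{pid}.pfw"


if __name__ == "__main__":
    env = dftracer_env("~/log_file", ["/dev/shm/", os.path.join(os.getcwd(), "data")])
    for key in ("DFTRACER_LOG_FILE", "DFTRACER_DATA_DIR", "DFTRACER_ENABLE"):
        print(f"{key}={env[key]}")
    # To launch a traced script (app.py is a placeholder name):
    # subprocess.run([sys.executable, "app.py"], env=env, check=True)
```

Passing the dictionary through `env=` keeps the settings scoped to the child process instead of modifying the parent's environment.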

For more examples, see [Examples](https://dftracer.readthedocs.io/en/latest/examples.html).
