Metadata-Version: 2.4
Name: normal_grain_merge
Version: 0.1.3
Summary: Fused normal and grain merge C extension
Author: Samuel Howard
License: MIT
Project-URL: Homepage, https://github.com/samhaswon/normal_grain_merge
Project-URL: Bug Tracker, https://github.com/samhaswon/normal_grain_merge/issues
Keywords: image,processing
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: C
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Dynamic: license-file

# normal_grain_merge

This package implements a fused version of the normal and grain merge blend modes.
Grain merge is performed on *s* (skin) and *t* (texture), and the result is normal-merged with *b* (base).
Subscripts indicate channels, with alpha (α) channels broadcast to three channels.

$$
(((\mathrm{t_{rgb}} + \mathrm{s_{rgb}} - 0.5) * \mathrm{t_\alpha} + \mathrm{t_{rgb}} * (1 - \mathrm{t_\alpha})) * (1 - 0.3) + \mathrm{s_{rgb}} * 0.3) * \mathrm{t_\alpha} + \mathrm{b_{rgb}} * (1 - \mathrm{t_\alpha})
$$
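To make the formula easier to follow, here is a minimal pure-Python sketch that evaluates it for a single channel of a single pixel. It assumes all values are floats normalized to [0, 1], which the formula does not state explicitly. The function name `blend_channel` and the constant `SKIN_WEIGHT` are hypothetical and are not part of this package's API. The sketch transcribes the formula literally and does no clamping or rounding, so the C kernels may differ in those details.

```python
SKIN_WEIGHT = 0.3  # hypothetical name for the 0.3 constant in the formula


def blend_channel(b, t, s, t_alpha):
    """Evaluate the formula for one channel of one pixel.

    Assumes every argument is a float in [0, 1] (an assumption, not
    something the formula states).
    """
    # Grain merge of texture and skin, weighted by the texture alpha.
    grain = (t + s - 0.5) * t_alpha + t * (1.0 - t_alpha)
    # Mix the grain-merged value with the original skin value.
    mixed = grain * (1.0 - SKIN_WEIGHT) + s * SKIN_WEIGHT
    # Normal merge of the mixed value over the base.
    return mixed * t_alpha + b * (1.0 - t_alpha)


# With a fully transparent texture, the base value passes through unchanged.
print(blend_channel(b=0.2, t=0.5, s=0.5, t_alpha=0.0))
```

Writing the three stages as separate variables mirrors the structure described in the text: grain merge first, then the 0.3 skin mix, then the normal merge with the base.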

## Installation

```shell
pip install normal-grain-merge
```

## Usage

```py
import numpy as np
from normal_grain_merge import normal_grain_merge, KernelKind


# Example arrays
base = np.zeros((100, 100, 3), dtype=np.uint8)
texture = np.zeros((100, 100, 3), dtype=np.uint8)
skin = np.zeros((100, 100, 4), dtype=np.uint8)
im_alpha = np.zeros((100, 100), dtype=np.uint8)

result_scalar = normal_grain_merge(base, texture, skin, im_alpha, KernelKind.KERNEL_SCALAR.value)
print(result_scalar.shape, result_scalar.dtype)
```

`KernelKind` defines four options: three kernel implementations plus automatic selection.

- `KERNEL_AUTO`: Automatically chooses the kernel, preferring AVX2.
- `KERNEL_SCALAR`: Portable scalar implementation.
- `KERNEL_SSE42`: SSE4.2 intrinsics kernel. Likely better on AMD CPUs.
- `KERNEL_AVX2`: AVX2 intrinsics kernel. Likely better on Intel CPUs.

### Parameters

All input matrices should have the same height and width.

#### `base`

The base image that the texture is applied to.
RGB or RGBA; an alpha channel, if present, is dropped.

#### `texture`

The texture to apply.
RGB or RGBA; an alpha channel, if present, is applied.

#### `skin`

RGBA. The segmented portion of `base` that receives the texture,
i.e. the "skin" of the object the texture is applied to.

#### `im_alpha`

The alpha channel of the `skin` parameter, passed as a separate array.
This is mostly a holdover from the Python implementation to deal with NumPy.

#### `kernel`

One of the `KernelKind` values listed above.
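The shape requirements in this section can be checked before calling into the extension. The following is a sketch only: `check_shapes` is a hypothetical helper, not part of this package, and it works on plain shape tuples such as `array.shape`. The rule that `im_alpha` is two-dimensional is an assumption taken from the usage example, not a documented requirement.

```python
def check_shapes(base_shape, texture_shape, skin_shape, alpha_shape):
    """Raise ValueError if the shapes break the requirements described above.

    Hypothetical helper; takes shape tuples such as ``array.shape``.
    """
    shapes = {
        "base": tuple(base_shape),
        "texture": tuple(texture_shape),
        "skin": tuple(skin_shape),
        "im_alpha": tuple(alpha_shape),
    }
    # All inputs should share the same height and width.
    height_width = shapes["base"][:2]
    for name, shape in shapes.items():
        if shape[:2] != height_width:
            raise ValueError(f"{name} has height/width {shape[:2]}, expected {height_width}")
    # base and texture are RGB or RGBA; skin is RGBA.
    allowed_channels = {"base": (3, 4), "texture": (3, 4), "skin": (4,)}
    for name, channels in allowed_channels.items():
        shape = shapes[name]
        if len(shape) != 3 or shape[2] not in channels:
            raise ValueError(f"{name} must have a channel count in {channels}, got shape {shape}")
    # Assumption from the usage example: im_alpha is a 2-D (height, width) array.
    if len(shapes["im_alpha"]) != 2:
        raise ValueError(f"im_alpha must be 2-D, got shape {shapes['im_alpha']}")


# Shapes from the usage example pass without raising.
check_shapes((100, 100, 3), (100, 100, 3), (100, 100, 4), (100, 100))
```

With NumPy arrays this would be called as `check_shapes(base.shape, texture.shape, skin.shape, im_alpha.shape)`.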

## Performance

I wrote this because NumPy was too slow with this operation in the hot path.
So, I decided to write a SIMD version that does the type casting outside NumPy, with only the intermediate values held in FP32.

How much of a speedup is this? All numbers are from a Ryzen 7 4800H running Ubuntu 24.04 and Python 3.12.3.

| Method/Kernel     | Average Iteration Time (RGB) | Average Iteration Time (RGBA) |
|-------------------|------------------------------|-------------------------------|
| C scalar kernel   | 0.016109s                    | 0.016679s                     |
| C SSE4.2 kernel   | 0.002446s                    | 0.002478s                     |
| C AVX2 kernel     | 0.002336s                    | 0.002520s                     |
| NumPy version     | 0.160623s                    | 0.258044s                     |
| Old NumPy version | 0.248160s                    | 0.232046s                     |

| Method Comparison  | Time Reduction (RGB) | Time Reduction (RGBA) |
|--------------------|----------------------|-----------------------|
| NumPy -> scalar    | 89.9709%             | 93.5363%              |
| NumPy -> SSE4.2    | 98.4769%             | 99.0397%              |
| NumPy -> AVX2      | 98.5454%             | 99.0235%              |
| Old np -> SSE4.2   | 99.0142%             | 98.9321%              |
| Old np -> AVX2     | 99.0585%             | 98.9141%              |
| C scalar -> SSE4.2 | 84.8135%             | 85.1437%              |
| C scalar -> AVX2   | 85.4964%             | 84.8923%              |
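
The percentages in the comparison table match the relative reduction in iteration time, `1 - new / old`, computed from the timings in the first table. The short sketch below reproduces one entry and also shows the equivalent speedup factor (`old / new`), which some readers may find easier to interpret. The function names are illustrative only.

```python
def time_reduction_percent(old_seconds, new_seconds):
    """Relative reduction in iteration time, as a percentage."""
    return (1.0 - new_seconds / old_seconds) * 100.0


def speedup_factor(old_seconds, new_seconds):
    """How many times faster the new method is than the old one."""
    return old_seconds / new_seconds


# RGB timings from the first table: NumPy version vs. C scalar kernel.
numpy_rgb = 0.160623
scalar_rgb = 0.016109

print(f"{time_reduction_percent(numpy_rgb, scalar_rgb):.4f}%")
print(f"{speedup_factor(numpy_rgb, scalar_rgb):.2f}x")
```

For example, a time reduction of roughly 90% corresponds to a speedup factor of about 10x.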
