Metadata-Version: 2.1
Name: bof
Version: 0.3.4
Summary: Bag of Factors lets you analyze a corpus from its factors.
Home-page: https://github.com/balouf/bof
Author: Fabien Mathieu
Author-email: loufab@gmail.com
License: GNU General Public License v3
Keywords: bof
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Requires-Python: >=3.6
Description-Content-Type: text/x-rst
Requires-Dist: dill
Requires-Dist: numba
Requires-Dist: numpy
Requires-Dist: scipy

==============
Bag of Factors
==============


.. image:: https://img.shields.io/pypi/v/bof.svg
        :target: https://pypi.python.org/pypi/bof
        :alt: PyPI Status

.. image:: https://github.com/balouf/bof/workflows/build/badge.svg?branch=master
        :target: https://github.com/balouf/bof/actions?query=workflow%3Abuild
        :alt: Build Status

.. image:: https://github.com/balouf/bof/workflows/docs/badge.svg?branch=master
        :target: https://github.com/balouf/bof/actions?query=workflow%3Adocs
        :alt: Documentation Status


.. image:: https://codecov.io/gh/balouf/bof/branch/master/graphs/badge.svg
        :target: https://codecov.io/gh/balouf/bof/branch/master/graphs
        :alt: Code Coverage



Bag of Factors lets you analyze a corpus from its factors.


* Free software: GNU General Public License v3
* Documentation: https://balouf.github.io/bof/.


--------
Features
--------


Feature Extraction
-------------------

The `feature_extraction` module mimics the `sklearn.feature_extraction.text` module
(https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text),
with a focus on character-based extraction.

The main differences are:

- it is slightly faster;
- the features can be incrementally updated;
- it is possible to fit only a random sample of factors to reduce space and computation time.

The main entry point for this module is the `CountVectorizer` class, which mimics
its *scikit-learn* counterpart of the same name.
It is in fact very similar to sklearn's `CountVectorizer` used with the `char` or
`char_wb` analyzer option.
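A *factor* here is simply a substring of a document. As a rough illustration of what character-based extraction computes (this is a minimal pure-Python sketch, not the package's actual implementation, which is optimized and incremental), counting all factors up to a given length can be written as:

.. code-block:: python

    from collections import Counter

    def factors(text, max_len=3):
        """Enumerate all substrings (factors) of `text` up to length `max_len`."""
        return [text[i:i + k]
                for k in range(1, max_len + 1)
                for i in range(len(text) - k + 1)]

    def count_factors(corpus, max_len=3):
        """Bag-of-factors counts per document, akin to a char-ngram CountVectorizer."""
        return [Counter(factors(doc, max_len)) for doc in corpus]

    counts = count_factors(["riri", "fifi"], max_len=2)
    # counts[0]["ri"] == 2, counts[0]["i"] == 2

The actual class also supports incremental updates and sampling of factors, which this sketch does not attempt to reproduce.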


Fuzz
--------

The `fuzz` module mimics fuzzy string matching packages such as

- fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy)
- rapidfuzz (https://github.com/maxbachmann/rapidfuzz)

The main difference is that the Levenshtein distance is replaced by the Joint Complexity distance. The API is also
slightly changed to enable new features:

- The list of possible choices can be pre-trained (`fit`) to accelerate computation when
  a stream of queries is sent against the same list of choices.
- Instead of a single query, a list of queries can be used; computations will then be parallelized.

The main `fuzz` entry point is the `Process` class.
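To convey the idea of factor-based matching, here is an illustrative stand-in that scores choices by the overlap of their factor sets (a Jaccard similarity). This is a simplification for exposition only: the package's actual `Process` class uses the Joint Complexity distance, pre-training, and parallelization, none of which are reproduced here.

.. code-block:: python

    def factor_set(text, max_len=5):
        """Set of all substrings (factors) of `text` up to length `max_len`."""
        return {text[i:i + k]
                for k in range(1, max_len + 1)
                for i in range(len(text) - k + 1)}

    def similarity(query, choice, max_len=5):
        """Jaccard similarity of factor sets; a crude proxy for factor-based distances."""
        fq, fc = factor_set(query, max_len), factor_set(choice, max_len)
        return len(fq & fc) / len(fq | fc)

    def extract_best(query, choices):
        """Return the choice whose factor set overlaps the query's the most."""
        return max(choices, key=lambda c: similarity(query, c))

    extract_best("pyhton", ["python", "java", "ruby"])  # "python" wins on shared factors

Even with the transposition typo, "pyhton" shares far more factors with "python" than with the other choices, which is the intuition behind factor-based fuzzy matching.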



----------------
Getting Started
----------------

Look at examples from the reference_ section.


-------
Credits
-------

This package was created with Cookiecutter_ and the `francois-durand/package_helper_2`_ project template.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`francois-durand/package_helper_2`: https://github.com/francois-durand/package_helper_2
.. _reference: https://balouf.github.io/bof/reference/index.html


=======
History
=======

---------------------------------------------------
0.3.4 (2021-01-05): Cleaning
---------------------------------------------------

* Renaming process.py to fuzz.py to emphasize that the module aims at being an alternative to the fuzzywuzzy package.
* Removed modules FactorTree and JC. What they did is now essentially covered by the feature_extraction and fuzz
  modules.
* General cleaning / rewriting of the documentation.


---------------------------------------------------
0.3.3 (2021-01-01): Cython/Numba balanced
---------------------------------------------------

* All core CountVectorizer methods ported to Cython. Roughly 2.5X faster than the sklearn counterpart (mainly because some features like min_df/max_df are not implemented).
* Process numba methods NOT converted to Cython as Numba seems to be 20% faster for csr manipulation.
* Numba functions are cached to avoid compilation lag.


---------------------------------------------------
0.3.2 (2020-12-30): Going Cython
---------------------------------------------------

* First attempt to use Cython
* Right now, only the `fit_transform` method of `CountVectorizer` has been cythonized, for testing wheels.
* If all goes well, numba will probably be abandoned and all the heavy lifting will be done in Cython.


-----------------------------------------------------
0.3.1 (2020-12-28): Simplification of core algorithm
-----------------------------------------------------

* Attributes of the CountVectorizer have been reduced to the minimum: one dict!
* Now faster than its sklearn counterpart! (The reason being that only one case is considered here, so we can ditch a lot of checks and attributes.)


---------------------------------------------------
0.3.0 (2020-12-15): CountVectorizer and Process
---------------------------------------------------

* The core is now the CountVectorizer class. Lighter and faster. Only features are kept inside.
* New process module inspired by fuzzywuzzy!


---------------------------------
0.2.0 (2020-12-15): Fit/Transform
---------------------------------

* Full refactoring to make the package fit/transform compliant.
* Added a `fit_sampling` method that allows fitting only a (random) subset of factors.


---------------------------------
0.1.1 (2020-12-12): Upgrades
---------------------------------

* Docstrings added
* Common module (feat. save/load capabilities)
* Joint Complexity module

---------------------------------
0.1.0 (2020-12-12): First release
---------------------------------

* First release on PyPI.
* Core FactorTree class added.


