Metadata-Version: 1.0
Name: collective.classification
Version: 0.1b1
Summary: Content classification/clustering through language processing
Home-page: http://github.org/ggozad/collective.classification
Author: Yiorgis Gozadinos
Author-email: ggozad@jarn.com
License: GPL
Download-URL: http://pypi.python.org/pypi/collective.classification/
Description: Introduction
        ============
        
        *collective.classification* aims to provide a set of tools for automatic
        document classification. Currently it makes use of the
        `Natural Language Toolkit`_ and features a trainable document classifier based
        on Part Of Speech (POS) tagging, heavily influenced by `topia.termextract`_.
        This product is intended mostly for experimentation and development.
        Currently English and Dutch are supported.
        
        .. _`Natural Language Toolkit`: http://www.nltk.org
        .. _`topia.termextract`: http://pypi.python.org/pypi/topia.termextract/
        
        What is this all about?
        =======================
        
        It's mostly about having fun! The package is at a very early experimental
        stage and eagerly awaits contributions. You will get a good understanding of
        what works and what does not by looking at the tests. You might also be able
        to do some useful things with it:
        
        1) Term extraction can be performed to provide quick insight into what a
           document is about.
        2) On a large site with a lot of content and tags (or subjects in
           Plone lingo) it might be difficult to assign tags to new content. In this
           case, a trained classifier could provide useful suggestions to an editor
           responsible for tagging content.
        3) Clustering can help you organize unclassified content into groups.
        
        How does it work?
        =================
        
        At the moment the following types of utilities exist:
        
        * *POS taggers*, utilities for classifying words in a document
          as `Parts Of Speech`_. Two are provided at the moment, a Penn TreeBank
          tagger and a trigram tagger. Both can be trained for languages other
          than English, which is what we do here.
        * *Term extractors*, utilities responsible for extracting the important
          terms from a document. The extractor we use here assumes that only
          nouns matter in a document and uses a POS tagger to find the ones used
          most often. For details please look at the code and the tests.
        * *Content classifiers*, utilities that can assign content to predefined
          categories. Here, a `naive Bayes`_ classifier is used. Basically, the
          classifier looks at already tagged content, performs term extraction and
          trains itself using the terms and tags as input. Then, for new content,
          the classifier suggests tags according to the terms extracted from
          that content.
        * *Clusterers*, utilities that can group content according to feature
          similarity without prior knowledge of its classification. At the
          moment NLTK's `k-means`_ clusterer is used.
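        
        The term extraction and classification steps described above can be
        sketched in plain Python. This is only an illustration of the idea, not the
        package's code: it does not use NLTK, it assumes the input is already
        POS-tagged with Penn TreeBank style tags (nouns start with ``NN``), and the
        names ``extract_terms`` and ``TagSuggester`` are hypothetical.
        
        ```python
        import math
        from collections import Counter, defaultdict
        
        
        def extract_terms(tagged_words, min_count=1):
            """Return the nouns among (word, tag) pairs, most frequent first."""
            nouns = Counter(
                word.lower() for word, tag in tagged_words if tag.startswith("NN")
            )
            return [term for term, count in nouns.most_common() if count >= min_count]
        
        
        class TagSuggester:
            """Toy naive Bayes classifier over extracted terms (hypothetical)."""
        
            def __init__(self):
                self.doc_counts = Counter()              # tag -> number of documents
                self.term_counts = defaultdict(Counter)  # tag -> term -> occurrences
                self.vocabulary = set()
        
            def train(self, terms, tags):
                """Learn from one already tagged document."""
                for tag in tags:
                    self.doc_counts[tag] += 1
                    self.term_counts[tag].update(terms)
                self.vocabulary.update(terms)
        
            def suggest(self, terms):
                """Return the known tags, most likely first."""
                total_docs = sum(self.doc_counts.values())
                scores = {}
                for tag, n_docs in self.doc_counts.items():
                    counts = self.term_counts[tag]
                    # Add-one smoothing; the extra 1 leaves room for unseen terms.
                    denominator = sum(counts.values()) + len(self.vocabulary) + 1
                    score = math.log(n_docs / total_docs)
                    for term in terms:
                        score += math.log((counts[term] + 1) / denominator)
                    scores[tag] = score
                return sorted(scores, key=scores.get, reverse=True)
        
        
        suggester = TagSuggester()
        suggester.train(["goal", "match", "team"], ["sports"])
        suggester.train(["election", "vote", "party"], ["politics"])
        
        tagged = [("The", "DT"), ("team", "NN"), ("won", "VBD"),
                  ("the", "DT"), ("match", "NN")]
        terms = extract_terms(tagged)
        print(terms)
        print(suggester.suggest(terms))
        ```
        
        Working in log space avoids numeric underflow when a document has many
        terms, and the smoothing keeps a single unseen term from ruling out a tag.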
        
        
        .. _`Parts Of Speech`: http://en.wikipedia.org/wiki/Part-of-speech_tagging
        .. _`naive Bayes`: http://en.wikipedia.org/wiki/Naive_Bayes_classifier
        .. _`k-means`: http://en.wikipedia.org/wiki/K-means_clustering
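        
        To give a feeling for what a k-means clusterer does, here is a minimal
        plain-Python sketch. The package relies on NLTK's clusterer; the function
        below is a hypothetical stand-in, and the term-count vectors are made-up
        sample data.
        
        ```python
        import random
        
        
        def squared_distance(a, b):
            """Squared Euclidean distance between two equal-length vectors."""
            return sum((x - y) ** 2 for x, y in zip(a, b))
        
        
        def mean(vectors):
            """Component-wise mean of a non-empty list of vectors."""
            return [sum(column) / len(vectors) for column in zip(*vectors)]
        
        
        def kmeans(vectors, k, iterations=20, seed=None):
            """Group vectors into k clusters; return one cluster index per vector."""
            rng = random.Random(seed)
            centroids = rng.sample(vectors, k)  # random start, so results can vary
            assignment = [0] * len(vectors)
            for _ in range(iterations):
                # Assign every vector to its nearest centroid.
                assignment = [
                    min(range(k), key=lambda i: squared_distance(v, centroids[i]))
                    for v in vectors
                ]
                # Move every centroid to the mean of its members.
                for i in range(k):
                    members = [v for v, a in zip(vectors, assignment) if a == i]
                    if members:  # keep the old centroid if a cluster is empty
                        centroids[i] = mean(members)
            return assignment
        
        
        # Rows are documents, columns are counts of the terms "goal" and "vote".
        documents = [[3, 0], [4, 1], [0, 5], [1, 4]]
        labels = kmeans(documents, k=2, seed=0)
        print(labels)
        ```
        
        The random choice of starting centroids is why clustering results are not
        deterministic unless a seed is fixed.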
        
        Installation & Setup
        ====================
        
        Before running buildout, make sure you have YAML and its Python bindings
        installed (use MacPorts on OS X, or your package manager on Linux). If NLTK
        is packaged for your OS you might as well install that too; otherwise it
        will be fetched when you run buildout.
        
        To get started, simply add the package to the "eggs" section of your
        buildout and run buildout. Then restart your Plone instance and install the
        "collective.classification" package using the quick-installer or via the
        "Add-on Products" section in "Site Setup".
        
        **WARNING: Upon first installation, linguistic data will be fetched from
        NLTK's repository and stored locally on your filesystem. It's not big
        (about 400 KB), but the Plone user needs access to its "home". Running the
        tests will also fetch more data from NLTK, bringing the total to about
        225 MB, so this is not for those short on disk space.**
        
        How do I use it?
        ================
        
        * For a parsed document you can call the terms view to display the identified
          terms (just append *@@terms* to the URL of the content to call the view).
        * To use the classifier and get suggested tags for some content,
          call *@@suggest-categories* on the content, that is, append
          *@@suggest-categories* to its URL in your browser. A form will
          come up with suggestions; choose the ones that seem appropriate and apply
          them. You need permission to edit the document in order to call the
          view.
        * For clustering, just call the *@@clusterize* view from anywhere.
          The result is not deterministic, but hopefully helpful ;). You need Manager
          rights for this, so that your users cannot DoS your site!
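        
        Since all three views are reached by appending their name to a URL, the
        addresses can also be built in a script. The helper below is a hypothetical
        convenience, not part of the package, and the site address and content id
        are assumptions for a local test instance.
        
        ```python
        def view_url(content_url, view_name):
            """Append a browser view such as ``terms`` to a content URL."""
            return "{}/@@{}".format(content_url.rstrip("/"), view_name.lstrip("@"))
        
        
        site = "http://localhost:8080/Plone"  # assumed address of a local site
        print(view_url(site + "/front-page", "terms"))
        print(view_url(site + "/front-page", "suggest-categories"))
        print(view_url(site, "clusterize"))
        ```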
        
        
        Changelog
        =========
        0.1b1
        -------------------
        - Speed gain by not utilizing the Penn TreeBank tagger anymore. [ggozad]
        - Added multi-lingual support, starting with Dutch! [ggozad]
        - No need to download all the corpora anymore. [ggozad]
        - A lot of refactoring. Things got moved around and a lot of unnecessary code
          was removed. [ggozad]
        - We now use a Brill/Trigram/Affix tagger that is pre-trained. This
          allows collective.classification to ship without all the corpora. The user
          can still supply a different tagger if necessary. [ggozad]
        - The default NLTK Penn TreeBank tagger is no longer used. Too slow. [ggozad]
        - npextractor is no longer a local persistent utility. Opted for a global
          non-persisted object. [ggozad]
        - zope.lifecycle events are now used. [ggozad]
        - Gained compatibility with Plone 4. [ggozad]
        
        0.1a3
        -------------------
        - Introduced IClassifiable interface. ATContentTypes are now adapted to it,
          and it should be easier to add other non-AT content types or customize the
          adapter. [ggozad]
        - Handling of IObjectRemovedEvent event. [ggozad]
        - Added a form to import sample content from the Brown corpus, for debugging
          and testing. [ggozad]
        - Added some statistics information with @@classification-stats.
          Displays the number of parsed documents, as well as the most useful terms.
          [ggozad]
        - Added @@terms view, allowing a user to inspect the identified terms for some
          content. [ggozad]
        - Have to specify corpus categories when training n-gram tagger. Fixes #3 [ggozad]
        
        0.1a2
        -------------------
        - Made control panel more sane. Fixes #1. [ggozad]
        - NP-extractor has become a local persistent utility. [ggozad]
        - Renamed @@subjectsuggest to @@suggest-categories. Fixes #2. [ggozad]
        - "memoized" term extractor. [ggozad]
        - Added friendly types to the control panel. [ggozad]
        - Updated documentation and dependencies to warn about yaml. [ggozad]
        
        0.1a1
        -------------------
        
        - First public release. [ggozad]
        
Keywords: term-extract,semantic,classification,Parts-Of-Speech,tagging,plone
Platform: Any
Classifier: Environment :: Web Environment
Classifier: Framework :: Plone
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Text Processing :: Indexing
