Integration test
================

Here, we'll test the classifier using a sample of the Brown corpus. The Brown
corpus is a collection of POS-tagged English articles which are also
conveniently categorized. The test consists of training the classifier using 20
documents from each of the categories 'news', 'editorial' and 'hobbies'. Then
we'll ask the classifier to classify 5 more documents from each category and
see what happens.
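
The split described above can be sketched in plain Python. The helper name
``split_fileids`` and the made-up file ids are assumptions for illustration
only; in the test itself the ids come from ``brown.fileids(categories=...)``.

```python
def split_fileids(fileids_by_category, n_train=20, n_test=5):
    """Split each category's file ids into a training and a test slice."""
    train, test = {}, {}
    for category, fileids in fileids_by_category.items():
        # The first n_train ids are used for training ...
        train[category] = fileids[:n_train]
        # ... and the next n_test ids are held back for testing.
        test[category] = fileids[n_train:n_train + n_test]
    return train, test


# Hypothetical ids standing in for brown.fileids(categories=...).
sample = dict(
    (category, ['%s%02d' % (category, i) for i in range(1, 31)])
    for category in ('news', 'editorial', 'hobbies')
)
train, test = split_fileids(sample)
```

Slicing with ``[:20]`` and ``[20:25]`` keeps the two sets disjoint, so none of
the documents used for checking the classifier were seen during training.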

We can now add documents, starting with the first 20 documents in the Brown
corpus categorized as 'news'.

    >>> from nltk.corpus import brown
    >>> for articleid in brown.fileids(categories='news')[:20]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    title=articleid,
    ...                                    text=text,
    ...                                    subject='news')

Continuing with 20 documents categorized as 'editorial':

    >>> for articleid in brown.fileids(categories='editorial')[:20]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    title=articleid,
    ...                                    text=text,
    ...                                    subject='editorial')

And finally 20 documents categorized as 'hobbies':

    >>> for articleid in brown.fileids(categories='hobbies')[:20]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    title=articleid,
    ...                                    text=text,
    ...                                    subject='hobbies')

All these documents should have been parsed and indexed:

    >>> catalog = self.folder.portal_catalog
    >>> sr = catalog.searchResults(noun_terms='state')
    >>> len(sr) > 5
    True

Let's see what terms we get for the first 'editorial' document:

    >>> browser = self.getBrowser()
    >>> browser.open(self.folder.absolute_url()+'/cb01/@@terms')
    >>> browser.contents
    '...state...year...budget...war...danger...nuclear war...united states...'

Nuclear war and United States? Scary stuff... Time to train the classifier:

    >>> from zope.component import getUtility
    >>> from collective.classification.interfaces import IContentClassifier
    >>> classifier = getUtility(IContentClassifier)
    >>> classifier.train()
    >>> classifier.tags()
    ['editorial', 'hobbies', 'news']

For a start, the classifier should be pretty certain when asked about text
already classified:

    >>> browser.open(self.folder.absolute_url()+'/ca01/@@suggest-categories')
    >>> browser.contents
    '...news 100.0%...editorial 0.0%...hobbies 0.0%...'

So let's see where this gets us, by asking the classifier to categorize 5 more
documents from each category, for which we already know the right answer. The
documents are still added to Plone, but this time we call the classifier
directly instead of going through the *@@suggest-categories* view. 'News'
first:

    >>> classificationResult = []
    >>> for articleid in brown.fileids(categories='news')[20:25]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text)
    ...     uid = self.folder[id].UID()
    ...     classificationResult.append(classifier.classify(uid))
    >>> classificationResult
    ['news', 'news', 'news', 'news', 'news']

Let's see how we do with 'editorial':

    >>> classificationResult = []
    >>> for articleid in brown.fileids(categories='editorial')[20:25]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text)
    ...     uid = self.folder[id].UID()
    ...     classificationResult.append(classifier.classify(uid))
    >>> classificationResult
    ['editorial', 'editorial', 'editorial', 'editorial', 'editorial']

That's excellent! What about 'hobbies'?

    >>> classificationResult = []
    >>> for articleid in brown.fileids(categories='hobbies')[20:25]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text)
    ...     uid = self.folder[id].UID()
    ...     classificationResult.append(classifier.classify(uid))
    >>> classificationResult
    ['hobbies', 'hobbies', 'editorial', 'hobbies', 'hobbies']

Not so bad! Overall: we got 14/15 right...
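
The tally can be reproduced from the three result lists shown above. This is
only a sketch of the arithmetic; the variable names are made up for the
example and are not part of the package.

```python
# Known categories of the 15 held-back documents, in the order tested.
expected = ['news'] * 5 + ['editorial'] * 5 + ['hobbies'] * 5

# What the classifier answered, copied from the three results above.
predicted = (
    ['news'] * 5
    + ['editorial'] * 5
    + ['hobbies', 'hobbies', 'editorial', 'hobbies', 'hobbies']
)

# Count the positions where the prediction matches the known category.
correct = sum(1 for e, p in zip(expected, predicted) if e == p)
accuracy = correct / float(len(expected))

# Collect (expected, predicted) pairs for the misses.
misses = [(e, p) for e, p in zip(expected, predicted) if e != p]
```

The single miss is a 'hobbies' document that was classified as 'editorial'.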

Let's now go back to the first editorial item and see which documents are
similar to it, based on the terms we extracted:

    >>> browser.open(self.folder['cb01'].absolute_url()+'/@@similar-items')

The most similar item (with a Jaccard index of ~0.2) is cb15:

    >>> browser.contents
    '...cb15...0.212121212121...'

Let's see what their common terms are:

    >>> cb01terms = catalog.searchResults(
    ... UID=self.folder['cb01'].UID())[0].noun_terms[:20]
    >>> cb15terms = catalog.searchResults(
    ... UID=self.folder['cb15'].UID())[0].noun_terms[:20]
    >>> set(cb01terms).intersection(set(cb15terms))
    set(['development', 'state', 'planning', 'year', 'area'])

which is fine, since both documents talk about development and budget
planning...
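
For reference, the Jaccard index of two term sets is the size of their
intersection divided by the size of their union. The sketch below uses made-up
term lists and a hypothetical helper name; the view may compute its score over
a different selection of terms, so this is not presented as the package's
implementation.

```python
def jaccard_index(terms_a, terms_b):
    """Return |A & B| / |A | B| for two iterables of terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        # Two empty term sets: define the similarity as 0.
        return 0.0
    return len(a & b) / float(len(union))


# Made-up term lists, for illustration only.
doc_a = ['state', 'year', 'budget', 'planning']
doc_b = ['state', 'year', 'area', 'development', 'planning']

shared = sorted(set(doc_a) & set(doc_b))
score = jaccard_index(doc_a, doc_b)
```

Here three terms are shared out of six distinct terms, giving a score of 0.5.
A score of 1.0 means identical term sets, and 0.0 means no terms in common.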

What about stats? We can call the *@@classification-stats* view to find out...

    >>> self.setRoles('Manager')
    >>> browser.open(self.folder.absolute_url()+'/@@classification-stats')
    >>> browser.contents
    '...state...True...editorial:hobbies...5.0...'

This tells us that, if the word 'state' is present, the classifier gives odds
of 5 to 1 that the content is in the 'editorial' category rather than the
'hobbies' category.
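
One way to read such a figure is as a ratio of how often the term shows up in
each category. The sketch below assumes that interpretation, which is the kind
of ratio naive Bayes classifiers commonly report; the helper name and the
counts are made up, and real classifiers usually smooth the estimates.

```python
def likelihood_ratio(with_term_a, total_a, with_term_b, total_b):
    """Ratio of P(term | category a) to P(term | category b).

    Computed on the raw counts, without smoothing.
    """
    if with_term_b == 0 or total_a == 0:
        raise ValueError('ratio is undefined for these counts')
    # (with_term_a / total_a) / (with_term_b / total_b), rearranged so that
    # only one division is needed.
    return (with_term_a * total_b) / float(with_term_b * total_a)


# Hypothetical counts: 'state' appears in 10 of 20 'editorial' documents
# and in 2 of 20 'hobbies' documents.
ratio = likelihood_ratio(10, 20, 2, 20)
```

With these counts the term is five times as frequent in 'editorial' as in
'hobbies', which corresponds to the 5 to 1 reading above.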