================
Integration test
================

Here, we'll test the classifier using a sample of the Brown corpus. The Brown corpus has a list of POS-tagged english articles which are also conveniently categorized. The test consists of training the classifier using 20 documents from each of the categories 'news','editorial' and 'hobbies'. Then we'll ask the classifier to classify 5 more documents from each category and see what happens.

Let us setup the tagger first. You can do this by means of using the control panel, here we do it manually. First we fetch the N-gram tagger.

    >>> from zope.component import getUtility
    >>> from collective.classification.interfaces import IPOSTagger
    >>> tagger = getUtility(IPOSTagger,
    ...     name="collective.classification.taggers.NgramTagger")
    >>> tagger
    <collective.classification.nltkutilities.tagger.NgramTagger ...>

Now we can train the tagger with the tagged sentences from the Brown corpus corresponding to the categories we have chosen:

    >>> from nltk.corpus import brown
    >>> tagged_sents =  brown.tagged_sents(
    ...     categories=['news','editorial','hobbies'])
    >>> tagged_sents
    [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...],...]
    >>> tagger.train(tagged_sents)

Let's assign the tagger to the NP-extractor and assign the extractor to the storage:

    >>> from collective.classification.classifiers.npextractor import NPExtractor
    >>> extractor = NPExtractor(tagger=tagger)
    >>> from collective.classification.interfaces import INounPhraseStorage
    >>> npstorage = getUtility(INounPhraseStorage)
    >>> npstorage.extractor = extractor

We can now start adding documents, starting with the first 20 documents in Brown categorized as 'news'.

    >>> for articleid in brown.fileids(categories='news')[:20]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text,
    ...                                    subject='news')

Continuing with 20 documents categorized as 'editorial':

    >>> for articleid in brown.fileids(categories='editorial')[:20]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text,
    ...                                    subject='editorial')

And finally 20 documents categorized as 'hobbies':

    >>> for articleid in brown.fileids(categories='hobbies')[:20]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text,
    ...                                    subject='hobbies')

Time to train the classifier:
    >>> from collective.classification.interfaces import IContentClassifier
    >>> classifier = getUtility(IContentClassifier)
    >>> classifier.train()
    >>> classifier.tags()
    ['editorial', 'hobbies', 'news']

For a start, the classifier should be pretty certain when asked about text already classified:

    >>> browser = self.getBrowser()
    >>> browser.open(self.folder.absolute_url()+'/ca01/@@subjectsuggest')
    >>> browser.contents
    '...news 100.0%...editorial 0.0%...hobbies 0.0%...'

So let's see where this gets us, by asking the classifier to categorize 5 more documents for which we know the category. We will use the classifier's functions this time instead of adding the documents to plone. 'News' first:

    >>> classificationResult = []
    >>> for articleid in brown.fileids(categories='news')[20:25]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text)
    ...     uid = self.folder[id].UID()
    ...     classificationResult.append(classifier.classify(uid))
    >>> classificationResult
    ['news', 'news', 'news', 'news', 'news']

Let's see how we do with 'editorials'  

    >>> classificationResult = []
    >>> for articleid in brown.fileids(categories='editorial')[20:25]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text)
    ...     uid = self.folder[id].UID()
    ...     classificationResult.append(classifier.classify(uid))
    >>> classificationResult
    ['editorial', 'editorial', 'editorial', 'editorial', 'editorial']

That's excellent! What about 'hobbies'?

    >>> classificationResult = []
    >>> for articleid in brown.fileids(categories='hobbies')[20:25]:
    ...     text = " ".join(brown.words(articleid))
    ...     id = self.folder.invokeFactory('Document',articleid,
    ...                                    text=text)
    ...     uid = self.folder[id].UID()
    ...     classificationResult.append(classifier.classify(uid))
    >>> classificationResult
    ['hobbies', 'editorial', 'hobbies', 'hobbies', 'editorial']

Not so good, we missed 2 out of 5. Overall: we got 13/15 right...



