Natural Language Processing in Java using Apache OpenNLP | Document Categorizer | Simple example for beginners

In this article we will build a simple example of the document categorizer (classifier) feature of Natural Language Processing (NLP), a branch of Artificial Intelligence, using the Apache OpenNLP API in Java.

Example for document categorizer in this article

  • Train a model with samples of positive & negative review comments.
  • Do some tests with other review comments (input from console) & verify that reviews are categorized correctly as positive or negative.
  • Retrain the model with some additional samples to improve results.

Creating training data

As per the Javadoc of DocumentSampleStream, the sample format is as follows.

“Format:
Each line contains one sample document.
The category is the first string in the line followed by a tab and whitespace separated document tokens.
Sample line: category-string tab-char whitespace-separated-tokens line-break-char(s)”
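To make the format concrete, here is a minimal plain-Java sketch that splits one such line into its category and tokens. It only mirrors the format described above and is not how OpenNLP itself reads the file; the class name SampleLineSketch and the sample line are made up for illustration.

```java
import java.util.Arrays;

class SampleLineSketch {

    /** Returns the category: everything before the first tab. */
    public static String categoryOf(String line) {
        return line.substring(0, tabIndex(line));
    }

    /** Returns the document tokens: the text after the first tab, split on whitespace. */
    public static String[] tokensOf(String line) {
        return line.substring(tabIndex(line) + 1).trim().split("\\s+");
    }

    private static int tabIndex(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab between category and text: " + line);
        }
        return tab;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from the article's training file.
        String line = "positive\tI love this phone";
        // Prints: positive -> [I, love, this, phone]
        System.out.println(categoryOf(line) + " -> " + Arrays.toString(tokensOf(line)));
    }
}
```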

So let's create a sample data file for model training with a few positive and negative reviews.

So we have two categories, “positive” and “negative”, each followed by its sample sentences.


Let's train a model & categorize a few reviews

We will use DoccatFactory to train on the above samples and create a DoccatModel. We will use BagOfWordsFeatureGenerator to generate a feature for each word. This takes each word in a sample and counts how many times that word occurs in a given category. For example, in the “positive” category the word “love” occurs 2 times, so this count is used to weigh reviews during categorization.
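The counting idea can be sketched in plain Java as follows. This is a conceptual illustration only, not OpenNLP's implementation: the class name BagOfWordsSketch and the three sample lines are hypothetical, chosen so that “love” occurs twice in the “positive” category.

```java
import java.util.HashMap;
import java.util.Map;

class BagOfWordsSketch {

    /** Counts, per category, how often each whitespace-separated word occurs. */
    public static Map<String, Map<String, Integer>> countWords(String... lines) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t", 2); // category, then the document text
            if (parts.length < 2) {
                continue; // skip lines without a tab separator
            }
            Map<String, Integer> perCategory =
                    counts.computeIfAbsent(parts[0], k -> new HashMap<>());
            for (String word : parts[1].trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    perCategory.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical samples, not the article's training file.
        Map<String, Map<String, Integer>> counts = countWords(
                "positive\tI love this phone",
                "positive\tI love the camera",
                "negative\tI hate the battery");
        // Prints: love in positive: 2
        System.out.println("love in positive: " + counts.get("positive").get("love"));
        // Prints: love in negative: 0
        System.out.println("love in negative: "
                + counts.get("negative").getOrDefault("love", 0));
    }
}
```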

We have set CUT_OFF to zero for this example. The cutoff value excludes words whose counts are below the cutoff from being used as features. If the cutoff were more than 2, the word “love” might not be considered a feature and we might get wrong results. Generally, a cutoff is useful to avoid creating unnecessary features for words that rarely occur. In this example the word “I” appears 6 times, so if you change the cutoff to 7, not a single word qualifies as a feature and you will get Exception in thread "main" opennlp.tools.util.InsufficientTrainingDataException: Insufficient training data to create model.
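The effect of the cutoff can be illustrated with a small plain-Java sketch. It assumes the rule described above (a word is kept when its count is at least the cutoff) and is not OpenNLP's own code. The counts for “I” and “love” are the ones mentioned in this article; the entry for “camera” and the class name CutoffSketch are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class CutoffSketch {

    /** Keeps only the words whose count is at least the cutoff. */
    public static Map<String, Integer> applyCutoff(Map<String, Integer> counts, int cutoff) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("I", 6);      // count mentioned in the article
        counts.put("love", 2);   // count mentioned in the article
        counts.put("camera", 1); // hypothetical rare word

        System.out.println(applyCutoff(counts, 0)); // {I=6, love=2, camera=1}
        System.out.println(applyCutoff(counts, 3)); // {I=6}
        System.out.println(applyCutoff(counts, 7)); // {} : nothing left to train on
    }
}
```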



Here is the code for training & testing.
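A minimal sketch of such a program is shown below, assuming the opennlp-tools library (version 1.8 or later) is on the classpath. The class name ReviewCategorizerSketch, the file name reviews-train.txt, and the plain whitespace tokenization of console input are assumptions made for illustration; adjust them to your own setup.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

class ReviewCategorizerSketch {

    /** Trains a document categorizer from a file in the DocumentSampleStream format. */
    public static DoccatModel train(File trainingFile) throws IOException {
        TrainingParameters params = TrainingParameters.defaultParams();
        // Cutoff of zero, because the training file holds only a few samples.
        params.put(TrainingParameters.CUTOFF_PARAM, "0");

        // Treat every word of a sample as a feature.
        DoccatFactory factory = new DoccatFactory(
                new FeatureGenerator[] { new BagOfWordsFeatureGenerator() });

        // One sample per line: category, tab, whitespace-separated tokens.
        try (ObjectStream<DocumentSample> samples = new DocumentSampleStream(
                new PlainTextByLineStream(
                        new MarkableFileInputStreamFactory(trainingFile),
                        StandardCharsets.UTF_8))) {
            return DocumentCategorizerME.train("en", samples, params, factory);
        }
    }

    /** Returns the most likely category for one review comment. */
    public static String categorize(DocumentCategorizerME categorizer, String review) {
        String[] tokens = review.trim().split("\\s+"); // assumption: whitespace tokenization
        double[] outcomes = categorizer.categorize(tokens);
        return categorizer.getBestCategory(outcomes);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name; pass a different path as the first argument.
        File trainingFile = new File(args.length > 0 ? args[0] : "reviews-train.txt");
        DocumentCategorizerME categorizer = new DocumentCategorizerME(train(trainingFile));

        try (Scanner console = new Scanner(System.in)) {
            System.out.println("Enter a review comment (empty line to quit):");
            while (console.hasNextLine()) {
                String review = console.nextLine();
                if (review.trim().isEmpty()) {
                    break;
                }
                System.out.println("Category: " + categorize(categorizer, review));
            }
        }
    }
}
```

Retraining after adding samples then only requires editing the training file and running the program again, since the model is rebuilt from the file on every start.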




Output:

We entered three new review comments as input. As you can see in the console, the program categorized the first two correctly, but the third was categorized wrongly. The reason is that our training data did not contain any samples like it.


Retrain the model with new samples

Now we update our samples and retrain the model using the same program as above.

Let's test using the same program again. Now you can see that the new review comment is categorized correctly.

 



 
