In this article we will build a simple example of a document categorizer (classifier), a Natural Language Processing (NLP) feature, using the Apache OpenNLP API in Java.
In this document categorizer example we will:
- Train a model with samples of positive & negative review comments.
- Do some tests with other review comments (input from console) & verify that reviews are categorized correctly as positive or negative.
- Retrain the model with additional samples to improve results.
Creating training data
As per the javadoc of DocumentSampleStream, the sample format is this:
“Format:
Each line contains one sample document.
The category is the first string in the line followed by a tab and whitespace separated document tokens.
Sample line: category-string tab-char whitespace-separated-tokens line-break-char(s)”
So let's create a sample data file for model training with a few positive & negative reviews.
positive	I love this. I like this. I really love this product. We like this.
negative	I hate this. I dislike this. We absolutely hate this. I really hate this product.
So we have 2 categories, “positive” & “negative”, each followed by its sample sentences.
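To see how DocumentSampleStream turns such a line into a DocumentSample, here is a minimal sketch (assuming OpenNLP 1.8+; the class name SampleFormatDemo is just for illustration, and the inline string stands in for the file):

import java.util.Arrays;

import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.ObjectStreamUtils;

public class SampleFormatDemo {

	public static void main(String[] args) throws Exception {
		// One training line: category, tab character, whitespace-separated tokens.
		ObjectStream<String> lines = ObjectStreamUtils.createObjectStream("positive\tI love this");
		ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

		DocumentSample sample = samples.read();
		System.out.println(sample.getCategory());              // positive
		System.out.println(Arrays.toString(sample.getText())); // [I, love, this]
	}
}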
Let's train a model & categorize a few reviews
We will use DoccatFactory to train with the above samples & create a DoccatModel. We will use BagOfWordsFeatureGenerator to generate a feature for each word: it takes each word in a sample & counts how many times that word occurs in the given category. For example, in the “positive” category the word “love” occurs 2 times, so this count will be used to weigh reviews during categorization.
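To get a feel for what BagOfWordsFeatureGenerator produces per document, here is a minimal sketch (the “bow=” prefix shown in the comment is what recent OpenNLP versions emit; it may differ in older versions):

import java.util.Collection;
import java.util.Collections;

import opennlp.tools.doccat.BagOfWordsFeatureGenerator;

public class BagOfWordsDemo {

	public static void main(String[] args) {
		BagOfWordsFeatureGenerator generator = new BagOfWordsFeatureGenerator();

		// Each token of the document becomes one feature string.
		Collection<String> features = generator.extractFeatures(
				new String[] { "I", "love", "this" }, Collections.emptyMap());

		System.out.println(features); // e.g. [bow=I, bow=love, bow=this]
	}
}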
We have set CUT_OFF to zero for this example. The cutoff value is used to drop words as features when their counts are less than the cutoff. If the cutoff were more than 2, the word “love” might not be considered a feature & we might get wrong results. Generally the cutoff value is useful to avoid creating unnecessary features for words which occur rarely. In this example the word “I” appears 6 times, so if you change the cutoff to 7, not a single word qualifies as a feature & you will get
Exception in thread "main" opennlp.tools.util.InsufficientTrainingDataException: Insufficient training data to create model.
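To reproduce that error, only the cutoff parameter needs to change before training. A minimal sketch, reusing the sampleStream & factory variables from the full program below:

TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
// Features whose counts fall below the cutoff are dropped during indexing.
params.put(TrainingParameters.CUTOFF_PARAM, 7);
// With our tiny corpus no word occurs 7 times, so this line now throws
// InsufficientTrainingDataException instead of producing a model.
DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);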
Here is the code for training & testing.
package com.itsallbinary.nlp;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.model.ModelUtil;

public class OpenNLPDocumentCategorizerExample {

	public static void main(String[] args) throws Exception {

		/**
		 * Read human understandable data & train a model
		 */

		// Read file with classification samples of sentences.
		InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
				new File("documentcategorizer.txt"));
		ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
		ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

		// Use CUT_OFF as zero since we will use very few samples.
		// BagOfWordsFeatureGenerator will treat each word as a feature. Since we have
		// few samples, each feature/word will have small counts, so it won't meet a
		// high cutoff.
		TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
		params.put(TrainingParameters.CUTOFF_PARAM, 0);
		DoccatFactory factory = new DoccatFactory(new FeatureGenerator[] { new BagOfWordsFeatureGenerator() });

		// Train a model with classifications from above file.
		DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);

		// Serialize model to some file so that next time we don't have to train a
		// model again. Next time we can just load this file directly into a model.
		model.serialize(new File("documentcategorizer.bin"));

		/**
		 * Load model from serialized file & let's categorize reviews.
		 */

		// Load serialized trained model
		try (InputStream modelIn = new FileInputStream("documentcategorizer.bin");
				Scanner scanner = new Scanner(System.in);) {

			// Deserialize the model from the file instead of reusing the in-memory one.
			DoccatModel deserializedModel = new DoccatModel(modelIn);

			// Initialize document categorizer tool
			DocumentCategorizerME myCategorizer = new DocumentCategorizerME(deserializedModel);

			while (true) {
				// Get inputs in loop
				System.out.println("Enter a sentence:");

				// Get the probabilities of all outcomes i.e. positive & negative
				double[] probabilitiesOfOutcomes = myCategorizer.categorize(getTokens(scanner.nextLine()));

				// Get name of category which had the highest probability
				String category = myCategorizer.getBestCategory(probabilitiesOfOutcomes);
				System.out.println("Category: " + category);
			}

		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	/**
	 * Tokenize sentence into tokens.
	 *
	 * @param sentence
	 * @return
	 */
	private static String[] getTokens(String sentence) {

		// Use model that was created in earlier tokenizer tutorial
		try (InputStream modelIn = new FileInputStream("tokenizermodel.bin")) {

			TokenizerME myTokenizer = new TokenizerME(new TokenizerModel(modelIn));
			String[] tokens = myTokenizer.tokenize(sentence);
			for (String t : tokens) {
				System.out.println("Tokens: " + t);
			}
			return tokens;

		} catch (Exception e) {
			e.printStackTrace();
		}
		return null;
	}
}
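If you want to see the score of every category rather than just the winning one, DocumentCategorizerME also exposes the category list. A minimal sketch, reusing the myCategorizer & probabilitiesOfOutcomes variables from the program above:

// Print each category with its probability, not just the best one.
for (int i = 0; i < myCategorizer.getNumberOfCategories(); i++) {
	System.out.println(myCategorizer.getCategory(i) + " -> " + probabilitiesOfOutcomes[i]);
}

Also note that getTokens() reuses the tokenizer model built in the earlier tokenizer tutorial. If you don't have tokenizermodel.bin, WhitespaceTokenizer.INSTANCE.tokenize(sentence) from opennlp.tools.tokenize is a simple drop-in alternative for this example.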
Output:
Indexing events with TwoPass using cutoff of 0
Computing event counts...  done. 2 events
Indexing...  done.
Sorting and merging events... done. Reduced 2 events to 2.
Done indexing in 0.22 s.
Incorporating indexed data for training...  done.
	Number of Event Tokens: 2
	    Number of Outcomes: 2
	  Number of Predicates: 11
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-1.3862943611198906	0.5
...<skipped>...
100:  ... loglikelihood=-0.0666841163960054	1.0
Enter a sentence:
They like this
Tokens: They
Tokens: like
Tokens: this
Category: positive
Enter a sentence:
They hate this
Tokens: They
Tokens: hate
Tokens: this
Category: negative
Enter a sentence:
I think this is bad
Tokens: I
Tokens: think
Tokens: this
Tokens: is
Tokens: bad
Category: positive
Enter a sentence:
We gave 3 review comments as input. As you can see in the console, the program categorized the first 2 correctly, but the 3rd review comment was categorized wrongly. The reason is that we did not have any such samples in our training data.
Retrain the model with new samples
Now we update our samples & retrain the model using the same program as above.
positive	I love this. I like this. I really love this product. We like this.
negative	I hate this. I dislike this. We absolutely hate this. I really hate this product. I think this is bad. This is bad.
Let's test using the same program again. Now you can see that the new review comment is categorized correctly.
Indexing events with TwoPass using cutoff of 0
Computing event counts...  done. 2 events
Indexing...  done.
Sorting and merging events... done. Reduced 2 events to 2.
Done indexing in 0.26 s.
Incorporating indexed data for training...  done.
	Number of Event Tokens: 2
	    Number of Outcomes: 2
	  Number of Predicates: 15
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-1.3862943611198906	0.5
...<skipped>...
100:  ... loglikelihood=-0.06276918954777459	1.0
Enter a sentence:
They think this is bad
Tokens: They
Tokens: think
Tokens: this
Tokens: is
Tokens: bad
Category: negative
Enter a sentence:
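Since the program serializes the trained model to documentcategorizer.bin, a later run can skip training entirely & just load the model from disk. A minimal sketch using the same file name as the program above:

// Load the previously serialized model instead of retraining.
try (InputStream in = new FileInputStream("documentcategorizer.bin")) {
	DoccatModel model = new DoccatModel(in);
	DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);

	double[] probabilities = myCategorizer.categorize(new String[] { "We", "love", "this" });
	System.out.println("Category: " + myCategorizer.getBestCategory(probabilities));
}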