In an earlier article, we went through a ‘Hello World’ example of the language detection feature of ‘Natural Language Processing‘ (NLP), an aspect of ‘Artificial Intelligence‘, using the Apache OpenNLP API in Java.
In this article, we will go through a simple example of string/text tokenization using Apache OpenNLP.
Example for tokenization in this article:
- Train a model to tokenize text based on spaces & commas.
- Then run a few tests to tokenize some sentences & verify that words & punctuation marks are split correctly.
- Then enhance the training to tokenize Java package names using the dot “.” delimiter.
Creating training data
The Javadoc API documentation for TokenSampleStream explains the format it expects for training a tokenization model:
“Sample:
“token1 token2 token3<SPLIT>token4”
The tokens token1 and token2 are separated by a whitespace, token3 and token4 are separated by the special character sequence, in this case the default split sequence.“
So let's create a simple file with a few lines of space & comma separated samples. By using the <SPLIT> tag, you are specifically telling the model to split the comma from the word preceding it. This way the comma will be treated as a separate token from the word preceding it, i.e. instead of “you,” as one token, it will create “you” & “,” as separate tokens.
This is one example of tokenizer.
I<SPLIT>, you<SPLIT>, everyone can tokenize.
Triangle<SPLIT>, rectangle<SPLIT>, circle<SPLIT>, line are shapes.
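As a side check, you can see how OpenNLP interprets one such line by parsing it with TokenSample directly. Below is a minimal sketch, assuming the opennlp-tools jar is on the classpath (the class name TokenSampleParseCheck is just for illustration):

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.util.Span;

public class TokenSampleParseCheck {

    public static void main(String[] args) {
        // Parse one training line using the default "<SPLIT>" separator.
        TokenSample sample = TokenSample.parse("I<SPLIT>, you<SPLIT>, everyone can tokenize.",
                TokenSample.DEFAULT_SEPARATOR_CHARS);

        // The sample's text has the <SPLIT> tags removed.
        System.out.println(sample.getText());

        // Each token is kept as a character span over that cleaned-up text.
        for (Span span : sample.getTokenSpans()) {
            System.out.println(span + " -> " + span.getCoveredText(sample.getText()));
        }
    }
}

So the <SPLIT> tags never appear in the trained model's output; they only mark, in the training data, where tokens touch without whitespace between them.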
Let's tokenize
Now, using the above training data, the example below:
- Trains a model.
- Serializes the model to a file for further use.
- Takes a few sentences as input from the console & tokenizes them to test the model.
package com.itsallbinary.nlp;

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class OpnNLPTokenizerExample {

    public static void main(String[] args) throws Exception {

        /*
         * Read human-understandable data & train a model.
         */

        // Read the file with examples of tokenization.
        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("tokenizerdata.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
        ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

        // Train a model from the file read above.
        TokenizerFactory factory = new TokenizerFactory("en", null, false, null);
        TokenizerModel model = TokenizerME.train(sampleStream, factory, TrainingParameters.defaultParams());

        // Serialize the model to a file so that next time we don't have to train a
        // model again. Next time we can just load this file directly into a model.
        model.serialize(new File("tokenizermodel.bin"));

        /*
         * Let's tokenize.
         */
        try (Scanner scanner = new Scanner(System.in)) {

            while (true) {
                // Get inputs in a loop.
                System.out.println("Enter a sentence to tokenize:");

                // Initialize the tokenizer with the trained model.
                TokenizerME myCategorizer = new TokenizerME(model);

                // Tokenize the sentence.
                String[] tokens = myCategorizer.tokenize(scanner.nextLine());
                for (String t : tokens) {
                    System.out.println("Tokens: " + t);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
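Since the program serializes the trained model to tokenizermodel.bin, subsequent runs can skip training entirely and load that file instead. A minimal sketch of how that could look (the class name is just for illustration):

import java.io.File;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class LoadTokenizerModelExample {

    public static void main(String[] args) throws Exception {
        // Load the previously serialized model instead of re-training it.
        TokenizerModel model = new TokenizerModel(new File("tokenizermodel.bin"));

        // Tokenize with the loaded model, exactly as in the program above.
        TokenizerME tokenizer = new TokenizerME(model);
        for (String token : tokenizer.tokenize("I like mango, apple, banana")) {
            System.out.println("Tokens: " + token);
        }
    }
}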
Testing Output:
Indexing events with TwoPass using cutoff of 5

Computing event counts...  done. 77 events
Indexing...  done.
Sorting and merging events... done. Reduced 77 events to 59.
Done indexing in 0.23 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 59
    Number of Outcomes: 2
  Number of Predicates: 27
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-53.37233290311578  0.935064935064935
...<skipped>...
100:  ... loglikelihood=-0.8039089303840066  1.0
Enter a sentence to tokenize:
I like this
Tokens: I
Tokens: like
Tokens: this
Enter a sentence to tokenize:
I like mango, apple, banana
Tokens: I
Tokens: like
Tokens: mango
Tokens: ,
Tokens: apple
Tokens: ,
Tokens: banana
Enter a sentence to tokenize:
package com.itsallbinary.nlp
Tokens: package
Tokens: com.itsallbinary.nlp
Enter a sentence to tokenize:
As you can see, our training worked out fine & the model split/tokenized the input sentences based on spaces & commas. But as the third input shows, the package name “com.itsallbinary.nlp” was not split on dots. Let's say we want the model to split the individual words of a package name using the dot “.” as a delimiter; then we can re-train the model. We just need to add a few examples of packages with <SPLIT> tags, as shown below.
This is one example of tokenizer.
I<SPLIT>, you<SPLIT>, everyone can tokenize.
Triangle<SPLIT>, rectangle<SPLIT>, circle<SPLIT>, line are shapes.
opennlp<SPLIT>.<SPLIT>tools<SPLIT>.<SPLIT>tokenize
java<SPLIT>.<SPLIT>util<SPLIT>.<SPLIT>ArrayList
java<SPLIT>.<SPLIT>lang<SPLIT>.<SPLIT>Object
Now we rerun the same program. You can see that the package name is now split into separate tokens.
Indexing events with TwoPass using cutoff of 5

Computing event counts...  done. 131 events
Indexing...  done.
Sorting and merging events... done. Reduced 131 events to 106.
Done indexing in 0.24 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 106
    Number of Outcomes: 2
  Number of Predicates: 52
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-90.8022806533529  0.8702290076335878
...<skipped>...
100:  ... loglikelihood=-2.5758707433110195  1.0
Enter a sentence to tokenize:
package com.itsallbinary.nlp
Tokens: package
Tokens: com
Tokens: .
Tokens: itsallbinary
Tokens: .
Tokens: nlp
Enter a sentence to tokenize:
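If you want to check how confident the re-trained model is about each of these splits, TokenizerME exposes per-token probabilities for the most recent tokenize call. A short sketch, re-using the serialized model file from above (the class name is just for illustration):

import java.io.File;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

public class TokenProbabilitiesExample {

    public static void main(String[] args) throws Exception {
        TokenizerModel model = new TokenizerModel(new File("tokenizermodel.bin"));
        TokenizerME tokenizer = new TokenizerME(model);

        // getTokenProbabilities() returns probabilities for the tokens
        // produced by the last tokenize(...) call, in the same order.
        String[] tokens = tokenizer.tokenize("package com.itsallbinary.nlp");
        double[] probs = tokenizer.getTokenProbabilities();

        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + probs[i]);
        }
    }
}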