Natural Language Processing in Java using Apache OpenNLP | String tokenizer | Simple example for beginners

In earlier article went through ‘Hello’ example of language detection feature of  ‘Natural Language Processing‘ (NLP) aspect of ‘Artificial Intelligence‘ using Apache OpenNLP API in Java.

Natural Language Processing in Java using Apache OpenNLP | Language Detection | Simple Hello example for beginners

In this article, we will go through simple example for string or text tokenization using Apache OpenNLP.

Example for tokenization in this article

  • Train a model to tokenize text based on space, comma.
  • Then we will do some tests to tokenize few sentences & verify that words & punctuation marks are split correctly.
  • We will then enhance training to tokenize java class package names using dot “.” delimiter.

Creating training data

Javadoc API documentation for TokenSampleStream explains the format it expects for training model for tokenization.

Sample:
“token1 token2 token3<SPLIT>token4”
The tokens token1 and token2 are separated by a whitespace, token3 and token3 are separated by the special character sequence, in this case the default split sequence.

So lets create a simple file with few lines for space & comma separated samples. By using <SPLIT> tag, you are specifically telling model that split comma from the word preceding it. This way comma will be treated as separate token from the word preceding it i.e. instead of “you,” as one token, it will create “you” & “,” as separate tokens.



Lets tokenize

Now using above training data, below example –

  1. Trains a model
  2. Serializes for further use
  3. Takes few sentence inputs from console & tokenizes to test out.




Testing Output:




As you can see our training worked out fine & it split/tokenized input sentences based on space & commas. But as you can see in third input, the package name “com.itsallbinary.nlp” did not split using dots. Lets say we want to train the model to split individual words from package name using dot “.” as delimiter, then we can re-train model. We just need to add few examples of few packages with <SPLIT> tags as shown below.

Now we rerun same program. Now you can see that package name is split in separate tokens.



Further Reading

Natural Language Processing in Java using Apache OpenNLP | Document Categorizer | Simple example for beginners

Leave a Reply

Your email address will not be published. Required fields are marked *