Natural Language Processing in Java using Apache OpenNLP | Language Detection | Simple Hello example for beginners

In this article we will cover the basics of the ‘Natural Language Processing’ (NLP) aspect of ‘Artificial Intelligence’ using the Apache OpenNLP API in Java. We will focus on the ‘Language detection’ feature of NLP to detect the language of a simple greeting.

Example for NLP language detection in this article

  • Train a model to understand simple greetings in different languages like “hello”, “hola”, “namaste” etc.
  • Then we will do some tests to detect language.
  • We will then enhance training to adjust results to our expectations.

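Our example uses the Apache OpenNLP library. In Maven, it is available as the opennlp-tools artifact under the org.apache.opennlp group; the language detector is part of version 1.8.3 and later.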

Concepts

  • Model
    • A ‘Model’ is a set of learnings which are acquired by a process called ‘training’.
    • In Apache OpenNLP, a class like LanguageDetectorModel represents the model for the language detection feature of NLP.
  • Learnable tool
    • Learnable tools are the ones which take the learnings from a model & then use them to produce outputs for the inputs that you give.
    • In Apache OpenNLP, a class like LanguageDetectorME represents a tool which actually detects the language of given inputs.



Create training samples data file for “hello” in many languages

Our first objective is to create a model & train it to understand human greetings in different languages like “hello”, “hola”, “namaste” etc. We can search online for ‘hello in 100 languages’ & use any of the results for training. Here is the one which we will use for our example.

To put this data into a file & feed it to the model, we need to use a stream class, i.e. LanguageDetectorSampleStream. As per the API documentation, it expects a file in the following format:

Format:
Each line contains one sample document.
The language is the first string in the line followed by a tab and the document content.
Sample line: category-string tab-char document line-break-char(s)

So let’s create a file in this format from the data that we found online, with one greeting per line, each preceded by its language name & a tab character.

Ideally ISO-639-3 language codes should be used for language names but since we are just trying to understand, we will stick to what we got online.
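As a rough sketch of how such a file could be produced, the plain Java snippet below writes a few greetings in the "language, tab, text" format described above. The class name GreetingSamplesWriter, the file name greetings.txt & the three language labels are only assumptions for illustration; they are not the exact data used in this article.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: writes greetings as "language<TAB>text" lines.
class GreetingSamplesWriter {

    // Builds one sample line: the language label, a tab, then the document text.
    static String toSampleLine(String language, String text) {
        if (language.contains("\t") || text.contains("\t") || text.contains("\n")) {
            throw new IllegalArgumentException("Tabs or line breaks would corrupt the format");
        }
        return language + "\t" + text;
    }

    // Writes every greeting of every language as its own line, encoded as UTF-8.
    static void write(Path target, Map<String, List<String>> greetingsByLanguage) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : greetingsByLanguage.entrySet()) {
            for (String greeting : entry.getValue()) {
                lines.add(toSampleLine(entry.getKey(), greeting));
            }
        }
        Files.write(target, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Illustrative labels only; ISO-639-3 codes would be the better choice.
        Map<String, List<String>> greetings = new LinkedHashMap<>();
        greetings.put("english", Arrays.asList("hello"));
        greetings.put("spanish", Arrays.asList("hola"));
        greetings.put("hindi", Arrays.asList("namaste"));
        write(Paths.get("greetings.txt"), greetings);
    }
}
```

A real training file would of course contain many more greetings per language; the more samples each language has, the more the model has to learn from.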


Let’s train a model to understand ‘Hello’

Now we will train a LanguageDetectorModel using the Apache OpenNLP API. After training, we will serialize the model to a file so that we don’t have to train again on every run. We can just reuse the existing training.
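The training & serialization step described above could look roughly like the sketch below. It needs opennlp-tools on the classpath & Java 8 or later. The class name GreetingModelTrainer & the file names greetings.txt & greetings-model.bin are hypothetical. Setting the cutoff to 0 is also an assumption on our side: by default, features seen fewer than five times are ignored, which may be too strict for a tiny training file.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import opennlp.tools.langdetect.LanguageDetectorFactory;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;
import opennlp.tools.langdetect.LanguageDetectorSampleStream;
import opennlp.tools.langdetect.LanguageSample;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.model.ModelUtil;

// Hypothetical trainer: reads "language<TAB>text" lines and trains a language detector.
class GreetingModelTrainer {

    static LanguageDetectorModel train(File samplesFile) throws IOException {
        InputStreamFactory input = new MarkableFileInputStreamFactory(samplesFile);

        // Each line of the file becomes one LanguageSample (language label + text).
        try (ObjectStream<LanguageSample> samples = new LanguageDetectorSampleStream(
                new PlainTextByLineStream(input, StandardCharsets.UTF_8))) {

            TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
            // Assumption: keep even rare features, because our data set is very small.
            params.put(TrainingParameters.CUTOFF_PARAM, 0);

            return LanguageDetectorME.train(samples, params, new LanguageDetectorFactory());
        }
    }

    public static void main(String[] args) throws IOException {
        LanguageDetectorModel model = train(new File("greetings.txt"));
        // Serialize the trained model so that later runs can load it instead of retraining.
        model.serialize(new File("greetings-model.bin"));
    }
}
```

Keeping train() separate from main() makes it easy to retrain with another samples file later, for example after adding more greetings.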

Let’s use & test the trained model

Now that our model is ready & serialized, we can test it using LanguageDetectorME. We will read inputs from the console in a loop so that we can do multiple tests.
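A minimal sketch of such a console test is shown below. It assumes that a serialized model exists in a file called greetings-model.bin (a hypothetical name) & that opennlp-tools is on the classpath. The class name GreetingLanguageTester & the output format are our own choices.

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

// Hypothetical console tester for a previously serialized language detection model.
class GreetingLanguageTester {

    // Returns the most probable language for the text together with its confidence.
    static String describe(LanguageDetector detector, String text) {
        Language best = detector.predictLanguage(text);
        return String.format("%s -> %s (confidence %.3f)", text, best.getLang(), best.getConfidence());
    }

    public static void main(String[] args) throws IOException {
        // Load the model from disk; no retraining is needed here.
        LanguageDetectorModel model = new LanguageDetectorModel(new File("greetings-model.bin"));
        LanguageDetector detector = new LanguageDetectorME(model);

        try (Scanner console = new Scanner(System.in, "UTF-8")) {
            System.out.println("Type a greeting (empty line to quit):");
            while (console.hasNextLine()) {
                String line = console.nextLine().trim();
                if (line.isEmpty()) {
                    break;
                }
                System.out.println(describe(detector, line));
            }
        }
    }
}
```

The detector always returns its best guess, even for words it has never seen, so the confidence value is worth printing next to the language.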

Output

In our test run, the language detector correctly identified “hello”, “hola” & “namaste” with good confidence. But when we gave it an unknown word, “howdy”, it still returned the most probable language it could find, which was not the correct one. The confidence level was also low.


Improving learning of model

Our model’s training is good for “hello”, “hola” & “namaste”. But we found that the prediction for “howdy” was not good. So we can improve our training data & retrain the model. Let’s add “howdy” as a sample for the English language.

Now we have to rerun the training code & then the test code to see the effect of this change.

For real world applications, you don’t have to train a model from scratch. You can use existing models available online, like the pre-trained language detection model from Apache.
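Adding the new sample is just one more line in the training file. You can do it in any text editor; the sketch below does the same in plain Java. The class name GreetingSamplesAppender, the file name greetings.txt & the label “english” are assumptions here; use whatever label your file already uses for English.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: appends one "language<TAB>text" sample to the training file.
class GreetingSamplesAppender {

    static void appendSample(Path file, String language, String text) throws IOException {
        StringBuilder line = new StringBuilder();

        // If the file does not end with a line break, add one first,
        // otherwise the new sample would be glued to the last existing line.
        if (Files.exists(file)) {
            byte[] existing = Files.readAllBytes(file);
            if (existing.length > 0 && existing[existing.length - 1] != '\n') {
                line.append(System.lineSeparator());
            }
        }

        line.append(language).append('\t').append(text).append(System.lineSeparator());
        Files.write(file, line.toString().getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        appendSample(Paths.get("greetings.txt"), "english", "howdy");
    }
}
```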





Here is the complete code




Here is the text file which we used for training.



Further reading on Apache NLP

Natural Language Processing in Java using Apache OpenNLP | String tokenizer | Simple example for beginners

2 Replies to “Natural Language Processing in Java using Apache OpenNLP | Language Detection | Simple Hello example for beginners”

  1. Hi,

    I am facing this error while running this program. Can you please help?

    Indexing events with OnePass using cutoff of 5

    Computing event counts... done. 70 events
    Indexing... Exception in thread "main" java.lang.NoSuchMethodError: java.util.Comparator.comparingInt(Ljava/util/function/ToIntFunction;)Ljava/util/Comparator;
    at opennlp.tools.ml.model.AbstractDataIndexer.toIndexedStringArray(AbstractDataIndexer.java:236)
    at opennlp.tools.ml.model.AbstractDataIndexer.index(AbstractDataIndexer.java:186)
    at opennlp.tools.ml.model.OnePassDataIndexer.index(OnePassDataIndexer.java:54)
    at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:68)
    at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:90)
    at opennlp.tools.langdetect.LanguageDetectorME.train(LanguageDetectorME.java:92)
    at mqg.support.chatbot.OpenNLPLanguageDetectionExamle.main(OpenNLPLanguageDetectionExamle.java:36)
