In this article we will cover the basics of the ‘Natural Language Processing’ (NLP) side of ‘Artificial Intelligence’ using the Apache OpenNLP API in Java. We will focus on the ‘Language detection’ feature of NLP and use it to detect the language of a simple greeting.
Example for NLP language detection in this article
- Train a model to understand simple greetings in different languages like “hello”, “hola”, “namaste” etc.
- Then we will do some tests to detect language.
- We will then enhance training to adjust results to our expectations.
Here is the Maven dependency for Apache OpenNLP which we will use in our example.
Concepts
- Model
- A ‘model’ is a set of learnings acquired through a process called ‘training’.
- In Apache OpenNLP, a class like LanguageDetectorModel represents the model for the language detection feature of NLP.
- Learn-able tool
- Learn-able tools take the learnings from a model & use them to produce outputs for the inputs that you give.
- In Apache OpenNLP, a class like LanguageDetectorME represents a tool which actually detects the language of given inputs.
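To make the split between ‘model’ and ‘tool’ concrete, here is a deliberately tiny sketch in plain Java. It is a toy exact-match lookup, not how OpenNLP learns or predicts, and all class and method names in it (ToyModel, ToyDetector, train, detect) are made up for this illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Toy illustration only: an exact-match lookup, NOT OpenNLP's algorithm.
class ToyModelAndTool {

    // The "model": the learnings produced by training (here, a greeting -> language table).
    static final class ToyModel {
        final Map<String, String> languageByGreeting;

        ToyModel(Map<String, String> languageByGreeting) {
            this.languageByGreeting = Collections.unmodifiableMap(new HashMap<>(languageByGreeting));
        }
    }

    // "Training": turns samples into a model. Each sample is {language, greeting}.
    static ToyModel train(String[][] samples) {
        Map<String, String> learned = new HashMap<>();
        for (String[] sample : samples) {
            learned.put(sample[1].toLowerCase(Locale.ROOT), sample[0]);
        }
        return new ToyModel(learned);
    }

    // The "tool": holds a model and uses its learnings to answer for new inputs.
    static final class ToyDetector {
        private final ToyModel model;

        ToyDetector(ToyModel model) {
            this.model = model;
        }

        String detect(String input) {
            String key = input.trim().toLowerCase(Locale.ROOT);
            return model.languageByGreeting.getOrDefault(key, "unknown");
        }
    }

    public static void main(String[] args) {
        ToyModel model = train(new String[][] {
                {"English", "Hello"}, {"Spanish", "Hola"}, {"Hindi", "Namaste"}});
        ToyDetector detector = new ToyDetector(model);
        System.out.println(detector.detect("hola"));
        System.out.println(detector.detect("howdy"));
    }
}
```

The point is only the shape: training produces a model object, and a separate tool object is constructed from that model to answer questions.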
Create training samples data file for “hello” in many languages
Now our first objective is to create a model & train it to understand human greetings in different languages like “hello”, “hola”, “namaste” etc. We can search online for ‘hello in 100 languages’ & get any results to use for training. Here is the one which we will use for our example.
To put this data into a file & feed it to the model, we need to use a stream class, i.e. LanguageDetectorSampleStream. As per the API documentation, it expects the file in the following format:
Format:
Each line contains one sample document.
The language is the first string in the line followed by a tab and the document content.
Sample line: category-string tab-char document line-break-char(s)
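The format above can be sketched with two small helpers in plain Java: one that builds a sample line and one that splits it again. This is only an illustration of the documented layout; the helper names are hypothetical, and splitting at the first tab is an assumption based on the description ("the language is the first string in the line followed by a tab").

```java
class TrainingLineFormat {

    // Builds one sample line: language, a tab character, then the document text.
    static String toLine(String language, String text) {
        if (language.isEmpty() || language.indexOf('\t') >= 0) {
            throw new IllegalArgumentException("Language must be non-empty and contain no tab");
        }
        if (text.indexOf('\n') >= 0 || text.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("A sample must fit on one line");
        }
        return language + "\t" + text;
    }

    // Splits a sample line at its first tab into {language, text}.
    static String[] parseLine(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            throw new IllegalArgumentException("Expected <language><TAB><text>: " + line);
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }

    public static void main(String[] args) {
        String line = toLine("Hindi", "नमस्ते Namaste");
        String[] parts = parseLine(line);
        System.out.println("language=" + parts[0] + ", text=" + parts[1]);
    }
}
```

Because only the tab separates the two parts, language names with spaces such as "Swiss German (Informal)" are fine as long as they contain no tab.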
So let's create such a file in the given format from the online data that we got. The file will look somewhat like this:
Afrikaans	Hallo
Albanian	Mirë dita
Amharic	ታዲያስ tadiyas
.
.
English	Hello
.
Hindi	नमस्ते Namaste
.
Spanish	Hola
Ideally, ISO 639-3 language codes should be used as the language names, but since we are just trying to understand the basics, we will stick to the names we got online.
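If you do want ISO 639-3 codes later, one option is to relabel the training lines before feeding them to the stream. Below is a minimal sketch in plain Java; the class name, the helper name and the small lookup table are assumptions made for this illustration, and the table only covers a handful of the languages in our file.

```java
import java.util.HashMap;
import java.util.Map;

class IsoLabels {

    // Small hand-written table of language name -> ISO 639-3 code (incomplete on purpose).
    static final Map<String, String> ISO_639_3 = new HashMap<>();
    static {
        ISO_639_3.put("Afrikaans", "afr");
        ISO_639_3.put("Albanian", "sqi");
        ISO_639_3.put("Amharic", "amh");
        ISO_639_3.put("English", "eng");
        ISO_639_3.put("Hindi", "hin");
        ISO_639_3.put("Spanish", "spa");
    }

    // Replaces the language name before the first tab with its code, if we know one.
    static String relabel(String line) {
        int tab = line.indexOf('\t');
        if (tab <= 0) {
            return line; // not a sample line, leave untouched
        }
        String code = ISO_639_3.get(line.substring(0, tab));
        return code == null ? line : code + line.substring(tab);
    }

    public static void main(String[] args) {
        System.out.println(relabel("English\tHello"));
        System.out.println(relabel("Woiworung\tWomenjeka"));
    }
}
```

Lines whose language name is not in the table are left unchanged, so nothing is silently dropped.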
Let's train a model to understand ‘Hello’
Now we will train a LanguageDetectorModel using the Apache OpenNLP API, as shown in the example below. After training, we will serialize the model to a file so that we don't have to train it again next time. We can simply load & reuse the trained model.
// Read file with greetings in many languages
InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
        new File("helloInManyLanguages.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
ObjectStream<LanguageSample> sampleStream = new LanguageDetectorSampleStream(lineStream);

// Train a model from the greetings with many languages.
LanguageDetectorModel model = LanguageDetectorME.train(sampleStream,
        ModelUtil.createDefaultTrainingParameters(), new LanguageDetectorFactory());

// Serialize model to some file so that next time we don't have to again train a
// model. Next time We can just load this file directly into model.
model.serialize(new File("helloInManyLanguagesModel.bin"));
Let's use & test the trained model
Now that our model is ready & serialized, we can test it using LanguageDetectorME, as shown in the example below. We read inputs from the console in a loop so that we can run multiple tests.
try (InputStream modelIn = new FileInputStream("helloInManyLanguagesModel.bin");
        Scanner scanner = new Scanner(System.in);) {

    // Load serialized trained model
    LanguageDetectorModel trainedModel = new LanguageDetectorModel(modelIn);

    while (true) {
        // Get inputs in loop
        System.out.println("Enter a greeting:");
        String inputText = scanner.nextLine();

        // Initialize language detector tool
        LanguageDetectorME myCategorizer = new LanguageDetectorME(trainedModel);

        // Get language prediction based on learnings.
        Language bestLanguage = myCategorizer.predictLanguage(inputText);
        System.out.println("Best language: " + bestLanguage.getLang());
        System.out.println("Best language confidence: " + bestLanguage.getConfidence());
    }

} catch (Exception e) {
    e.printStackTrace();
}
Output
Indexing events with OnePass using cutoff of 5
Computing event counts... done. 70 events
Indexing... done.
Sorting and merging events... done. Reduced 70 events to 70.
Done indexing in 0.13 s.
Incorporating indexed data for training... done.
Number of Event Tokens: 70
Number of Outcomes: 70
Number of Predicates: 62
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-297.3946669434548 0.014285714285714285
...<skipped lines>...
100: ... loglikelihood=-22.958451392996306 0.9142857142857143
Enter a greeting:
hello
Best language: English
Best language confidence: 0.8374414658961518
Enter a greeting:
hola
Best language: Spanish
Best language confidence: 0.8030706130757507
Enter a greeting:
namaste
Best language: Hindi
Best language confidence: 0.7832174181668613
Enter a greeting:
howdy
Best language: Swiss German (Informal)
Best language confidence: 0.22882771281091271
Enter a greeting:
As you can see, our language detector correctly identified “hello”, “hola” & “namaste” with good confidence. But when we gave it an unknown word, “howdy”, it still returned its most probable language, which was not the correct one. The confidence level was also low (about 0.23).
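One simple way to act on a low confidence is to treat such predictions as ‘unknown’ instead of trusting the best guess. Here is a minimal sketch in plain Java; the 0.5 cut-off and the names in it are assumptions chosen for this illustration, not something OpenNLP prescribes.

```java
// Sketch only: the 0.5 cut-off is an assumed value, tune it for your own data.
class ConfidenceThreshold {

    static final double MIN_CONFIDENCE = 0.5;

    // Returns the predicted language, or "unknown" when the confidence is too low.
    static String languageOrUnknown(String predictedLanguage, double confidence) {
        return confidence >= MIN_CONFIDENCE ? predictedLanguage : "unknown";
    }

    public static void main(String[] args) {
        // Confidence values copied from the test run shown in this article.
        System.out.println(languageOrUnknown("English", 0.8374414658961518));
        System.out.println(languageOrUnknown("Swiss German (Informal)", 0.22882771281091271));
    }
}
```

In the test loop, the two arguments would come from bestLanguage.getLang() and bestLanguage.getConfidence().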
Improving learning of model
Our model's training is good for “hello”, “hola” & “namaste”, but the prediction for “howdy” was not good. So we can improve our training data & retrain the model. Let's add “howdy” to the English line.
Afrikaans	Hallo
Albanian	Mirë dita
Amharic	ታዲያስ tadiyas
.
.
English	Hello howdy
.
Hindi	नमस्ते Namaste
.
Spanish	Hola
Now we have to rerun the training code & then the test code. Here is our output after this change.
Indexing events with OnePass using cutoff of 5
Computing event counts... done. 70 events
Indexing... done.
Sorting and merging events... done. Reduced 70 events to 70.
Done indexing in 0.14 s.
Incorporating indexed data for training... done.
Number of Event Tokens: 70
Number of Outcomes: 70
Number of Predicates: 62
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-297.3946669434548 0.014285714285714285
...<skipped lines>...
100: ... loglikelihood=-22.83894306766496 0.9142857142857143
Enter a greeting:
hello
Best language: English
Best language confidence: 0.17837730981405164
Enter a greeting:
hola
Best language: Spanish
Best language confidence: 0.8012867582252314
Enter a greeting:
namaste
Best language: Hindi
Best language confidence: 0.7833156996881588
Enter a greeting:
howdy
Best language: English
Best language confidence: 0.3220329440331727
Enter a greeting:
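As the output shows, “howdy” is now detected as English. Note, however, that the confidence for “hello” dropped from about 0.84 to about 0.18, so with such a small training set, adding one sample can noticeably shift other predictions too.
For real-world applications, you don't have to train a model from scratch. You can use existing models available online, like this one from Apache.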
Here is the complete code
package com.itsallbinary.nlp;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorFactory;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;
import opennlp.tools.langdetect.LanguageDetectorSampleStream;
import opennlp.tools.langdetect.LanguageSample;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.model.ModelUtil;

public class OpenNLPLanguageDetectionExamle {

    public static void main(String[] args) throws Exception {

        /**
         * Read human understandable data & train a model
         */

        // Read file with greetings in many languages
        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(
                new File("helloInManyLanguages.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
        ObjectStream<LanguageSample> sampleStream = new LanguageDetectorSampleStream(lineStream);

        // Train a model from the greetings with many languages.
        LanguageDetectorModel model = LanguageDetectorME.train(sampleStream,
                ModelUtil.createDefaultTrainingParameters(), new LanguageDetectorFactory());

        // Serialize model to some file so that next time we don't have to again train a
        // model. Next time We can just load this file directly into model.
        model.serialize(new File("helloInManyLanguagesModel.bin"));

        /**
         * Load model from serialized file & lets detect languages.
         */
        try (InputStream modelIn = new FileInputStream("helloInManyLanguagesModel.bin");
                Scanner scanner = new Scanner(System.in);) {

            // Load serialized trained model
            LanguageDetectorModel trainedModel = new LanguageDetectorModel(modelIn);

            while (true) {
                // Get inputs in loop
                System.out.println("Enter a greeting:");
                String inputText = scanner.nextLine();

                // Initialize language detector tool
                LanguageDetectorME myCategorizer = new LanguageDetectorME(trainedModel);

                // Get language prediction based on learnings.
                Language bestLanguage = myCategorizer.predictLanguage(inputText);
                System.out.println("Best language: " + bestLanguage.getLang());
                System.out.println("Best language confidence: " + bestLanguage.getConfidence());
            }

        } catch (Exception e) {
            e.printStackTrace();
        }

    }

}
Here is the text file which we used for training.
Afrikaans	Hallo
Albanian	Mirë dita
Amharic	ታዲያስ tadiyas
Arabic	مرحبا or مَرْحَبًا marhaban or Marhabaa
Azerbaijani	Салам or سلام Salam
Bengali	নমস্কার nomoshkaar or namaskar
Bosnian	Zdravo
Bulgarian (Formal)	Здравей zdravey
Bulgarian (Informal)	Здрасти zdrasti
Croatian	Bok
Czech	ahoj
Danish	Hej
Dutch	Hallo
English	Hello howdy
Esperanto	Saluton
Estonian	Tere
Farsi	سلام or درود بر تو or درود بر شما Salaam or Dorood bar to or D orood bar shoma
Fijian	Bula
Finnish	Terve
French (Informal)	Salut
Greek	Γεια σου yiassoo
Hawaiian	Aloha
Hebrew	שלום Shalom
Hindi	नमस्ते Namaste
Hungarian (Plural)	Sziasztok
Hungarian (Singular)	Szia
Indonesian	Halo or Hai
Irish (Plural)	Dia dhaoibh
Italian (Formal)	Salve
Italian (Informal)	Ciao
Japanese	こんにちは Kon'nichiwa
Kannada	ನಮಸ್ಕಾರ namaskār
Korean (Formal)	안녕하세요 an-nyeong-ha-se-yo
Korean (Informal)	안녕 annyeong
Latin (Plural)	Salvete
Latin (Singular)	Salve
Latvian	Sveiki
Limburgish	Hallau
Lithuanian	Sveiki
Macedonian	Добар ден Dobar den
Malaysian (Noon to 2pm)	Selamat tengahari
Malaysian (2pm to sunset)	Selamat petang
Maltese	Ħelow
Mandarin Chinese	你好 nǐ hǎo
Maori	Kia ora
Norwegian	Hei
Odia	ନମସ୍କାର Namaskār
Polish	Cześć or Hej
Portuguese	Oi
Romanian	alo or salut
Russian	Здравствуйте Zdrahstvootye or Привет Preevyet
Scottish Gaelic	Haló
Serbian	Здраво Zdravo
Shanghainese	侬好 noŋ hɔ
Slovak	ahoj
Spanish	Hola
Swabian	Grüss Gott
Swahili	Hujambo
Swedish	Hej or Hallá
Swiss German (Informal)	Hoi
Swiss German (Plural, Formal)	Grüezi mitenand
Swiss German (Singular, Formal)	Grüezi
Tamil	வனக்கம் vanakkam
Telugu	నమస్కారం namaskāram
Thai (Female)	สวัสดีค่ะ sawatdeekha
Thai (Male)	สวัสดีครับ sawatdeekhrap
Turkish	Merhaba
Vietnamese	Xin chào
Woiworung	Womenjeka
Yiddish	שלום Sholem
Hi,
I am facing this error while running this program. Can you please help?
Indexing events with OnePass using cutoff of 5
Computing event counts... done. 70 events
Indexing... Exception in thread "main" java.lang.NoSuchMethodError: java.util.Comparator.comparingInt(Ljava/util/function/ToIntFunction;)Ljava/util/Comparator;
at opennlp.tools.ml.model.AbstractDataIndexer.toIndexedStringArray(AbstractDataIndexer.java:236)
at opennlp.tools.ml.model.AbstractDataIndexer.index(AbstractDataIndexer.java:186)
at opennlp.tools.ml.model.OnePassDataIndexer.index(OnePassDataIndexer.java:54)
at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:68)
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:90)
at opennlp.tools.langdetect.LanguageDetectorME.train(LanguageDetectorME.java:92)
at mqg.support.chatbot.OpenNLPLanguageDetectionExamle.main(OpenNLPLanguageDetectionExamle.java:36)
Which JDK version are you using? As per the exception, it looks like the “java.util.Comparator.comparingInt” method is not found. This method was introduced in JDK 8 as per the API documentation.
https://docs.oracle.com/javase/8/docs/api/java/util/Comparator.html#comparingInt-java.util.function.ToIntFunction-
Please make sure that you are using JDK 8 or later.
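A quick way to check which Java runtime the program actually runs on (it can differ from the JDK configured in your IDE or build tool) is to print it at startup. This is just a small sketch using standard system properties; the class name is made up for illustration.

```java
class JavaVersionCheck {

    // Describes the Java runtime that is executing this code.
    static String describeRuntime() {
        return "java.version=" + System.getProperty("java.version")
                + ", java.home=" + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        System.out.println(describeRuntime());
    }
}
```

If the printed version is lower than 1.8, the runtime is older than JDK 8 and the training code will fail as shown in the stack trace above.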