In this article we will create our own custom chat bot, or automated chat agent, using the Apache OpenNLP library, which provides natural language processing (NLP) in Java. NLP is a branch of artificial intelligence that processes human language so that machines can understand it, use it & act on it.
If you are completely new to the NLP side of artificial intelligence, you can go through this simple, example-based tutorial to get started.
Tutorial | Natural Language Processing in Java using Apache OpenNLP
Example in this article for Chat Bot
- The user will chat with the chat bot through the console (to keep the example simple).
- The chat bot will be for a hypothetical company that sells a product (a mobile phone).
- The user will inquire about the product, for example its features & price.
- The chat bot will reply with greetings, answers to questions about the product etc.
High level approach & flow for chat bot program
This short video gives you the high-level design approach that we will take in our code example. It is a very simple & basic approach that uses many features of natural language processing.
Models for our example
If you have gone through the basics tutorial for Apache OpenNLP, you know that each OpenNLP feature needs either a trained, serialized model or raw sample data from which to train one.
In the tutorials, we trained our own models for everything. For this example, we will use trained models from different sources (except for the categorizer) so that we can focus on our chat bot code. All model files are also available in our GitHub repository, linked towards the end of this article.
- Sentence Detection, Tokenizer, POS Tagger model
- Link- http://opennlp.sourceforge.net/models-1.5/
- These models are referenced by Apache in its documentation.
- These are serialized trained models so we can directly use them.
- Lemmatizer model –
- Link – Github
- We could not find a lemmatizer model from Apache, so we are using this one from a public GitHub repository. You can use any other model if you have one.
- This is not a serialized model. These are training samples, so we have to train & serialize our own model.
- Categorizer model –
- As explained in the video, we want to categorize or classify the user's input into certain categories so that our code knows how to respond. We will create our own custom model for the categorizer.
- Categories: To keep the chat simple, let's define the categories below. You can add or refine categories to experiment & improve.
- greeting – Basic greetings that we anticipate the user will use to start the chat.
- conversation-continue – Words like “ok”, “hmm” that the user might use in the middle of a conversation.
- conversation-complete – Words or sentences that the user might use to end the conversation.
- product-inquiry – Questions that the user might ask to inquire about the product or its features.
- price-inquiry – Questions that the user might ask to inquire about the price of the product.
Let's create some sample data for the categories above. Each line starts with the category name, followed by the sample text for that category.

    greeting hi hi. hello how are you ? hey . hows it going good morning good evening good afternoon howdy
    conversation-continue ok wow great oh is it ? oh ohh Thats great Good Really hmm lemme think hm hmmm hmmmmm
    price-inquiry How much is the cost ? What's the price ? What's the cost ? How much is price? How much it costs ? price cost money how much. how much?
    product-inquiry What is this product ? What's the product ? What does this product do ? What is name of product ? What this product is all about? Can you tell me something about this product ? What are the product features ? Tell me feature of product.
    conversation-complete Thanks done that's all thats it. thats it I am done Thank you for information Thank. thanks. thank you. thanks nothing else thats it.

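Before training, it can be useful to sanity-check the sample file. The sketch below is plain Java with no OpenNLP dependency. It treats the first whitespace-separated token of each line as the category and the rest as sample text, and it reports how many sample tokens each category has. The class name `TrainingDataCheck` and its method are hypothetical helpers for illustration, and rejecting lines with fewer than two tokens is a choice made in this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: checks the "category text..." line format.
final class TrainingDataCheck {

    /** Returns category -> number of sample tokens, in the order categories first appear. */
    static Map<String, Integer> countTokensPerCategory(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // First token is the category, remaining tokens are the sample text.
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) {
                throw new IllegalArgumentException("Line needs a category and sample text: '" + line + "'");
            }
            counts.merge(parts[0], parts.length - 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "greeting hi hello good morning",
                "price-inquiry How much is the cost ?");
        System.out.println(countTokensPerCategory(lines));
    }
}
```

A category with only a handful of tokens is a hint that it needs more samples before the categorizer can recognize it reliably.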
Here is the code to train a categorizer model from the sample data above.

    /**
     * Train categorizer model as per the category sample training data we created.
     *
     * @return
     * @throws FileNotFoundException
     * @throws IOException
     */
    private static DoccatModel trainCategorizerModel() throws FileNotFoundException, IOException {
        // faq-categorizer.txt is a custom training data with categories as per our chat
        // requirements.
        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("faq-categorizer.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

        DoccatFactory factory = new DoccatFactory(new FeatureGenerator[] { new BagOfWordsFeatureGenerator() });

        TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
        params.put(TrainingParameters.CUTOFF_PARAM, 0);

        // Train a model with classifications from above file.
        DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);
        return model;
    }

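To see why the training code sets the cutoff to 0, it helps to picture what bag-of-words features and a frequency cutoff do. The sketch below is a plain-Java illustration of the idea, not OpenNLP's implementation. It turns each token into a `bow=<token>` feature, counts how often each feature occurs, and drops features seen fewer than `cutoff` times. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: mimics the idea of bag-of-words features plus a frequency cutoff.
final class BagOfWordsCutoffDemo {

    /** Counts "bow=<token>" features and keeps those seen at least `cutoff` times. */
    static Map<String, Integer> featuresAboveCutoff(List<String> tokens, int cutoff) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge("bow=" + token, 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < cutoff);
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("how", "much", "is", "the", "cost", "how", "much");
        System.out.println(featuresAboveCutoff(tokens, 0)); // every feature is kept
        System.out.println(featuresAboveCutoff(tokens, 2)); // only "bow=how" and "bow=much" remain
    }
}
```

With a training file as small as ours, most words occur only once or twice, so any cutoff above 1 would discard much of the vocabulary. That is the reason for keeping every feature here.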
Let's code the chat bot
Now that our models are ready and we are familiar with Apache OpenNLP & the approach we are going to take, it's time to code the chat bot.
We will prepare & store our answer for each category in a HashMap so it's easy to look up.

    private static Map<String, String> questionAnswer = new HashMap<>();

    /*
     * Define answers for each given category.
     */
    static {
        questionAnswer.put("greeting", "Hello, how can I help you?");
        questionAnswer.put("product-inquiry",
                "Product is a TipTop mobile phone. It is a smart phone with latest features like touch screen, blutooth etc.");
        questionAnswer.put("price-inquiry", "Price is $300");
        questionAnswer.put("conversation-continue", "What else can I help you with?");
        questionAnswer.put("conversation-complete", "Nice chatting with you. Bbye.");
    }

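One thing to keep in mind with a plain `HashMap` lookup is that `get` returns `null` for a category with no entry. If a new category is added to the training data but not to the map, the bot would print the text `null`. Here is a minimal sketch of a safer lookup, assuming a fallback message of our own choosing. The class name `AnswerLookup` and the fallback text are illustrative and are not part of the original program.

```java
import java.util.HashMap;
import java.util.Map;

final class AnswerLookup {

    // Hypothetical fallback text, used when a category has no prepared answer.
    static final String FALLBACK = "Sorry, I did not understand that. Could you rephrase?";

    private static final Map<String, String> questionAnswer = new HashMap<>();

    static {
        questionAnswer.put("greeting", "Hello, how can I help you?");
        questionAnswer.put("price-inquiry", "Price is $300");
    }

    /** Returns the prepared answer for a category, or the fallback if none exists. */
    static String answerFor(String category) {
        return questionAnswer.getOrDefault(category, FALLBACK);
    }

    public static void main(String[] args) {
        System.out.println(answerFor("price-inquiry"));
        System.out.println(answerFor("warranty-inquiry")); // no entry, so the fallback is printed
    }
}
```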
Here is the code for the steps of chat bot.
- It first trains the categorizer model with the latest sample data.
- Then it takes input from the user (through the console).
- Then it breaks the input into sentences.
- It tokenizes each sentence into words.
- It finds the POS tag of each word, as this is required in the next step.
- It lemmatizes each word using the tokens & POS tags. This makes categorizing much easier, as the categorizer sample data does not need to contain every inflected form of each word.
- It categorizes the lemma tokens & then finds the answer for the detected category.

    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {

        // Train categorizer model to the training data we created.
        DoccatModel model = trainCategorizerModel();

        // Take chat inputs from console (user) in a loop.
        Scanner scanner = new Scanner(System.in);
        while (true) {

            // Get chat input from user.
            System.out.println("##### You:");
            String userInput = scanner.nextLine();

            // Break user's chat input into sentences using sentence detection.
            String[] sentences = breakSentences(userInput);

            String answer = "";
            boolean conversationComplete = false;

            // Loop through sentences.
            for (String sentence : sentences) {

                // Separate words from each sentence using tokenizer.
                String[] tokens = tokenizeSentence(sentence);

                // Tag separated words with POS tags to understand their grammatical structure.
                String[] posTags = detectPOSTags(tokens);

                // Lemmatize each word so that it's easy to categorize.
                String[] lemmas = lemmatizeTokens(tokens, posTags);

                // Determine BEST category from the lemmatized tokens, using the model that
                // we trained at start.
                String category = detectCategory(model, lemmas);

                // Get predefined answer from given category & add to answer.
                answer = answer + " " + questionAnswer.get(category);

                // If category conversation-complete, we will end chat conversation.
                if ("conversation-complete".equals(category)) {
                    conversationComplete = true;
                }
            }

            // Print answer back to user. If conversation is marked as complete, then end
            // loop & program.
            System.out.println("##### Chat Bot: " + answer);
            if (conversationComplete) {
                break;
            }
        }
    }

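Because the loop appends one answer per sentence, an input with several sentences produces a combined reply. If two sentences fall into the same category, the same answer appears twice. Here is a small sketch of one way to assemble the reply without repeats or a leading space, using only the JDK. `ReplyBuilder` is a hypothetical helper and is not part of the original program.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ReplyBuilder {

    /** Joins per-sentence answers with single spaces, keeping first-seen order and dropping repeats. */
    static String buildReply(List<String> answers) {
        Set<String> unique = new LinkedHashSet<>(answers);
        return String.join(" ", unique);
    }

    public static void main(String[] args) {
        List<String> answers = Arrays.asList(
                "Hello, how can I help you?",
                "Price is $300",
                "Price is $300");
        System.out.println(buildReply(answers)); // Hello, how can I help you? Price is $300
    }
}
```

In the chat loop this would mean collecting each sentence's answer in a list and calling the helper once after the loop, instead of concatenating strings inside it.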
Here are the methods for sentence detection, tokenizing, POS or part-of-speech tagging, lemmatizing & categorizing.

    /**
     * Detect category using given token. Use categorizer feature of Apache OpenNLP.
     *
     * @param model
     * @param finalTokens
     * @return
     * @throws IOException
     */
    private static String detectCategory(DoccatModel model, String[] finalTokens) throws IOException {

        // Initialize document categorizer tool
        DocumentCategorizerME myCategorizer = new DocumentCategorizerME(model);

        // Get best possible category.
        double[] probabilitiesOfOutcomes = myCategorizer.categorize(finalTokens);
        String category = myCategorizer.getBestCategory(probabilitiesOfOutcomes);
        System.out.println("Category: " + category);

        return category;
    }

    /**
     * Break data into sentences using sentence detection feature of Apache OpenNLP.
     *
     * @param data
     * @return
     * @throws FileNotFoundException
     * @throws IOException
     */
    private static String[] breakSentences(String data) throws FileNotFoundException, IOException {
        // Better to read file once at start of program & store model in instance
        // variable. but keeping here for simplicity in understanding.
        try (InputStream modelIn = new FileInputStream("en-sent.bin")) {

            SentenceDetectorME sentenceDetector = new SentenceDetectorME(new SentenceModel(modelIn));

            String[] sentences = sentenceDetector.sentDetect(data);
            System.out.println("Sentence Detection: " + Arrays.stream(sentences).collect(Collectors.joining(" | ")));

            return sentences;
        }
    }

    /**
     * Break sentence into words & punctuation marks using tokenizer feature of
     * Apache OpenNLP.
     *
     * @param sentence
     * @return
     * @throws FileNotFoundException
     * @throws IOException
     */
    private static String[] tokenizeSentence(String sentence) throws FileNotFoundException, IOException {
        // Better to read file once at start of program & store model in instance
        // variable. but keeping here for simplicity in understanding.
        try (InputStream modelIn = new FileInputStream("en-token.bin")) {

            // Initialize tokenizer tool
            TokenizerME tokenizer = new TokenizerME(new TokenizerModel(modelIn));

            // Tokenize sentence.
            String[] tokens = tokenizer.tokenize(sentence);
            System.out.println("Tokenizer : " + Arrays.stream(tokens).collect(Collectors.joining(" | ")));

            return tokens;
        }
    }

    /**
     * Find part-of-speech or POS tags of all tokens using POS tagger feature of
     * Apache OpenNLP.
     *
     * @param tokens
     * @return
     * @throws IOException
     */
    private static String[] detectPOSTags(String[] tokens) throws IOException {
        // Better to read file once at start of program & store model in instance
        // variable. but keeping here for simplicity in understanding.
        try (InputStream modelIn = new FileInputStream("en-pos-maxent.bin")) {

            // Initialize POS tagger tool
            POSTaggerME posTagger = new POSTaggerME(new POSModel(modelIn));

            // Tag sentence.
            String[] posTokens = posTagger.tag(tokens);
            System.out.println("POS Tags : " + Arrays.stream(posTokens).collect(Collectors.joining(" | ")));

            return posTokens;
        }
    }

    /**
     * Find lemma of tokens using lemmatizer feature of Apache OpenNLP.
     *
     * @param tokens
     * @param posTags
     * @return
     * @throws InvalidFormatException
     * @throws IOException
     */
    private static String[] lemmatizeTokens(String[] tokens, String[] posTags)
            throws InvalidFormatException, IOException {
        // Better to read file once at start of program & store model in instance
        // variable. but keeping here for simplicity in understanding.
        try (InputStream modelIn = new FileInputStream("en-lemmatizer.bin")) {

            // Initialize lemmatizer tool
            LemmatizerME lemmatizer = new LemmatizerME(new LemmatizerModel(modelIn));

            // Lemmatize tokens using their POS tags.
            String[] lemmaTokens = lemmatizer.lemmatize(tokens, posTags);
            System.out.println("Lemmatizer : " + Arrays.stream(lemmaTokens).collect(Collectors.joining(" | ")));

            return lemmaTokens;
        }
    }

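The comments in these methods note that it would be better to read each model file once instead of on every call. One way to do that, sketched here with only the JDK, is a small cache that runs a loader the first time a file name is requested and returns the stored object afterwards. `ModelCache` is a hypothetical name. In the chat bot the loader would open the file and build the OpenNLP model object (wrapping any `IOException` in an unchecked exception), while the stand-in loader below only returns a string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: loads each model once and reuses it on later calls.
// Uses a plain HashMap, which is enough for this single-threaded console program.
final class ModelCache {

    private static final Map<String, Object> CACHE = new HashMap<>();

    /** Returns the cached object for fileName, calling the loader only on first use. */
    @SuppressWarnings("unchecked")
    static <T> T get(String fileName, Function<String, T> loader) {
        return (T) CACHE.computeIfAbsent(fileName, loader);
    }

    public static void main(String[] args) {
        int[] loads = {0};
        // Stand-in loader: a real one would read the file into a model object.
        Function<String, String> fakeLoader = name -> {
            loads[0]++;
            return "model:" + name;
        };
        get("en-token.bin", fakeLoader);
        get("en-token.bin", fakeLoader); // the second call hits the cache
        System.out.println("Loader calls: " + loads[0]);
    }
}
```

The same idea applies to the tool objects built from the models. Creating them once at startup avoids repeating the file read and model parsing for every sentence.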
Let's chat now
Now comes the interesting part. Let's chat with our chat bot.

    ......<Skipped training log>.........
    ##### You:
    Hey
    Sentence Detection: Hey
    Tokenizer : Hey
    POS Tags : UH
    Lemmatizer : hey
    Category: greeting
    ##### Chat Bot: Hello, how can I help you?
    ##### You:
    I wanted to know about the product
    Sentence Detection: I wanted to know about the product
    Tokenizer : I | wanted | to | know | about | the | product
    POS Tags : PRP | VBD | TO | VB | IN | DT | NN
    Lemmatizer : i | want | to | know | about | the | product
    Category: product-inquiry
    ##### Chat Bot: Product is a TipTop mobile phone. It is a smart phone with latest features like touch screen, blutooth etc.
    ##### You:
    wow great
    Sentence Detection: wow great
    Tokenizer : wow | great
    POS Tags : NN | JJ
    Lemmatizer : wow | great
    Category: conversation-continue
    ##### Chat Bot: What else can I help you with?
    ##### You:
    Can you also tell me price
    Sentence Detection: Can you also tell me price
    Tokenizer : Can | you | also | tell | me | price
    POS Tags : MD | PRP | RB | VB | PRP | NN
    Lemmatizer : can | you | also | tell | i | price
    Category: price-inquiry
    ##### Chat Bot: Price is $300
    ##### You:
    ok
    Sentence Detection: ok
    Tokenizer : ok
    POS Tags : NN
    Lemmatizer : ok
    Category: conversation-continue
    ##### Chat Bot: What else can I help you with?
    ##### You:
    Thats it for now, thanks
    Sentence Detection: Thats it for now, thanks
    Tokenizer : Thats | it | for | now | , | thanks
    POS Tags : NNS | PRP | IN | RB | , | NNS
    Lemmatizer : that | it | for | now | , | thanks
    Category: conversation-complete
    ##### Chat Bot: Nice chatting with you. Bbye.

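Each `Category:` line in the transcript is the outcome with the highest probability in the array that the categorizer returns. The plain-Java sketch below shows that selection step on its own. It also adds an optional minimum-probability check so that a weak best guess can be treated as unknown. The threshold, the `UNKNOWN` label and the class name are assumptions for illustration and are not part of the original program or of OpenNLP.

```java
// Illustration only: picks the most probable category, with a hypothetical confidence threshold.
final class BestCategoryPicker {

    static final String UNKNOWN = "unknown"; // hypothetical label for low-confidence input

    /** Returns the category with the highest probability, or UNKNOWN if it is below minProbability. */
    static String pick(String[] categories, double[] probabilities, double minProbability) {
        if (categories.length == 0 || categories.length != probabilities.length) {
            throw new IllegalArgumentException("categories and probabilities must be non-empty and of equal length");
        }
        int best = 0;
        for (int i = 1; i < probabilities.length; i++) {
            if (probabilities[i] > probabilities[best]) {
                best = i;
            }
        }
        return probabilities[best] >= minProbability ? categories[best] : UNKNOWN;
    }

    public static void main(String[] args) {
        String[] categories = { "greeting", "price-inquiry", "product-inquiry" };
        System.out.println(pick(categories, new double[] { 0.10, 0.75, 0.15 }, 0.5)); // price-inquiry
        System.out.println(pick(categories, new double[] { 0.34, 0.33, 0.33 }, 0.5)); // unknown
    }
}
```

With only five small categories the probabilities can be close to each other, so a check like this is one way to avoid answering confidently when the input does not clearly match any category.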
You can find the complete code of this example, including model files, in our GitHub repository.
Further improvements you can experiment with
- You can further improve the categorizer sample data by adding more categories & more samples.
- You can try adding the “Language detection” feature of Apache OpenNLP to detect which language the user is using. If the user is using some other language, you can ask them to use a specific language. Find the pre-trained serialized language model here on the Apache site.