In this article we will create a simple example of the part-of-speech (POS) tagging feature of ‘Natural Language Processing‘ (NLP), an aspect of ‘Artificial Intelligence‘, using the Apache OpenNLP API in Java. POS tagging is the process of analyzing the grammatical structure of a sentence and detecting the grammatical category of each word, such as verb, noun, etc.
Example for POS tagger in this article
- Train a model with sample sentences in which each word is marked with its grammatical category, i.e. part-of-speech or POS tag.
- Test with other sentences (input from the console) & verify that the correct POS tags are produced.
Creating training data
As per the javadoc documentation of WordTagSampleStream, the format of the training data file is as follows:
“A stream filter which reads a sentence per line which contains words and tags in word_tag format“
Let’s create such a file. Here you can find the list of POS tags. For simplicity, you can use one of the free online POS taggers like this one. We will use this tool with the input “itsallbinary is a blogging website with very good articles. I like this website.” & use the output to create the training data file as given below.
itsallbinary_NNP is_VBZ a_DT blogging_JJ website_NN with_IN very_JJ good_JJ articles_NNS ._.
I_PRP like_VBP this_DT website_NN ._.
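To make the word_tag format concrete, here is a small stdlib-only sketch (no OpenNLP dependency; the class name and sample line are just for illustration) that splits one training line into words and tags at the last underscore of each token, which is how each line is interpreted:

```java
public class WordTagFormatDemo {

    // Split a single word_tag token into { word, tag } at the last underscore.
    // Using the LAST underscore handles tokens like "._." where the word
    // itself could contain special characters.
    static String[] splitPair(String pair) {
        int sep = pair.lastIndexOf('_');
        return new String[] { pair.substring(0, sep), pair.substring(sep + 1) };
    }

    public static void main(String[] args) {
        // One training line in word_tag format, as in the file above.
        String line = "I_PRP like_VBP this_DT website_NN ._.";

        for (String pair : line.split("\\s+")) {
            String[] wt = splitPair(pair);
            System.out.println(wt[0] + " -> " + wt[1]); // e.g. I -> PRP
        }
    }
}
```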
Let’s train a model & test with sentences
We will use POSTaggerFactory with POSTaggerME to train a model, i.e. a POSModel. Once we have a trained model, we will use POSTaggerME to tag tokens from other sentences. We will also need the tokenizer feature to tokenize sentences into tokens. Read here about the tokenizer in detail with an example.
Here is the code for POS tagger training & testing.
package com.itsallbinary.nlp;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerFactory;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.postag.WordTagSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.FilterObjectStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.model.ModelUtil;

public class OpenNLPPOSTaggerExample {

    public static void main(String[] args) throws Exception {

        /**
         * Read human understandable data & train a model
         */

        // Read file with examples of POS tags.
        InputStreamFactory inputStreamFactory = new MarkableFileInputStreamFactory(new File("postagdata.txt"));
        ObjectStream<String> lineStream = new PlainTextByLineStream(inputStreamFactory, StandardCharsets.UTF_8);
        FilterObjectStream<String, POSSample> sampleStream = new WordTagSampleStream(lineStream);

        // Train a model from the file read above.
        TrainingParameters params = ModelUtil.createDefaultTrainingParameters();
        params.put(TrainingParameters.CUTOFF_PARAM, 0);
        POSModel model = POSTaggerME.train("en", sampleStream, params, new POSTaggerFactory());

        // Serialize model to a file so that next time we don't have to train a
        // model again. Next time we can just load this file directly into a model.
        model.serialize(new File("postagdata.bin"));

        /**
         * Let's tag sentences
         */
        try (Scanner scanner = new Scanner(System.in)) {

            while (true) {
                // Get inputs in loop
                System.out.println("Enter a sentence:");

                // Initialize POS tagger tool
                POSTaggerME myCategorizer = new POSTaggerME(model);

                // Tag sentence. tag() returns one POS tag per input token.
                String[] tokens = myCategorizer.tag(getTokens(scanner.nextLine()));
                for (String t : tokens) {
                    System.out.println("Tokens: " + t);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Tokenize sentence into tokens using the tokenizer model trained earlier.
     *
     * @param sentence sentence to tokenize
     * @return tokens of the sentence
     */
    private static String[] getTokens(String sentence) {
        try (InputStream modelIn = new FileInputStream("tokenz.bin")) {

            TokenizerModel model = new TokenizerModel(modelIn);
            TokenizerME myCategorizer = new TokenizerME(model);

            String[] tokens = myCategorizer.tokenize(sentence);
            for (String t : tokens) {
                System.out.println("Tokens: " + t);
            }
            return tokens;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }
}
Output:
As you can see in the output below, our input sentence “I like itsallbinary” is tagged correctly as per our training data.
Indexing events with TwoPass using cutoff of 0
Computing event counts...  done. 15 events
Indexing...  done.
Sorting and merging events... done. Reduced 15 events to 15.
Done indexing in 0.21 s.
Incorporating indexed data for training...  done.
	Number of Event Tokens: 15
	    Number of Outcomes: 10
	  Number of Predicates: 169
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-34.538776394910684	0.06666666666666667
...<skipped>...
100:  ... loglikelihood=-0.2395678041919296	1.0
Enter a sentence:
I like itsallbinary
Tokens: I
Tokens: like
Tokens: itsallbinary
Tokens: PRP
Tokens: VBP
Tokens: NNP
Enter a sentence:
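Note that in the output the first three “Tokens:” lines are the tokens printed by the tokenizer, and the next three are the POS tags printed by the tagger: tag() returns one tag per input token, in the same order. A small stdlib-only sketch of pairing the two parallel arrays (the arrays here are hard-coded from the output above, not produced by OpenNLP):

```java
public class TokenTagPairing {

    // Pair each token with the tag at the same index, in word_tag style.
    static String[] pair(String[] tokens, String[] tags) {
        String[] out = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            out[i] = tokens[i] + "_" + tags[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Tokens and tags copied from the output above.
        String[] tokens = { "I", "like", "itsallbinary" };
        String[] tags = { "PRP", "VBP", "NNP" };

        for (String p : pair(tokens, tags)) {
            System.out.println(p); // I_PRP, like_VBP, itsallbinary_NNP
        }
    }
}
```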