Lemmatizer in Apache OpenNLP

A lemmatizer is a Natural Language Processing tool that removes inflectional changes in a word's form, such as tense, gender, and mood, and returns the dictionary or base form of the word.

In Apache OpenNLP, a lemmatizer returns the base or dictionary form of a word, usually called its lemma. For example, the noun cities becomes city, and the verb had becomes have when the correct part-of-speech tag is supplied.

The important detail is that OpenNLP lemmatization is not performed from the token alone. The lemmatizer expects each token and its corresponding POS tag. This is why a sentence is normally processed in this order: tokenize the text, assign POS tags, and then lemmatize the token and tag arrays.

Apache OpenNLP provides two methods for lemmatization:

  • Statistical Lemmatization
  • Dictionary based Lemmatization

The statistical lemmatizer needs a lemmatizer model (built from training data) to find the lemma of a given word, while the dictionary-based lemmatizer needs a dictionary that contains all valid combinations of {word, POS tag, lemma}.

The input to the lemmatizer is a set of tokens and their corresponding POS tags. So, to find lemmas for the words in a sentence, the sentence must first be tokenized using a Tokenizer and then POS tagged using a POS Tagger.

OpenNLP lemmatizer input: tokens and POS tags at the same index

The token array and POS tag array must have the same length. The word at tokens[i] is lemmatized using the tag at tags[i], and the result is stored at lemmas[i]. If these arrays are not aligned, the lemma result will be wrong even when the code runs without a visible error.

Token | POS tag | Expected lemma | Reason
cities | NNS | city | Plural noun converted to singular dictionary form.
had | VBD | have | Past-tense verb converted to base verb.
large | JJ | large | Adjective is already in base form.
newspapers | NNS | newspaper | Plural noun converted to singular noun.
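
The index alignment rule described above can be guarded with a small check before the lemmatizer is called. The following is a minimal sketch in plain Java that does not call OpenNLP; the class name AlignedInput and the method requireAligned are hypothetical names chosen for illustration.

```java
public class AlignedInput {

    // Hypothetical helper: fails fast when tokens and tags cannot be paired by index.
    public static void requireAligned(String[] tokens, String[] tags) {
        if (tokens == null || tags == null) {
            throw new IllegalArgumentException("tokens and tags must not be null");
        }
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException(
                    "tokens has " + tokens.length + " entries but tags has " + tags.length);
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"cities", "had", "large", "newspapers"};
        String[] tags = {"NNS", "VBD", "JJ", "NNS"};

        requireAligned(tokens, tags); // passes: four tokens, four tags

        // tokens[i] and tags[i] describe the same word
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " / " + tags[i]);
        }
    }
}
```

A length check only catches missing or extra tags. Two arrays of the same length that are shifted against each other still pass, so the safest approach is to produce the tags directly from the same token array that is later passed to the lemmatizer.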

DictionaryLemmatizer vs LemmatizerME in Apache OpenNLP

Use DictionaryLemmatizer when you have a dictionary file containing known word, POS tag, and lemma combinations. It is predictable and easy to inspect, but it cannot return a useful lemma when the required entry is missing from the dictionary.

Use LemmatizerME when you have a trained lemmatizer model. This is the statistical lemmatizer in OpenNLP. It can generalize from training data, but its quality depends on the model and the training corpus. For many Java applications, a dictionary lemmatizer is easier to start with, and a statistical model is added later when unknown words must be handled better.

OpenNLP dictionary lemmatizer file format

A dictionary lemmatizer file contains one entry per line. Each entry maps a surface word and POS tag to the lemma that should be returned. The three fields are separated by tab characters, shown as <TAB> in the listing below. Keep the POS tag set consistent with the POS tagger model you use; a dictionary created for one tag set may not match the output of another tagger.

word<TAB>POS_TAG<TAB>lemma
cities<TAB>NNS<TAB>city
had<TAB>VBD<TAB>have
newspapers<TAB>NNS<TAB>newspaper

If a word and POS tag combination can have more than one possible lemma, OpenNLP dictionary data can represent multiple lemmas separated with #. In simple English examples, one lemma per entry is usually enough.
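
To make the entry layout concrete, the sketch below parses one dictionary line into its three fields and splits the lemma field on #. It is an illustration of the file format in plain Java, not OpenNLP's own loading code; the class DictionaryLineSketch, the nested Entry type, and the bases entry with two lemmas are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.List;

public class DictionaryLineSketch {

    // One parsed entry: surface word, POS tag, and one or more lemmas.
    public static final class Entry {
        public final String word;
        public final String tag;
        public final List<String> lemmas;

        Entry(String word, String tag, List<String> lemmas) {
            this.word = word;
            this.tag = tag;
            this.lemmas = lemmas;
        }

        @Override
        public String toString() {
            return word + " [" + tag + "] -> " + lemmas;
        }
    }

    // Parses "word<TAB>tag<TAB>lemma[#lemma...]" where <TAB> is a real tab character.
    public static Entry parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 tab-separated fields: " + line);
        }
        return new Entry(fields[0], fields[1], Arrays.asList(fields[2].split("#")));
    }

    public static void main(String[] args) {
        System.out.println(parse("cities\tNNS\tcity"));
        // Illustrative entry with two lemmas, not taken from a real dictionary file.
        System.out.println(parse("bases\tNNS\tbase#basis"));
    }
}
```

Running a check like this over a dictionary file before loading it is one way to catch lines that use spaces instead of tabs.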

Example 1 – Dictionary Lemmatizer in Apache OpenNLP

You may download the dictionary from here, and en-pos-maxent.bin from here.

For newer OpenNLP projects, also check the official Apache OpenNLP models page and use model files that match the OpenNLP version in your project.

DictionaryLemmatizerExample.java

import opennlp.tools.lemmatizer.DictionaryLemmatizer;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Dictionary Lemmatizer Example in Apache OpenNLP
 */
public class DictionaryLemmatizerExample {

    public static void main(String[] args) {
        // test sentence, already split into tokens
        String[] tokens = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
                "morning", "and", "afternoon", "newspapers", "."};

        // try-with-resources closes both streams, even if loading fails
        try (InputStream posModelIn = new FileInputStream("models" + File.separator + "en-pos-maxent.bin");
             InputStream dictIn = new FileInputStream("dictionary" + File.separator + "en-lemmatizer.txt")) {

            // Parts-Of-Speech Tagging
            // loading the parts-of-speech model from the stream
            POSModel posModel = new POSModel(posModelIn);
            // initializing the parts-of-speech tagger with the model
            POSTaggerME posTagger = new POSTaggerME(posModel);
            // tagging the tokens: tags[i] belongs to tokens[i]
            String[] tags = posTagger.tag(tokens);

            // loading the lemmatizer with the dictionary
            DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(dictIn);

            // finding the lemmas: lemmas[i] belongs to tokens[i] and tags[i]
            String[] lemmas = lemmatizer.lemmatize(tokens, tags);

            // printing the results
            System.out.println("\nPrinting lemmas for the given sentence...");
            System.out.println("WORD -POSTAG : LEMMA");
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + " -" + tags[i] + " : " + lemmas[i]);
            }

        } catch (IOException e) {
            // also covers FileNotFoundException for a missing model or dictionary file
            e.printStackTrace();
        }
    }
}

Output

Printing lemmas for the given sentence...
WORD -POSTAG : LEMMA
Most -JJS : much
large -JJ : large
cities -NNS : city
in -IN : in
the -DT : the
US -NNP : O
had -VBD : have
morning -NN : O
and -CC : and
afternoon -NN : O
newspapers -NNS : newspaper
. -. : O

Note: If a combination of the word and POS tag is not found in the dictionary, the lemma is returned as an unknown marker. In the above example, the combinations US_NNP, morning_NN, afternoon_NN and ._. are not found in the dictionary, so the corresponding lemmas are printed as O.

The unknown lemma marker is the capital letter O. Treat it as a “no dictionary entry found” result, not as the lemma of the word. In production code, you may replace this marker with the original token, skip the token, or fall back to another lemmatization method depending on your application.
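
The first of these options, replacing the marker with the original token, could look like the sketch below. It is plain Java that works on the arrays a lemmatizer would return, so it runs without OpenNLP; the class name UnknownLemmaFallback and the method fallBackToTokens are hypothetical.

```java
public class UnknownLemmaFallback {

    // Marker printed for missing dictionary entries in the output above.
    static final String UNKNOWN = "O";

    // Returns a copy of lemmas in which each unknown marker is replaced
    // by the token at the same index.
    public static String[] fallBackToTokens(String[] tokens, String[] lemmas) {
        if (tokens.length != lemmas.length) {
            throw new IllegalArgumentException("tokens and lemmas must have the same length");
        }
        String[] result = new String[lemmas.length];
        for (int i = 0; i < lemmas.length; i++) {
            result[i] = UNKNOWN.equals(lemmas[i]) ? tokens[i] : lemmas[i];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"US", "had", "morning", "newspapers", "."};
        String[] lemmas = {"O", "have", "O", "newspaper", "O"};

        System.out.println(String.join(" ", fallBackToTokens(tokens, lemmas)));
        // prints: US have morning newspaper .
    }
}
```

Keeping the original token is a reasonable default for proper nouns and punctuation, where the surface form is usually already the form you want downstream.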

How the DictionaryLemmatizer Java example works

  1. The sentence is represented as a token array.
  2. The POS model is loaded from models/en-pos-maxent.bin.
  3. POSTaggerME assigns one POS tag to each token.
  4. The dictionary file is loaded from dictionary/en-lemmatizer.txt.
  5. DictionaryLemmatizer returns one lemma result for each token and POS tag pair.
  6. The loop prints the token, POS tag, and lemma at the same index.

Minimal Maven dependency for an OpenNLP lemmatizer example

If you are running this example in a Maven project, add the OpenNLP tools dependency. Keep the version compatible with the model files and documentation used in your project.

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>2.5.7</version>
</dependency>

If your existing project already uses another OpenNLP version, do not mix versions casually. Use the same OpenNLP library version throughout the application and test the POS tagger and lemmatizer together.

Small LemmatizerME skeleton for statistical OpenNLP lemmatization

The earlier program uses a dictionary. The following skeleton shows the part that changes when you use a trained statistical lemmatizer model instead. It assumes that you already have token and POS tag arrays.

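import opennlp.tools.lemmatizer.LemmatizerME;
import opennlp.tools.lemmatizer.LemmatizerModel;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StatisticalLemmatizerSkeleton {

    // tokens and tags are the aligned arrays produced by the tokenizer and POS tagger
    static String[] lemmatize(String[] tokens, String[] tags) throws IOException {
        // try-with-resources closes the model stream after loading
        try (InputStream modelIn = new FileInputStream("models/en-lemmatizer.bin")) {
            LemmatizerModel lemmatizerModel = new LemmatizerModel(modelIn);
            LemmatizerME lemmatizer = new LemmatizerME(lemmatizerModel);
            return lemmatizer.lemmatize(tokens, tags);
        }
    }
}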

Use this approach only when you have a trained lemmatizer model. If you only have a text dictionary file, use DictionaryLemmatizer as shown in the main example.
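
If you have both a dictionary and a trained model, one possible arrangement is to ask the dictionary first and consult the statistical lemmatizer only for positions that came back as O. The sketch below shows that control flow in plain Java. The interface TokenTagLemmatizer is a hypothetical stand-in with the same shape as the lemmatize(tokens, tags) calls in this tutorial, and the two stubs in main return made-up results in place of real lemmatizers.

```java
import java.util.Locale;

public class LemmatizerChainSketch {

    // Hypothetical stand-in for anything that maps (tokens, tags) to lemmas.
    @FunctionalInterface
    public interface TokenTagLemmatizer {
        String[] lemmatize(String[] tokens, String[] tags);
    }

    // Keeps the primary result and fills positions marked "O" from the fallback.
    public static String[] lemmatizeWithFallback(TokenTagLemmatizer primary,
                                                 TokenTagLemmatizer fallback,
                                                 String[] tokens, String[] tags) {
        String[] lemmas = primary.lemmatize(tokens, tags).clone();
        String[] fallbackLemmas = null; // computed only if something is unknown
        for (int i = 0; i < lemmas.length; i++) {
            if ("O".equals(lemmas[i])) {
                if (fallbackLemmas == null) {
                    fallbackLemmas = fallback.lemmatize(tokens, tags);
                }
                lemmas[i] = fallbackLemmas[i];
            }
        }
        return lemmas;
    }

    public static void main(String[] args) {
        // Stub for a dictionary lemmatizer: hard-coded result with two unknowns.
        TokenTagLemmatizer dictionaryStub = (toks, posTags) -> new String[]{"O", "have", "O"};
        // Stub for a statistical lemmatizer: simply lower-cases each token.
        TokenTagLemmatizer statisticalStub = (toks, posTags) -> {
            String[] out = new String[toks.length];
            for (int i = 0; i < toks.length; i++) {
                out[i] = toks[i].toLowerCase(Locale.ROOT);
            }
            return out;
        };

        String[] tokens = {"US", "had", "morning"};
        String[] tags = {"NNP", "VBD", "NN"};
        System.out.println(String.join(" ",
                lemmatizeWithFallback(dictionaryStub, statisticalStub, tokens, tags)));
        // prints: us have morning
    }
}
```

With OpenNLP on the classpath, the stubs would presumably be replaced by method references to a DictionaryLemmatizer and a LemmatizerME instance; treat that wiring as an assumption to verify against the OpenNLP version in your project.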

Lemmatization and stemming difference in NLP examples

Lemmatization and stemming both reduce word forms, but they are not the same. Stemming usually removes suffixes using rules and may produce a shortened form that is not a dictionary word. Lemmatization uses vocabulary and POS information to return a proper dictionary form.

Word | Possible stem | Lemma with correct POS
cities | citi | city
running | run | run
better | better | good, when used as a comparative adjective
had | had | have

For search indexing, stemming may be enough in some applications. For NLP pipelines that need meaningful base forms, lemmatization is usually the better fit because the output is closer to the dictionary form of the word.
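
The contrast can be shown with a toy comparison in plain Java. The stemmer below implements only the four plural-suffix rules of Porter step 1a, so it is far from a complete stemmer and does not cover cases such as running or better. The lemma lookup is a small hard-coded map that stands in for a dictionary file. The class name StemVersusLemma and both method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class StemVersusLemma {

    // Toy stemmer: plural rules only (SSES -> SS, IES -> I, SS -> SS, S -> "").
    public static String stem(String word) {
        if (word.endsWith("sses")) return word.substring(0, word.length() - 2);
        if (word.endsWith("ies")) return word.substring(0, word.length() - 2);
        if (word.endsWith("ss")) return word;
        if (word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    // Tiny stand-in for a lemma dictionary, keyed by word and POS tag.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("cities\tNNS", "city");
        LEMMAS.put("had\tVBD", "have");
        LEMMAS.put("newspapers\tNNS", "newspaper");
    }

    // Returns the lemma for the word and tag pair, or "O" when there is no entry.
    public static String lemma(String word, String tag) {
        return LEMMAS.getOrDefault(word + "\t" + tag, "O");
    }

    public static void main(String[] args) {
        String[][] pairs = {{"cities", "NNS"}, {"had", "VBD"}, {"newspapers", "NNS"}};
        for (String[] p : pairs) {
            System.out.println(p[0] + ": stem=" + stem(p[0]) + " lemma=" + lemma(p[0], p[1]));
        }
        // prints:
        // cities: stem=citi lemma=city
        // had: stem=had lemma=have
        // newspapers: stem=newspaper lemma=newspaper
    }
}
```

The rule-based stem needs no POS tag and no dictionary, which is why it can produce citi, while the lookup returns a real word but only for pairs it knows.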

Common OpenNLP lemmatizer mistakes to avoid

  • Using tokens without POS tags: OpenNLP lemmatizers require both token and POS tag input.
  • Using mismatched tag sets: The dictionary entries must use POS tags that match the POS tagger output.
  • Expecting every word to have a lemma: Dictionary-based lemmatization returns an unknown marker when the word and tag pair is missing.
  • Confusing O with zero: The unknown marker is the capital letter O, not the digit 0. Handle it explicitly in your code.
  • Ignoring punctuation and proper nouns: Punctuation, abbreviations, and names often need separate handling.

Editorial QA checklist for this OpenNLP lemmatizer tutorial

  • Check that every lemmatizer example passes token and POS tag arrays of the same length.
  • Confirm that the dictionary file path in the Java code matches the project folder structure.
  • Verify that the POS model and dictionary use compatible POS tag labels.
  • Explain unknown lemma output clearly as an out-of-dictionary result.
  • Keep dictionary lemmatization and statistical LemmatizerME examples separate so readers know which file type is required.

FAQs on Apache OpenNLP lemmatizer example

What is an example of a lemmatizer in Apache OpenNLP?

An example is DictionaryLemmatizer. It takes tokens such as cities and POS tags such as NNS, looks up the word-tag pair in a dictionary, and returns the lemma city.

What does lemmatize do in OpenNLP?

The lemmatize() method returns lemma values for a token array and a matching POS tag array. The result array has one lemma result at the same index as each input token.

What is the difference between a lemmatizer and a stemmer?

A stemmer cuts or transforms a word using rules and may return a non-dictionary form. A lemmatizer returns a dictionary form such as city for cities or have for had, usually using POS information.

Why does OpenNLP DictionaryLemmatizer return O for some words?

It means the dictionary does not contain that exact word and POS tag combination. Proper nouns, punctuation, and domain-specific words are common cases where a dictionary entry may be missing.

Can OpenNLP lemmatize a sentence without POS tagging first?

Not reliably. OpenNLP lemmatizers expect POS tags along with tokens, so a normal pipeline tokenizes the sentence, applies a POS tagger, and then calls the lemmatizer.

What this Apache OpenNLP lemmatizer example covered

In this OpenNLP Tutorial, we have learnt what lemmatization is and how to implement it with the help of a Lemmatizer example in Apache OpenNLP. We also covered the required token and POS tag input, the dictionary file format, the difference between dictionary and statistical lemmatizers, and the common unknown-output case in dictionary-based lemmatization.