Document Classification using NGram Features in OpenNLP
In this tutorial, we shall learn how to use NGram features for Document Classification in OpenNLP using a Java example. N-grams help the document categorizer learn not only individual words, but also short word sequences such as two-word and three-word phrases that may be useful for category prediction.
This topic continues from the tutorials on document classification using the Maxent model and document classification using the Naive Bayes model, which explain in detail how to train a model for document classification or categorization with the default features built into DoccatFactory.
What NGramFeatureGenerator Adds to OpenNLP Document Classification
The default document categorizer can work with word-level features. NGramFeatureGenerator extends this by creating features from contiguous groups of words. For example, from the text romantic comedy movie, a unigram generator can create romantic, comedy, and movie, while a bigram generator can create romantic comedy and comedy movie.
This is useful when the meaning of a phrase is stronger than the meaning of individual words. In document classification, phrases such as love story, crime scene, stock market, or customer support may help the model separate one category from another more clearly.
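To make the unigram and bigram example above concrete, here is a minimal plain-Java sketch that lists every run of n consecutive words. It does not use OpenNLP. The class name NGramSketch and the method ngrams are hypothetical names chosen for illustration, and the space-joined strings are not the feature strings OpenNLP stores internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: NGramSketch and ngrams are hypothetical names, not OpenNLP API.
public class NGramSketch {

    // Returns every run of n consecutive tokens (n >= 1 assumed), joined with a single space.
    static List<String> ngrams(String[] tokens, int n) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start + n <= tokens.length; start++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "romantic comedy movie".split(" ");
        System.out.println(ngrams(tokens, 1)); // [romantic, comedy, movie]
        System.out.println(ngrams(tokens, 2)); // [romantic comedy, comedy movie]
    }
}
```

A text with fewer than n words simply yields an empty list, so short documents contribute no features of that size.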
OpenNLP DoccatFactory Configuration with Unigram, Bigram, and Trigram Features
The following Java snippet defines and initializes the N-gram feature generators that are used by the Document Categorizer.
FeatureGenerator[] featureGenerators = { new NGramFeatureGenerator(1,1),
new NGramFeatureGenerator(2,3) };
DoccatFactory factory = new DoccatFactory(featureGenerators);
featureGenerators is an array of feature generators, each of which implements the FeatureGenerator interface. You may also write your own feature generator by implementing FeatureGenerator and use it with the document categorizer by adding it to this array.
The arguments passed in new NGramFeatureGenerator(2,3), i.e., 2 and 3, are respectively the minimum and maximum number of words that should be considered as a feature. For more information on NGramFeatureGenerator, please refer to the Java documentation of NGramFeatureGenerator.
In the above configuration, new NGramFeatureGenerator(1,1) creates unigram features, and new NGramFeatureGenerator(2,3) creates bigram and trigram features. The resulting DoccatFactory is then passed to the training method so that these N-gram features are used while building the model.
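The effect of combining the two ranges can be sketched without OpenNLP. In the plain-Java example below, ngramsInRange is a hypothetical stand-in that plays the role of one generator with a minimum and maximum size, and the two calls in main mirror the (1,1) and (2,3) configuration. This is an approximation for illustration and not the library's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: NGramRangeSketch and ngramsInRange are hypothetical names.
public class NGramRangeSketch {

    // Collects all n-grams whose length lies between minGram and maxGram (inclusive).
    static List<String> ngramsInRange(String[] tokens, int minGram, int maxGram) {
        List<String> features = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            for (int start = 0; start + n <= tokens.length; start++) {
                features.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "mysterious", "crime", "scene"};
        List<String> all = new ArrayList<>();
        all.addAll(ngramsInRange(tokens, 1, 1)); // stands in for the (1,1) generator
        all.addAll(ngramsInRange(tokens, 2, 3)); // stands in for the (2,3) generator
        System.out.println(all.size()); // 4 unigrams + 3 bigrams + 2 trigrams = 9
        System.out.println(all);
    }
}
```

Even for a four-word text, the combined configuration more than doubles the number of features compared with unigrams alone.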
Training Data Format for OpenNLP Document Categorizer with NGram Features
The document categorizer training file should contain one training sample per line. The first token is the category label, and the remaining tokens form the document text. Keep labels consistent, because labels such as Romantic and romantic are treated as different categories.
Romantic A love story about two people who meet during a journey
Thriller A detective follows clues after a mysterious crime scene
Romantic The film focuses on relationships family and emotional choices
Thriller The story includes a chase investigation and hidden evidence
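The line format described above can be checked with a few lines of plain Java before training. The sketch below assumes the label is separated from the text by whitespace. TrainingLineSketch and parse are hypothetical names, and this is not the parser that OpenNLP itself uses.

```java
// Illustration only: TrainingLineSketch and parse are hypothetical names.
public class TrainingLineSketch {

    // Splits "Label rest of the text" into {label, text}.
    // Returns null when the line is blank or contains a label with no text.
    static String[] parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            return null;
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] sample = parse("Romantic A love story about two people who meet during a journey");
        System.out.println(sample[0]); // Romantic
        System.out.println(sample[1]); // A love story about two people who meet during a journey
        System.out.println(parse("Thriller") == null); // true
    }
}
```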
N-gram features usually increase the number of predicates produced during training. Because of this, use enough representative training examples for every category. With very small training data, a model can appear accurate on the sample used for testing but fail on new documents.
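The growth in features can be estimated on a small sample before training. The following plain-Java sketch counts distinct n-grams over two short made-up documents. FeatureCountSketch and distinctNgrams are hypothetical names, and the count is only a rough proxy for the predicate count that the OpenNLP trainer reports.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: FeatureCountSketch and distinctNgrams are hypothetical names.
public class FeatureCountSketch {

    // Counts distinct n-grams of sizes minGram..maxGram across all documents.
    static int distinctNgrams(List<String[]> docs, int minGram, int maxGram) {
        Set<String> seen = new HashSet<>();
        for (String[] tokens : docs) {
            for (int n = minGram; n <= maxGram; n++) {
                for (int start = 0; start + n <= tokens.length; start++) {
                    seen.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
                }
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                "a love story about two people".split(" "),
                "a crime story about hidden evidence".split(" "));
        System.out.println(distinctNgrams(docs, 1, 1)); // 9
        System.out.println(distinctNgrams(docs, 1, 3)); // 26
    }
}
```

With only twelve words of text, moving from unigrams to unigrams, bigrams, and trigrams almost triples the number of distinct features, and most of the added features occur only once.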
Example – Document Classification using NGram Features in OpenNLP
The complete program that takes in a training file, incorporates NGramFeatureGenerator, trains a model, and saves it is shown in the following.
DocClassificationNGramFeaturesDemo.java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizer;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.doccat.NGramFeatureGenerator;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
/**
* OpenNLP version 1.7.2
* Usage of NGram features for Document Classification in OpenNLP
* @author www.tutorialkart.com
*/
public class DocClassificationNGramFeaturesDemo {
public static void main(String[] args) {
try {
// read the training data
InputStreamFactory dataIn = new MarkableFileInputStreamFactory(new File("train"+File.separator+"en-movie-category.train"));
ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);
// define the training parameters
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 10+"");
params.put(TrainingParameters.CUTOFF_PARAM, 0+"");
// feature generators - N-gram feature generators
FeatureGenerator[] featureGenerators = { new NGramFeatureGenerator(1,1),
new NGramFeatureGenerator(2,3) };
DoccatFactory factory = new DoccatFactory(featureGenerators);
// create a model from training data
DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, factory);
System.out.println("\nModel is successfully trained.");
// save the model to a local file; try-with-resources closes the stream after writing
try (BufferedOutputStream modelOut = new BufferedOutputStream(new FileOutputStream("model"+File.separator+"en-movie-classifier-maxent.bin"))) {
model.serialize(modelOut);
}
System.out.println("\nTrained Model is saved locally at : "+"model"+File.separator+"en-movie-classifier-maxent.bin");
// test the model file by subjecting it to prediction
DocumentCategorizer doccat = new DocumentCategorizerME(model);
String[] docWords = "Afterwards Stuart and Charlie notice Kate in the photos Stuart took at Leopolds ball and realise that her destiny must be to go back and be with Leopold That night while Kate is accepting her promotion at a company banquet he and Charlie race to meet her and show her the pictures Kate initially rejects their overtures and goes on to give her acceptance speech but it is there that she sees Stuarts picture and realises that she truly wants to be with Leopold".replaceAll("[^A-Za-z]", " ").split(" ");
double[] aProbs = doccat.categorize(docWords);
// print the probabilities of the categories
System.out.println("\n---------------------------------\nCategory : Probability\n---------------------------------");
for(int i=0;i<doccat.getNumberOfCategories();i++){
System.out.println(doccat.getCategory(i)+" : "+aProbs[i]);
}
System.out.println("---------------------------------");
System.out.println("\n"+doccat.getBestCategory(aProbs)+" : is the predicted category for the given sentence.");
}
catch (IOException e) {
System.out.println("An exception in reading the training file. Please check.");
e.printStackTrace();
}
}
}
The training file could be downloaded from here.
Output
Indexing events using cutoff of 0
Computing event counts... done. 66 events
Indexing... done.
Sorting and merging events... done. Reduced 66 events to 66.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 66
Number of Outcomes: 2
Number of Predicates: 74080
...done.
Computing model parameters ...
Performing 10 iterations.
1: ... loglikelihood=-45.747713916956386 0.4090909090909091
2: ... loglikelihood=-39.482448265755195 1.0
3: ... loglikelihood=-34.73809942995604 1.0
4: ... loglikelihood=-31.01773543201995 1.0
5: ... loglikelihood=-28.021100513571625 1.0
6: ... loglikelihood=-25.55532624366708 1.0
7: ... loglikelihood=-23.490627352875972 1.0
8: ... loglikelihood=-21.736377961873213 1.0
9: ... loglikelihood=-20.227350308507212 1.0
10: ... loglikelihood=-18.915391558485368 1.0
Model is successfully trained.
Trained Model is saved locally at : model/en-movie-classifier-maxent.bin
---------------------------------
Category : Probability
---------------------------------
Thriller : 0.4912161738321056
Romantic : 0.5087838261678945
---------------------------------
Romantic : is the predicted category for the given sentence.
Reading the OpenNLP NGram Classification Output
The output first shows the training progress. The line Number of Outcomes: 2 indicates that the training data contains two categories. The line Number of Predicates: 74080 shows that a large number of features were generated, which is expected when unigrams, bigrams, and trigrams are used together.
After training, the program prints the probability for each category and then prints the best category returned by doccat.getBestCategory(aProbs). In this example, the test text is classified as Romantic because that category has the higher probability.
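Selecting the best category amounts to picking the label with the highest probability. The plain-Java sketch below reproduces that step with the two probabilities from the output above. BestCategorySketch and bestCategory are hypothetical names, and this is not necessarily how getBestCategory is implemented inside OpenNLP.

```java
// Illustration only: BestCategorySketch and bestCategory are hypothetical names.
public class BestCategorySketch {

    // Returns the category whose probability is the largest; ties keep the earlier category.
    static String bestCategory(String[] categories, double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        String[] categories = {"Thriller", "Romantic"};
        double[] probs = {0.4912161738321056, 0.5087838261678945};
        System.out.println(bestCategory(categories, probs)); // Romantic
    }
}
```

Note how close the two probabilities are in this run: the margin is under two percentage points, so the prediction should be read as a weak preference rather than a confident decision.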
Choosing NGram Ranges for OpenNLP Document Categorization
The N-gram range should match the kind of phrase patterns you expect in the documents. Bigger N-grams can capture more context, but they also create more features and require more training data.
| NGramFeatureGenerator | Features captured | When to use |
|---|---|---|
| new NGramFeatureGenerator(1,1) | Single words | Good baseline for most document classification tasks. |
| new NGramFeatureGenerator(2,2) | Two-word phrases | Useful when phrase meaning is important, such as customer support or crime scene. |
| new NGramFeatureGenerator(2,3) | Two-word and three-word phrases | Useful when short expressions identify categories, but it needs more training data. |
| new NGramFeatureGenerator(1,3) | Single words, two-word phrases, and three-word phrases | Useful for experiments, but monitor model size and overfitting. |
For a new classifier, start with unigrams. Then add bigrams or trigrams only when you have enough examples and phrase-level information is likely to help. The Apache OpenNLP manual is also useful for checking the document categorizer workflow and training API behavior.
Common Issues with NGram Features in OpenNLP Document Classification
- Too few training samples: N-grams generate many features. If each category has only a few examples, the model may learn phrases from the sample file instead of learning general patterns.
- Inconsistent text cleaning: Use similar tokenization and cleanup for both training and prediction. In the example, the prediction text removes non-alphabetic characters before splitting into words.
- Very high feature count: If training becomes slow or the model becomes too large, reduce the maximum N-gram size or use a higher cutoff value.
- Confusing category labels: Keep category names stable and avoid spaces in labels unless your training format handles them correctly.
- Testing on training-like text only: Always test the trained model on documents that were not used during training.
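One way to keep text cleaning consistent is to route every document through a single helper. The plain-Java sketch below uses the same non-alphabetic filter as the example program, but splits on runs of whitespace so that consecutive spaces do not produce empty tokens, which split(" ") can do. TextCleanupSketch and tokenize are hypothetical names for illustration.

```java
import java.util.Arrays;

// Illustration only: TextCleanupSketch and tokenize are hypothetical names.
public class TextCleanupSketch {

    // Replaces every non-alphabetic character with a space, then splits on whitespace runs.
    static String[] tokenize(String text) {
        String cleaned = text.replaceAll("[^A-Za-z]", " ").trim();
        if (cleaned.isEmpty()) {
            return new String[0];
        }
        return cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Kate's destiny, 2001!"))); // [Kate, s, destiny]
    }
}
```

Whether this particular filter is appropriate depends on the data. It discards digits and splits words at apostrophes, so it is only a starting point.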
QA Checklist for an OpenNLP NGram Document Classifier
- Confirm that every line in the training file starts with one valid category label followed by document text.
- Verify that training and prediction apply the same text cleanup rules.
- Compare a unigram-only model with a unigram-plus-N-gram model before choosing the final feature set.
- Check whether the number of predicates is reasonable for the size of the training data.
- Evaluate the model on separate test documents instead of relying only on the training output.
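The last checklist item can be sketched as a simple accuracy calculation over documents that were held out from training. In the plain-Java example below, the classifier is passed in as a function so that the sketch runs without OpenNLP. HoldOutAccuracySketch, accuracy, and keywordClassifier are hypothetical names, and the keyword rule is only a stand-in for a trained model.

```java
import java.util.Arrays;
import java.util.function.Function;

// Illustration only: all names in this class are hypothetical.
public class HoldOutAccuracySketch {

    // Fraction of held-out documents whose predicted label equals the expected label.
    static double accuracy(String[][] docs, String[] expected, Function<String[], String> classifier) {
        if (docs.length == 0) {
            return 0.0;
        }
        int correct = 0;
        for (int i = 0; i < docs.length; i++) {
            if (classifier.apply(docs[i]).equals(expected[i])) {
                correct++;
            }
        }
        return (double) correct / docs.length;
    }

    // Stand-in for a trained model: predicts Thriller only when the word "crime" is present.
    static String keywordClassifier(String[] tokens) {
        return Arrays.asList(tokens).contains("crime") ? "Thriller" : "Romantic";
    }

    public static void main(String[] args) {
        String[][] heldOut = {{"crime", "scene"}, {"love", "story"}, {"hidden", "evidence"}};
        String[] expected = {"Thriller", "Romantic", "Thriller"};
        // Two of the three held-out documents are classified correctly.
        System.out.println(accuracy(heldOut, expected, HoldOutAccuracySketch::keywordClassifier));
    }
}
```

With a real model, the function argument would wrap the categorizer call and return its best category for the given tokens.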
FAQs on NGram Features for OpenNLP Document Classification
What is NGram text classification in OpenNLP?
NGram text classification in OpenNLP means using word sequences as features while training a document categorizer. Instead of relying only on individual words, the model can also learn short phrases such as two-word and three-word combinations.
How do I do document classification with NGram features in OpenNLP?
Create a training file with category labels and document text, define one or more NGramFeatureGenerator objects, pass them to DoccatFactory, and use that factory while training DocumentCategorizerME.
Should I use unigrams, bigrams, or trigrams for OpenNLP document categorization?
Use unigrams as a baseline. Add bigrams or trigrams when phrases carry category-specific meaning and you have enough training examples to support the larger feature set.
Why does the number of predicates increase when I add NGramFeatureGenerator?
NGramFeatureGenerator creates additional features from word sequences. When you add bigrams and trigrams, the trainer sees many more possible features, so the predicate count can increase significantly.
Can I create my own feature generator for OpenNLP document classification?
Yes. A custom feature generator can be created by implementing the FeatureGenerator interface and adding it to the FeatureGenerator[] used by DoccatFactory.
Summary: NGram Features for OpenNLP Document Categorizer
In this Apache OpenNLP Tutorial, we have learnt how to use NGramFeatureGenerator with the Document Categorizer for document classification. N-gram features are most useful when short phrases contain category-specific information. Start with simple unigram features, add bigrams or trigrams when needed, and evaluate the model with separate test documents before using it for real classification tasks.