Language Detector Example in Apache OpenNLP
This tutorial explains how to train and use an Apache OpenNLP language detector with LanguageDetectorME. The example loads labelled training text, trains a language detection model, predicts the language of a new sentence, and prints confidence scores for the possible language labels.
In Apache OpenNLP, language detection is similar to document categorization: each training sample is a piece of text, and the first token in that line is the language label. The label can be a language code such as eng, spa, or any consistent code used in your training file. The model returns the same labels during prediction.
Apache OpenNLP langdetect package and project setup
The original version of this tutorial was written when the langdetect package had just been merged into opennlp-master on GitHub. In modern OpenNLP projects, first check whether the opennlp.tools.langdetect package is already available through your opennlp-tools dependency. If you are working with an older source checkout, you can still build the project from source.
To build the project from source, clone opennlp-master from GitHub and follow the Maven build instructions in README.md.
Once the project is built, import it into the IDE of your choice, such as Eclipse or IntelliJ IDEA.
The training file and the code in this example are adapted from the opennlp-tools test folder. Feel free to explore more methods at https://github.com/apache/opennlp/tree/master/opennlp-tools/src/test/java/opennlp/tools/langdetect.
If you are using Maven, add the OpenNLP tools dependency in your project and replace the version with the OpenNLP version used in your application.
<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>YOUR_OPENNLP_VERSION</version>
</dependency>
Apache OpenNLP language training data format
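The language detector needs one labelled sample per line. The first token is the language label, followed by a single whitespace separator, followed by the text sample. A tab is the safest separator, because LanguageDetectorSampleStream in the OpenNLP releases we know of splits each line at the first tab; check the class in your version if lines are silently skipped. Keep the file in UTF-8 so that accents, Indian-language characters, and other non-ASCII text are read correctly.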
eng This is a sample English sentence for language detection.
spa Esta es una oración de ejemplo en español.
fra Ceci est une phrase française d'exemple.
hin यह भाषा पहचान के लिए एक हिंदी वाक्य है।
This structure is similar to the training data of the document categorizer: each line belongs to one language, the first token on the line is the language label, and the label is separated from the text by a whitespace character.
See DoccatSample.txt for a sample training file. It is useful for learning the format, but for real language detection you should train with many representative samples for every language you want to detect.
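Before training, it can help to check that every line really has the label-then-text shape described above. The following is a minimal sketch using only the Java standard library. TrainingLineCheck and its methods are hypothetical helpers, not part of OpenNLP, and the sketch assumes that any whitespace character may act as the separator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class TrainingLineCheck {

    /** Returns {label, text}, or null when the label or the text is missing. */
    static String[] splitLabelAndText(String line) {
        for (int i = 0; i < line.length(); i++) {
            if (Character.isWhitespace(line.charAt(i))) {
                if (i == 0) {
                    return null; // line starts with whitespace, so the label is empty
                }
                String text = line.substring(i + 1).trim();
                return text.isEmpty() ? null : new String[] {line.substring(0, i), text};
            }
        }
        return null; // no separator found: empty line or label without text
    }

    /** Returns the 1-based numbers of the lines that do not match the expected shape. */
    static List<Integer> findBadLines(List<String> lines) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (splitLabelAndText(lines.get(i)) == null) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "eng This is a sample English sentence for language detection.",
                "spa\tEsta es una frase de ejemplo.",
                "fra",   // label without text
                "");     // empty line
        System.out.println("Malformed lines: " + findBadLines(lines)); // prints [3, 4]
    }
}
```

In a real project you would read the lines from the training file instead of a hard-coded list and fix the reported lines before training.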
Steps to Train and Use Apache OpenNLP LanguageDetectorME
The following steps show how to train and use a LanguageDetector from Apache OpenNLP.
Step 1: Load UTF-8 language training data into LanguageDetectorSampleStream
Load the training data into LanguageDetectorSampleStream. The sample stream reads the labelled lines and converts them into language detector samples.
LanguageDetectorSampleStream sampleStream = null;
try {
    // training file, resolved relative to the working directory
    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(
            new File("training-data" + File.separator + "DoccatSample.txt"));
    // read the file line by line as UTF-8
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    // convert each labelled line into a language sample
    sampleStream = new LanguageDetectorSampleStream(lineStream);
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
Make sure the file path points to your training file. In this example, the file is expected at training-data/DoccatSample.txt relative to the project working directory.
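If the file is found but the samples look garbled, the encoding is a likely cause. As a rough sketch, the check below decodes the file bytes strictly as UTF-8 and reports whether they are valid. Utf8FileCheck is a hypothetical helper written for this tutorial, and it only uses the Java standard library.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of OpenNLP.
class Utf8FileCheck {

    /** Returns true only if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // same relative location as in the tutorial
        Path path = Paths.get("training-data", "DoccatSample.txt");
        if (!Files.isReadable(path)) {
            System.out.println("Cannot read " + path.toAbsolutePath());
            return;
        }
        System.out.println("Valid UTF-8: " + isValidUtf8(Files.readAllBytes(path)));
    }
}
```

The strict decoder is used on purpose: the default String constructor silently replaces bad bytes, which would hide the problem instead of reporting it.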
Step 2: Define Apache OpenNLP language detector training parameters
Training parameters control how the model is trained. The example below sets the number of iterations, cutoff, data indexer, and algorithm.
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 100);
params.put(TrainingParameters.CUTOFF_PARAM, 5);
params.put("DataIndexer", "TwoPass");
params.put(TrainingParameters.ALGORITHM_PARAM, "NAIVEBAYES");
Training parameters configure the training algorithm, and they also let you choose which algorithm is used to train the language detector (NAIVEBAYES in this example).
The best values for the number of iterations, the cutoff, and the algorithm depend on the amount and quality of training data. For a small demo file, these values are enough to show the flow. For production use, evaluate the model on separate test data before trusting the predictions.
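One simple way to get separate test data is to hold back part of the labelled file before training. The sketch below sends every n-th line of each label to a test list and keeps the rest for training. HoldOutSplit is a hypothetical helper, the value of n is your choice, and the sketch assumes the label is the first whitespace-delimited token on each line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP.
class HoldOutSplit {
    final List<String> train = new ArrayList<>();
    final List<String> test = new ArrayList<>();

    /** Sends every n-th line of each label to the test list; all other lines go to training. */
    static HoldOutSplit split(List<String> labelledLines, int everyNth) {
        if (everyNth < 2) {
            throw new IllegalArgumentException("everyNth must be at least 2");
        }
        HoldOutSplit result = new HoldOutSplit();
        Map<String, Integer> seenPerLabel = new HashMap<>();
        for (String line : labelledLines) {
            String label = line.split("\\s+", 2)[0];
            int seen = seenPerLabel.merge(label, 1, Integer::sum);
            if (seen % everyNth == 0) {
                result.test.add(line);
            } else {
                result.train.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "eng first", "eng second", "eng third", "eng fourth",
                "spa primero", "spa segundo");
        HoldOutSplit parts = split(lines, 2);
        System.out.println("train: " + parts.train);
        System.out.println("test:  " + parts.test);
    }
}
```

Counting per label keeps every language represented in both lists, which a plain "first 80 percent" cut does not guarantee when the file is sorted by language.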
Step 3: Train the Apache OpenNLP language detector model
Train the model by passing the sample stream, training parameters, and a LanguageDetectorFactory to LanguageDetectorME.train().
LanguageDetectorModel model = LanguageDetectorME.train(sampleStream, params, new LanguageDetectorFactory());
Step 4: Predict language labels with Apache OpenNLP LanguageDetector
Once the model is trained, wrap it in a LanguageDetectorME to make predictions. The code below prints the confidence score of each possible language for a test sentence.
LanguageDetector ld = new LanguageDetectorME(model);
Language[] languages = ld.predictLanguages("estava em uma marcenaria na Rua Bruno");
System.out.println("Predicted languages..");
for (Language language : languages) {
    System.out.println(language.getLang() + " confidence:" + language.getConfidence());
}
The method predictLanguages() returns multiple possible language labels, ordered by model confidence. If you need only the top prediction, use the highest-confidence result or the single-language prediction method available in your OpenNLP version.
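When only one answer is needed, an application may also want to reject low-confidence results. The sketch below picks the highest-confidence entry and falls back to a placeholder label when the score is under a threshold. Scored is a stand-in for OpenNLP's Language class so that the code runs without the library. The 0.8 threshold and the "und" fallback label are illustrative assumptions, not OpenNLP defaults.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP.
class TopPrediction {

    /** Stand-in for OpenNLP's Language: a label with its confidence. */
    static final class Scored {
        final String lang;
        final double confidence;

        Scored(String lang, double confidence) {
            this.lang = lang;
            this.confidence = confidence;
        }
    }

    /** Returns the best label, or the fallback if the list is empty or the best score is too low. */
    static String pick(List<Scored> predictions, double minConfidence, String fallback) {
        Scored best = null;
        for (Scored candidate : predictions) {
            if (best == null || candidate.confidence > best.confidence) {
                best = candidate;
            }
        }
        return (best != null && best.confidence >= minConfidence) ? best.lang : fallback;
    }

    public static void main(String[] args) {
        // confidences copied from the sample output of this tutorial
        List<Scored> predictions = Arrays.asList(
                new Scored("pob", 0.9998990013343246),
                new Scored("ita", 1.0030518375770318E-4),
                new Scored("spa", 6.934808895132994E-7),
                new Scored("fra", 1.0283097500463277E-12));
        System.out.println(pick(predictions, 0.8, "und")); // prints pob
    }
}
```

With the real API, the same idea applies to the Language objects returned by the detector, using getLang() and getConfidence().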
Complete Java Example for Apache OpenNLP Language Detector
LanguageDetectorMEExample.java
import java.io.*;
import opennlp.tools.langdetect.*;
import opennlp.tools.util.*;

/**
 * Language Detector Example in Apache OpenNLP
 */
public class LanguageDetectorMEExample {

    public static void main(String[] args) {
        LanguageDetectorModel model;
        try {
            // load the training data into a LanguageDetectorSampleStream
            InputStreamFactory dataIn = new MarkableFileInputStreamFactory(
                    new File("training-data" + File.separator + "DoccatSample.txt"));
            ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
            LanguageDetectorSampleStream sampleStream = new LanguageDetectorSampleStream(lineStream);

            // training parameters
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, 100);
            params.put(TrainingParameters.CUTOFF_PARAM, 5);
            params.put("DataIndexer", "TwoPass");
            params.put(TrainingParameters.ALGORITHM_PARAM, "NAIVEBAYES");

            // train the model
            model = LanguageDetectorME.train(sampleStream, params, new LanguageDetectorFactory());
        } catch (IOException e) {
            // covers a missing training file as well as read errors during training
            e.printStackTrace();
            return; // without a model there is nothing to predict with
        }
        System.out.println("Completed");

        // wrap the trained model in a detector
        LanguageDetector ld = new LanguageDetectorME(model);

        // use the model to predict the language
        Language[] languages = ld.predictLanguages("estava em uma marcenaria na Rua Bruno");
        System.out.println("Predicted languages..");
        for (Language language : languages) {
            // print each language label and the confidence that the test text belongs to it
            System.out.println(language.getLang() + " confidence:" + language.getConfidence());
        }
    }
}
Output :
Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 99 events
Indexing... done.
Collecting events... Done indexing in 1.35 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 99
Number of Outcomes: 4
Number of Predicates: 4849
Computing model parameters...
Stats: (25/99) 0.25252525252525254
...done.
Completed
Predicted languages..
pob confidence:0.9998990013343246
ita confidence:1.0030518375770318E-4
spa confidence:6.934808895132994E-7
fra confidence:1.0283097500463277E-12
In this sample output, pob is the top predicted label, and it is printed as pob because that is the label used in the training data. If your file uses por or pt for Portuguese, the output label will use that code instead. OpenNLP does not rename the label automatically.
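Since the detector only echoes the labels from the training file, any readable language name has to come from your own code. A minimal sketch of such a lookup follows. LanguageNames is a hypothetical helper, and the mapping is an assumption about what the labels mean in your file, for example that pob stands for Brazilian Portuguese.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP.
class LanguageNames {
    private static final Map<String, String> NAMES = new HashMap<>();

    static {
        // assumed meanings of the labels; adjust to the codes in your training file
        NAMES.put("pob", "Portuguese (Brazil)");
        NAMES.put("ita", "Italian");
        NAMES.put("spa", "Spanish");
        NAMES.put("fra", "French");
    }

    /** Returns a readable name for a label, or the label itself if it is unknown. */
    static String displayName(String label) {
        return NAMES.getOrDefault(label, label);
    }

    public static void main(String[] args) {
        System.out.println(displayName("pob")); // prints Portuguese (Brazil)
        System.out.println(displayName("xyz")); // prints xyz
    }
}
```

Returning the raw label for unknown codes keeps the output usable when the training file gains a language that the table does not cover yet.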
Save and Reuse an Apache OpenNLP LanguageDetectorModel
In real applications, do not train the language detector every time the program starts. Train the model once, serialize it to a file, and load the saved model for prediction.
try (OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("langdetect.bin"))) {
    model.serialize(modelOut);
}
try (InputStream modelIn = new BufferedInputStream(new FileInputStream("langdetect.bin"))) {
    LanguageDetectorModel loadedModel = new LanguageDetectorModel(modelIn);
    LanguageDetector detector = new LanguageDetectorME(loadedModel);
    Language bestLanguage = detector.predictLanguage("This is a test sentence.");
    System.out.println(bestLanguage.getLang() + " confidence:" + bestLanguage.getConfidence());
}
Model reuse is usually the preferred approach in web services, batch jobs, and desktop applications because prediction is much faster than repeated training.
Common Issues in Apache OpenNLP Language Detection
- Very short text: A single word or two-word phrase may not contain enough clues to identify the language reliably.
- Small training data: A few lines per language may run, but it usually does not give a dependable model.
- Mixed-language input: A sentence containing two or more languages may return the language that dominates the text or vocabulary.
- Uneven training samples: If one language has many more samples than another, the model may become biased toward the larger class.
- Wrong file encoding: Use UTF-8 consistently while preparing the training file and while reading it with PlainTextByLineStream.
- Unexpected output code: OpenNLP returns the label used in the first column of your training data. Check your training file if the output code looks unfamiliar.
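Small or uneven training data is easy to spot by counting the samples per label before training. The sketch below does that with the Java standard library. LabelBalance is a hypothetical helper, and it assumes the label is the first whitespace-delimited token on each non-blank line. What ratio counts as too uneven is left to you.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of OpenNLP.
class LabelBalance {

    /** Counts the samples per label, sorted by label; blank lines are ignored. */
    static Map<String, Integer> countPerLabel(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String label = line.split("\\s+", 2)[0];
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns largest count divided by smallest count, or 0.0 when there are no samples. */
    static double imbalanceRatio(Map<String, Integer> counts) {
        if (counts.isEmpty()) {
            return 0.0;
        }
        int max = Collections.max(counts.values());
        int min = Collections.min(counts.values());
        return (double) max / min;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "eng one", "eng two", "eng three", "spa uno", "");
        Map<String, Integer> counts = countPerLabel(lines);
        System.out.println(counts);                 // prints {eng=3, spa=1}
        System.out.println(imbalanceRatio(counts)); // prints 3.0
    }
}
```

A ratio close to 1 means the languages are evenly represented; a large ratio suggests adding samples for the smaller classes.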
Apache OpenNLP Language Detector FAQs
What is LanguageDetectorME in Apache OpenNLP?
LanguageDetectorME is the OpenNLP implementation used to train and run a statistical language detector. It takes text as input and returns one or more predicted language labels with confidence scores.
What should the Apache OpenNLP language detection training file contain?
The training file should contain one labelled sample per line. The first token must be the language label, followed by whitespace and then the text sample for that language.
Can Apache OpenNLP detect languages without training a model?
This example trains a custom model from labelled data. If your OpenNLP setup provides a ready-made language detection model, you can load that model directly. Otherwise, you need to train your own model before prediction.
Why does the OpenNLP language detector return codes like pob, ita, spa and fra?
The detector returns the exact labels found in the training data. In the sample output, pob, ita, spa, and fra are the outcome labels from the training file, not names generated by OpenNLP.
Is confidence from Apache OpenNLP language detection always reliable?
Confidence scores should be interpreted together with input length, training data quality, and the list of languages in the model. A high score on a tiny or biased training set does not guarantee real-world accuracy.
Apache OpenNLP Language Detection QA Checklist
- Confirm that the training file is saved in UTF-8 and read with the same encoding.
- Confirm that every training line starts with exactly one language label followed by sample text.
- Use consistent labels such as eng, spa, fra, or your chosen project-specific codes.
- Train with enough samples for every language included in the detector.
- Test the trained model with text that was not used during training.
- Serialize the trained model and load it for prediction instead of retraining on every run.
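The testing item in this checklist can be turned into a number by measuring accuracy on labelled lines that were not used for training. The sketch below is one way to do it. HeldOutAccuracy is a hypothetical helper, and the detector is passed in as a plain function so the code runs without OpenNLP. The toy detector in main is only a placeholder.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of OpenNLP.
class HeldOutAccuracy {

    /** Returns the share of lines whose label matches the detector's answer for the text. */
    static double accuracy(List<String> labelledLines, Function<String, String> detector) {
        int total = 0;
        int correct = 0;
        for (String line : labelledLines) {
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) {
                continue; // skip lines without text
            }
            total++;
            if (parts[0].equals(detector.apply(parts[1]))) {
                correct++;
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Placeholder detector. With OpenNLP you would pass something like
        // text -> ld.predictLanguage(text).getLang() instead.
        Function<String, String> toyDetector = text -> text.contains("the") ? "eng" : "spa";
        List<String> heldOut = Arrays.asList(
                "eng the cat sleeps", "spa el gato duerme", "eng a dog barks");
        System.out.println("accuracy: " + accuracy(heldOut, toyDetector));
    }
}
```

Here the toy detector gets two of three lines right, because the third English line does not contain "the".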
Conclusion for Apache OpenNLP Language Detector Example
In this Apache OpenNLP Tutorial, we have learnt how to use Language Detector in Apache OpenNLP, an NLP library. The important points are to prepare labelled UTF-8 training data, choose suitable training parameters, train a LanguageDetectorModel, and reuse the saved model for predictions.
TutorialKart.com