Tokenizer Example in Apache OpenNLP using Java

This tutorial shows how to tokenize text in Java using Apache OpenNLP. We will use the same sentence with three OpenNLP tokenizer options so that you can clearly see how TokenizerME, WhitespaceTokenizer, and SimpleTokenizer differ in their output.

What tokenization means in an OpenNLP Java tokenizer example

Tokenization is the process of splitting a text string into smaller parts called tokens. In natural language processing, tokens are usually words, punctuation marks, numbers, or other meaningful text units.

For example, consider the sentence "John is 26 years old." A tokenizer may keep the full stop attached to the last word, giving "old.", or it may return the full stop as a separate token, depending on the tokenizer implementation.

Tokenizer input	Possible token output	What to notice
John is 26 years old.	[John, is, 26, years, old, .]	The period is separated as a punctuation token.
John is 26 years old.	[John, is, 26, years, old.]	The period remains attached to `old.` when only whitespace is used.
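
To make the two rows above concrete, here is a small plain-Java sketch that produces both outputs with regular expressions. It does not use OpenNLP and is not how OpenNLP tokenizes internally; the class name TokenizationConcept and both helper methods are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration of the two outputs in the table above.
// This only mimics the results; it is not OpenNLP code.
class TokenizationConcept {

	// A run of letters/digits, or one single symbol such as "."
	private static final Pattern WORD_OR_SYMBOL =
			Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

	// Splits on whitespace only, so "old." stays one token.
	static List<String> splitOnWhitespace(String text) {
		String trimmed = text.trim();
		if (trimmed.isEmpty()) {
			return new ArrayList<>();
		}
		return Arrays.asList(trimmed.split("\\s+"));
	}

	// Returns words and punctuation marks as separate tokens.
	static List<String> splitWordsAndPunctuation(String text) {
		List<String> tokens = new ArrayList<>();
		Matcher matcher = WORD_OR_SYMBOL.matcher(text);
		while (matcher.find()) {
			tokens.add(matcher.group());
		}
		return tokens;
	}

	public static void main(String[] args) {
		String sentence = "John is 26 years old.";
		System.out.println(splitOnWhitespace(sentence));        // [John, is, 26, years, old.]
		System.out.println(splitWordsAndPunctuation(sentence)); // [John, is, 26, years, old, .]
	}
}
```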

Apache OpenNLP tokenizer choices in Java

The Tokenizer API in OpenNLP provides the following commonly used ways to tokenize text:

TokenizerME class loaded with a token model
WhitespaceTokenizer
SimpleTokenizer

Note: The original examples on this page use Apache OpenNLP version 1.7.2. The main idea is the same in later OpenNLP versions, but dependency versions and model download locations may change.

If you are using Maven, add the opennlp-tools dependency shown below to your pom.xml so that the library is on the classpath. For the model-based example, also place the English tokenizer model file en-token.bin in the working directory, or change the file path in the code.

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.7.2</version>
</dependency>

Observe the differences in the output from these three tokenization methods in the examples below.

OpenNLP tokenizer	Requires model file?	Best suited for	Output behavior in this tutorial
`TokenizerME`	Yes, for example `en-token.bin`	Model-based tokenization where punctuation and learned token boundaries matter.	Returns `old` and `.` as separate tokens.
`WhitespaceTokenizer`	No	Simple splitting where spaces are the only boundaries you want to consider.	Returns `old.` as one token.
`SimpleTokenizer`	No	Quick rule-based tokenization without loading a model.	Returns `old` and `.` as separate tokens.

TokenizerME class Loaded with a Token Model

TokenizerME is the model-based tokenizer in Apache OpenNLP. Use it when you want token boundaries predicted by a trained tokenizer model instead of only using simple rules.

Step 1: Read the pretrained model into a stream.

InputStream modelIn = new FileInputStream("en-token.bin");

Step 2: Load the stream into a TokenizerModel.

TokenizerModel model = new TokenizerModel(modelIn);

Step 3: Initialize the tokenizer with the model.

TokenizerME tokenizer = new TokenizerME(model);

Step 4: Use the TokenizerME.tokenize() method to extract the tokens into a String array.

String tokens[] = tokenizer.tokenize("John is 26 years old.");

Step 5: Use TokenizerME.getTokenProbabilities() to get the probability associated with each token returned by the most recent tokenize() call.

double tokenProbs[] = tokenizer.getTokenProbabilities();

Step 6: Finally, print the results.

Putting these steps together, the following program uses the pretrained token model:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

/**
 * www.tutorialkart.com
 * Tokenizer Example in Apache OpenNLP using TokenizerME class loaded with pre-trained token model
 */
public class TokenizerModelExample {

	public static void main(String[] args) {
		// try-with-resources closes the model stream automatically
		try (InputStream modelIn = new FileInputStream("en-token.bin")) {
			TokenizerModel model = new TokenizerModel(modelIn);
			TokenizerME tokenizer = new TokenizerME(model);
			String tokens[] = tokenizer.tokenize("John is 26 years old.");
			// probabilities refer to the most recent tokenize() call
			double tokenProbs[] = tokenizer.getTokenProbabilities();

			System.out.println("Token\t: Probability\n-------------------------------");
			for(int i=0;i<tokens.length;i++){
				System.out.println(tokens[i]+"\t: "+tokenProbs[i]);
			}
		}
		catch (IOException e) {
			// reached if the model file is missing or cannot be read
			e.printStackTrace();
		}
	}
}

When the above program is run, the output to the console is as shown below:

Token	: Probability
-------------------------------
John	: 1.0
is  	: 1.0
26  	: 1.0
years	: 1.0
old 	: 0.9954218897531331
.   	: 1.0

In this output, TokenizerME returns "old" and the period as separate tokens. The probability values belong to the tokenization result of the most recent call to tokenize().
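
One possible use of these values is to flag tokens whose probability falls below a cutoff. The sketch below is plain Java and takes the two arrays as input, so it runs without a model. The 0.999 cutoff and the names LowConfidenceTokens and below are assumptions made for illustration, not OpenNLP recommendations.

```java
import java.util.ArrayList;
import java.util.List;

// Flags tokens whose probability is below a chosen cutoff.
// In a real program the arrays would come from tokenize() and getTokenProbabilities().
class LowConfidenceTokens {

	static List<String> below(String[] tokens, double[] probs, double cutoff) {
		if (tokens.length != probs.length) {
			throw new IllegalArgumentException("tokens and probs must have the same length");
		}
		List<String> flagged = new ArrayList<>();
		for (int i = 0; i < tokens.length; i++) {
			if (probs[i] < cutoff) {
				flagged.add(tokens[i]);
			}
		}
		return flagged;
	}

	public static void main(String[] args) {
		// Values copied from the output shown above.
		String[] tokens = {"John", "is", "26", "years", "old", "."};
		double[] probs = {1.0, 1.0, 1.0, 1.0, 0.9954218897531331, 1.0};

		System.out.println(below(tokens, probs, 0.999)); // [old]
	}
}
```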

Getting token positions with TokenizerME in OpenNLP

Sometimes you need not only the token text, but also the start and end character positions of each token in the original sentence. In that case, use tokenizePos(), which returns Span objects.

import opennlp.tools.util.Span;

String sentence = "John is 26 years old.";
Span spans[] = tokenizer.tokenizePos(sentence);

for (Span span : spans) {
    System.out.println(span + " : " + sentence.substring(span.getStart(), span.getEnd()));
}

For the same sentence, the token positions are:

[0..4) : John
[5..7) : is
[8..10) : 26
[11..16) : years
[17..20) : old
[20..21) : .
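
Offsets like these are useful when you need to mark tokens inside the original string. The following plain-Java sketch takes start and end pairs, hard-coded here from the output above rather than read from Span objects, and wraps each token in square brackets. TokenHighlighter and highlight are hypothetical names; with OpenNLP you would fill the pairs from span.getStart() and span.getEnd().

```java
// Wraps each token in square brackets using character offsets.
class TokenHighlighter {

	// Each offsets[i] is {start, end} with an exclusive end, like [start..end).
	// The pairs must be sorted and must not overlap.
	static String highlight(String text, int[][] offsets) {
		StringBuilder out = new StringBuilder();
		int cursor = 0;
		for (int[] span : offsets) {
			out.append(text, cursor, span[0]); // text between tokens
			out.append('[').append(text, span[0], span[1]).append(']');
			cursor = span[1];
		}
		out.append(text, cursor, text.length()); // trailing text, if any
		return out.toString();
	}

	public static void main(String[] args) {
		String sentence = "John is 26 years old.";
		int[][] offsets = {{0, 4}, {5, 7}, {8, 10}, {11, 16}, {17, 20}, {20, 21}};
		System.out.println(highlight(sentence, offsets));
		// [John] [is] [26] [years] [old][.]
	}
}
```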

WhitespaceTokenizer

WhitespaceTokenizer splits the sentence only at whitespace. It does not separate punctuation from a word unless there is a whitespace boundary.

The following example demonstrates the WhitespaceTokenizer of the OpenNLP Tokenizer API.

WhiteSpaceTokenizerExample.java

import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.WhitespaceTokenizer;

/**
 * www.tutorialkart.com
 * Tokenizer Example in Apache OpenNLP using WhitespaceTokenizer
 */
public class WhiteSpaceTokenizerExample {

	public static void main(String[] args) {
		Tokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
		String tokens[] = tokenizer.tokenize("John is 26 years old.");

		System.out.println("Token\n----------------");
		for(int i=0;i<tokens.length;i++){
			System.out.println(tokens[i]);
		}
	}
}

When the above program is run, it prints the following to the console.

Output

Token
----------------
John
is
26
years
old.

Notice that the final token is "old." with the period still attached. This is expected because there is no whitespace between "old" and the period.

SimpleTokenizer

SimpleTokenizer is a rule-based tokenizer. It does not require a trained model, and it can separate punctuation as a different token in simple cases.

The following example demonstrates the SimpleTokenizer of the OpenNLP Tokenizer API.

SimpleTokenizerExample.java

import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.tokenize.Tokenizer;

/**
 * www.tutorialkart.com
 * Tokenizer Example in Apache OpenNLP using SimpleTokenizer
 */
public class SimpleTokenizerExample {

	public static void main(String[] args) {
		Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
		String tokens[] = tokenizer.tokenize("John is 26 years old.");

		System.out.println("Token\n----------------");
		for(int i=0;i<tokens.length;i++){
			System.out.println(tokens[i]);
		}
	}
}

When the above program is run, it prints the following to the console.

Output

Token
----------------
John
is
26
years
old
.

Here, SimpleTokenizer returns the full stop as a separate token. This makes it different from WhitespaceTokenizer for the same input sentence.
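
To give a feel for what "rule-based" can mean here, the sketch below starts a new token whenever the kind of character changes (letter, digit, whitespace, or other). It is a simplified plain-Java approximation written for illustration, not the source of OpenNLP's SimpleTokenizer, and its results may differ from the real class on some inputs. The class name CharClassTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified character-class tokenizer (an approximation, not OpenNLP code).
class CharClassTokenizer {

	private enum Kind { LETTER, DIGIT, SPACE, OTHER }

	private static Kind kindOf(char c) {
		if (Character.isWhitespace(c)) return Kind.SPACE;
		if (Character.isLetter(c)) return Kind.LETTER;
		if (Character.isDigit(c)) return Kind.DIGIT;
		return Kind.OTHER;
	}

	static List<String> tokenize(String text) {
		List<String> tokens = new ArrayList<>();
		int start = -1; // start index of the current token, -1 if none is open
		Kind current = Kind.SPACE;
		for (int i = 0; i < text.length(); i++) {
			Kind kind = kindOf(text.charAt(i));
			// Close the open token when the character kind changes;
			// in this sketch every symbol becomes its own token.
			boolean boundary = kind != current || kind == Kind.OTHER;
			if (boundary && start >= 0) {
				tokens.add(text.substring(start, i));
				start = -1;
			}
			if (kind != Kind.SPACE && start < 0) {
				start = i;
			}
			current = kind;
		}
		if (start >= 0) {
			tokens.add(text.substring(start));
		}
		return tokens;
	}

	public static void main(String[] args) {
		System.out.println(tokenize("John is 26 years old.")); // [John, is, 26, years, old, .]
	}
}
```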

Which OpenNLP tokenizer should you use in a Java project?

Choose the tokenizer based on the amount of accuracy and setup you need:

Use TokenizerME when you are already using OpenNLP models and need model-based token boundaries.
Use WhitespaceTokenizer only when whitespace splitting is enough and punctuation can remain attached to words.
Use SimpleTokenizer when you want a quick tokenizer without loading a model, and rule-based punctuation splitting is acceptable.

For most OpenNLP pipelines, tokenization is an early step before tasks such as part-of-speech tagging, named entity recognition, and sentence-level processing. If later steps expect clean word and punctuation tokens, test your tokenizer output before feeding it into the next component.
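
One way to run such a check is to scan the token array for tokens of more than one character that start or end with a non-alphanumeric character. The following plain-Java sketch does that. The class name, method name, and the rule itself are assumptions for illustration, and the rule would need adjusting for inputs such as abbreviations or quoted text.

```java
import java.util.ArrayList;
import java.util.List;

// Reports tokens that still carry punctuation at their start or end.
class TokenOutputCheck {

	static List<String> withAttachedPunctuation(String[] tokens) {
		List<String> suspicious = new ArrayList<>();
		for (String token : tokens) {
			if (token.length() < 2) {
				continue; // a lone "." is already a clean punctuation token
			}
			char first = token.charAt(0);
			char last = token.charAt(token.length() - 1);
			if (!Character.isLetterOrDigit(first) || !Character.isLetterOrDigit(last)) {
				suspicious.add(token);
			}
		}
		return suspicious;
	}

	public static void main(String[] args) {
		String[] whitespaceTokens = {"John", "is", "26", "years", "old."};
		String[] separatedTokens = {"John", "is", "26", "years", "old", "."};

		System.out.println(withAttachedPunctuation(whitespaceTokens)); // [old.]
		System.out.println(withAttachedPunctuation(separatedTokens));  // []
	}
}
```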

Common OpenNLP tokenizer mistakes to check before running the Java examples

en-token.bin not found: Place the model file in the working directory or provide an absolute file path in FileInputStream.
Wrong tokenizer for punctuation: WhitespaceTokenizer keeps "old." as one token because there is no space before the period.
Calling probabilities on the wrong tokenizer: getTokenProbabilities() is used with TokenizerME, not with the simple or whitespace tokenizers.
Not checking token positions: Use tokenizePos() when you need offsets for highlighting, annotation, or mapping tokens back to the original text.
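
For the first item in the list above, a quick check of the model path before loading can make the cause of a failure obvious. This is a minimal sketch using only java.nio.file; ModelFileCheck and describe are hypothetical names, and the relative path assumes the model sits in the working directory.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Reports whether the tokenizer model file can be read before it is loaded.
class ModelFileCheck {

	// Returns a short message that includes the absolute path that was checked.
	static String describe(Path modelPath) {
		Path absolute = modelPath.toAbsolutePath();
		if (Files.isRegularFile(absolute) && Files.isReadable(absolute)) {
			return "Model file found: " + absolute;
		}
		return "Model file not found or not readable: " + absolute;
	}

	public static void main(String[] args) {
		// A relative path is resolved against the current working directory.
		System.out.println(describe(Paths.get("en-token.bin")));
	}
}
```

Printing the absolute path shows which directory the program actually treats as its working directory, which is often the real source of the problem when running from an IDE.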

Apache OpenNLP tokenizer FAQ

What is a tokenizer in Java?

A tokenizer in Java is a component that splits a string into smaller units such as words, numbers, and punctuation marks. In Apache OpenNLP, the tokenizer classes return these units as a String[] or as token spans.

What is tokenization with an OpenNLP example?

For the input "John is 26 years old.", OpenNLP TokenizerME can return "John", "is", "26", "years", "old", and "." as separate tokens.

How do I use OpenNLP TokenizerME?

Load a pretrained tokenizer model such as en-token.bin into TokenizerModel, create a TokenizerME object, and call tokenize() with the input sentence.

How can I tokenize a given string without an OpenNLP model?

Use WhitespaceTokenizer.INSTANCE for whitespace-only splitting or SimpleTokenizer.INSTANCE for rule-based tokenization. These two options do not require en-token.bin.

Editorial QA checklist for this Apache OpenNLP tokenizer tutorial

The tutorial explains tokenization before showing Java code.
The TokenizerME, WhitespaceTokenizer, and SimpleTokenizer examples use the same input sentence for a fair comparison.
The output blocks clearly show how punctuation is handled by each tokenizer.
The model-based example mentions the required en-token.bin file.
The FAQ answers the common Java tokenizer and OpenNLP usage questions directly.

Summary of Apache OpenNLP tokenizer examples in Java

In this Apache OpenNLP tutorial, we have seen the different ways of tokenizing text that the OpenNLP Tokenizer API provides. TokenizerME uses a pretrained token model, WhitespaceTokenizer splits only on whitespace, and SimpleTokenizer provides rule-based tokenization without a model.
