Extract Text Line by Line from PDF using PDFBox
In this tutorial, we shall learn how to extract text line by line from a PDF document using Apache PDFBox. The examples are written in Java and work for PDF files that contain selectable text.
There are two common ways to read PDF text line by line with PDFTextStripper. The first method is to call getText(), get the full extracted text, and split it by line breaks. The second method is to extend PDFTextStripper and override writeString(), so that each extracted line can be processed as PDFBox reads it.
If your PDF is a scanned document, PDFBox may not return text because the page content is stored as images. For scanned PDFs, run OCR first and then use PDFBox on the searchable PDF.
Method 1 – Extract PDF Text with PDFTextStripper.getText()
Call the getText() method of PDFTextStripper to get the complete text of the document as a single string. Then split that string on line breaks to get the individual lines of the PDF document.
This method is simple and readable. It is suitable when you do not need to process each line immediately and the PDF document is small enough to read into memory as one text string.
// document is an already loaded PDDocument
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
String[] lines = text.split("\\r?\\n");
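The split step can be tried without a PDF. The sketch below uses a hard-coded string as a stand-in for the value returned by getText(); the class name LineSplitDemo, the method names, and the sample text are made up for illustration. One detail worth knowing is that String.split() drops trailing empty strings, so blank lines at the very end of the text disappear unless a negative limit is passed.

```java
import java.util.Arrays;
import java.util.List;

class LineSplitDemo {

    // Splits text on Unix (\n) or Windows (\r\n) line breaks.
    // String.split() discards trailing empty strings by default.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\r?\\n"));
    }

    // Same split, but a negative limit keeps trailing empty strings.
    static List<String> splitLinesKeepTrailing(String text) {
        return Arrays.asList(text.split("\\r?\\n", -1));
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for stripper.getText(document)
        String text = "First line\r\nSecond line\n\nFourth line\n";
        for (String line : splitLines(text)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

The brackets in the printed output make the empty third line visible.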
With this approach, no line is available until PDFBox has read the whole document and extracted all of its text, and the program has split the result into lines.
If you would like to process the line as soon as it is fetched, the following method is a better option.
Method 2 – Extract PDF Lines with PDFTextStripper.writeString()
The class org.apache.pdfbox.text.PDFTextStripper extracts text from a PDF document. To read the text line by line, extend this class and override writeString(String str, List<TextPosition> textPositions), which PDFBox calls during extraction with each piece of text it has assembled.
The first argument to writeString() is the extracted text fragment for that call. In many ordinary text PDFs, this corresponds to one line of text, although that is not guaranteed for every layout. The second argument contains TextPosition objects, which can be used when you need character positions, coordinates, or layout-aware parsing.
Steps to Extract Text Line by Line from PDF in Java
The following steps extract text line by line from a PDF.
1. Extend PDFTextStripper for Line-by-Line PDF Reading
Create a Java class that extends PDFTextStripper. The custom class lets you override text extraction methods and store each extracted line in a list.
public class GetLinesFromPDF extends PDFTextStripper {
. . .
}
The complete example below uses the same class name, GetLinesFromPDF, because the program collects PDF text lines rather than individual words.
2. Call writeText Method to Process PDF Pages
Set page boundaries from the first page to the last page, and call the method writeText(). This starts text extraction and internally calls writeString() for the extracted text fragments.
PDFTextStripper stripper = new GetLinesFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 ); // page numbers are 1-based
stripper.setEndPage( document.getNumberOfPages() );
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
The writer is only a dummy writer in this approach. The extracted lines are collected inside the overridden writeString() method instead of being written directly to a text file.
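If the written output is never read, a discarding writer avoids accumulating bytes in a ByteArrayOutputStream. A minimal sketch, assuming Java 11 or later for Writer.nullWriter(); the class name DiscardingWriterDemo and the method discardingWriter() are made up for illustration, and the write call stands in for whatever the stripper sends to the writer.

```java
import java.io.IOException;
import java.io.Writer;

class DiscardingWriterDemo {

    // Returns a writer that accepts characters and throws them away.
    static Writer discardingWriter() {
        return Writer.nullWriter();
    }

    public static void main(String[] args) throws IOException {
        try (Writer dummy = discardingWriter()) {
            // Stand-in for the output a text stripper would produce
            dummy.write("anything written here is dropped");
            dummy.flush();
        }
        System.out.println("done");
    }
}
```

With PDFBox, such a writer could be passed as the second argument of writeText() in place of the OutputStreamWriter.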
3. Override writeString to Capture Each Extracted PDF Line
writeString() receives extracted text as the first argument, which is what we need for line-by-line processing.
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
. . .
}
Inside this method, you can add the line to a list, print it, write it to a file, or apply custom validation logic immediately.
Example 1 – Extract Text Line by Line from PDF using Apache PDFBox
The following Java program loads a PDF file, reads text from all pages, stores each extracted line in a list, and prints the lines to the console.
GetLinesFromPDF.java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
/**
 * This is an example of how to extract text line by line from a PDF document.
 * It uses the PDFBox 2.x API; PDFBox 3.x loads documents with Loader.loadPDF(File).
 */
public class GetLinesFromPDF extends PDFTextStripper {
    // collects every string passed to writeString()
    static List<String> lines = new ArrayList<String>();
    public GetLinesFromPDF() throws IOException {
    }
    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        PDDocument document = null;
        String fileName = "apache.pdf";
        try {
            document = PDDocument.load( new File(fileName) );
            PDFTextStripper stripper = new GetLinesFromPDF();
            stripper.setSortByPosition( true );
            // page numbers are 1-based
            stripper.setStartPage( 1 );
            stripper.setEndPage( document.getNumberOfPages() );
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
            // print lines
            for(String line:lines){
                System.out.println(line);
            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }
    }
    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        lines.add(str);
        // you may process the line here itself, as and when it is obtained
    }
}
Output
2017-8-6
Welcome to The Apache Software Foundation!
Custom Search
The Apache Way (/foundation/governance/)
(http://apache.org/foundation/contributing.html)
Contribute (https://community.apache.org/contributors/)
ASF Sponsors (/foundation/thanks.html)
OPEN.
THE APACHE SOFTWARE FOUNDATION
To run the example with the same input, download apache.pdf. Otherwise, set fileName in the Java program to the path of your own PDF file.
Clean Empty Lines while Extracting PDF Text Line by Line
Some PDF files may produce blank lines or lines that contain only spaces. If you do not want those lines in the result, trim the extracted string before adding it to the list.
@Override
protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
    String line = str.trim();
    if (!line.isEmpty()) {
        lines.add(line);
    }
}
This small check is useful when the extracted text is used for indexing, search, comparison, or further text processing.
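String.trim() only removes characters up to U+0020, so a line made of non-breaking spaces (U+00A0) would still count as non-empty. Whether your PDFs produce such lines is an assumption to verify against real files. The helper below (BlankLineCheck and isVisuallyBlank are hypothetical names) is a sketch of a stricter check that uses only the JDK.

```java
class BlankLineCheck {

    // True when every code point is whitespace (tab, newline, ...) or a
    // Unicode space separator such as the non-breaking space U+00A0.
    // An empty string also counts as blank.
    static boolean isVisuallyBlank(String line) {
        return line.codePoints()
                .allMatch(cp -> Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        System.out.println(isVisuallyBlank("\u00A0 \t"));   // true
        System.out.println(isVisuallyBlank("  text  "));    // false
        System.out.println("\u00A0".trim().isEmpty());      // false: trim() keeps U+00A0
    }
}
```

Inside writeString(), such a helper could replace the trim() and isEmpty() pair when non-breaking spaces turn out to be a problem.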
Choosing getText or writeString for PDF Line Extraction
| Requirement | Use getText() | Use writeString() |
|---|---|---|
| Simple text extraction | Suitable | Suitable |
| Process the full PDF text after extraction | Suitable | Possible, but not necessary |
| Process each extracted line immediately | Not ideal | Suitable |
| Need access to TextPosition details | Not suitable | Suitable |
| Custom handling for lines, words, or coordinates | Limited | Better choice |
For a quick text dump, getText() is usually enough. For line-by-line parsing, filtering, or layout-aware processing, overriding writeString() gives better control.
Why Extracted PDF Lines May Not Match the Visible PDF Layout
A PDF file does not always store text in the same order in which a reader visually sees it. The document may contain positioned text fragments, multi-column layouts, tables, headers, footers, or custom font encodings. Because of this, extracted lines may appear in a different order or may be split differently from the displayed page.
- Use setSortByPosition(true) to improve visual ordering for many documents.
- Test with the actual PDF layout used in your application.
- Use TextPosition data if you need coordinates or layout-aware grouping.
- Run OCR first if the PDF contains scanned page images instead of selectable text.
- Handle tables and multi-column documents separately if row and column structure matters.
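Coordinate-based grouping can be sketched without PDFBox. In the example below, Fragment is a hypothetical stand-in for a piece of text plus the x and y values you might read from TextPosition (for instance via getXDirAdj() and getYDirAdj()). The tolerance of 2.0 units is an arbitrary assumption, not a PDFBox default, and the code assumes that y grows downward on the page. Fragments whose y value is within the tolerance of the first fragment in a line are treated as part of that line and joined left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineGrouper {

    // Hypothetical stand-in for text plus coordinates taken from TextPosition.
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Sorts fragments top to bottom, starts a new line when y differs from
    // the current line's first fragment by more than the tolerance, then
    // orders each line left to right and joins the texts with spaces.
    static List<String> groupIntoLines(List<Fragment> fragments, float tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last == null || Math.abs(f.y - last.get(0).y) > tolerance) {
                last = new ArrayList<>();
                rows.add(last);
            }
            last.add(f);
        }

        List<String> lines = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : row) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("World", 80f, 100.5f));
        fragments.add(new Fragment("Second", 10f, 120f));
        fragments.add(new Fragment("Hello", 10f, 100f));
        fragments.add(new Fragment("line", 60f, 120f));
        for (String line : groupIntoLines(fragments, 2.0f)) {
            System.out.println(line);
        }
    }
}
```

Real documents usually need a tolerance tuned to the font size, and multi-column pages need the columns separated before this kind of grouping is applied.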
PDFBox Line Extraction for OCR and Scanned PDFs
PDFBox extracts text that already exists inside the PDF. It does not automatically read text from image-only scanned pages. If a PDF was created from scanned images, first use OCR software to create searchable text, and then run this PDFBox line extraction code on the resulting PDF.
A quick check is to open the PDF in a PDF reader and try selecting the text with the mouse. If you cannot select the text, the file probably needs OCR before Java code can extract readable lines.
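A similar check can be approximated in code by looking at what the extraction returned. The sketch assumes you already have the extracted text of each page as a string, for example from getText() with the start and end page set to the same number. The names ScannedPageCheck and pagesWithoutText, and the rule that a page with no letters or digits is probably image-only, are illustrative assumptions and not a PDFBox feature.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ScannedPageCheck {

    // Returns the 1-based numbers of pages whose extracted text contains
    // no letter or digit at all.
    static List<Integer> pagesWithoutText(List<String> pageTexts) {
        List<Integer> suspects = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            boolean hasText = pageTexts.get(i).codePoints()
                    .anyMatch(Character::isLetterOrDigit);
            if (!hasText) {
                suspects.add(i + 1);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Hypothetical per-page extraction results
        List<String> pageTexts = Arrays.asList("Welcome to Apache", " \n", "");
        System.out.println(pagesWithoutText(pageTexts)); // [2, 3]
    }
}
```

A page flagged this way may also simply be blank, so treat the result as a hint to inspect the file rather than as proof that OCR is needed.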
FAQs on Extracting Text Line by Line from PDF using PDFBox
How do I extract a line from a PDF using PDFBox?
Extend PDFTextStripper and override writeString(). PDFBox calls this method while extracting text, and you can store or process the received string as a line.
Can I extract just the text from a PDF without line processing?
Yes. Use PDFTextStripper.getText(document) when you only need the complete text content. Then split the returned string by newline if you need a simple line array.
Why does PDFBox return lines in a different order?
PDF text is often stored as positioned drawing instructions, not as normal paragraphs. Use setSortByPosition(true) to improve ordering, but complex layouts may still require custom logic.
Can PDFBox extract lines from scanned PDFs?
PDFBox can extract embedded text, but it does not perform OCR by itself. For scanned PDFs, convert the file into a searchable PDF using OCR, and then extract lines with PDFBox.
Can I extract text coordinates while reading PDF lines?
Yes. The writeString() method receives a List<TextPosition>. You can use those objects to inspect character positions and build coordinate-based parsing logic.
QA Checklist for PDFBox Line-by-Line Text Extraction
- Check that the sample PDF contains selectable text and is not image-only.
- Confirm that PDDocument is closed after extraction.
- Verify that the line order is acceptable with setSortByPosition(true).
- Test whether empty lines should be preserved or skipped for your use case.
- Check complex PDFs with tables, headers, footers, and multi-column text separately.
- Use OCR before PDFBox extraction when the source PDF is scanned.
Conclusion: PDFBox Text Line Extraction in Java
In this PDFBox tutorial, we have learnt to extract text line by line from a PDF using PDFTextStripper.getText() and by overriding PDFTextStripper.writeString(). You may also refer to the tutorial on extracting words from a PDF document when you need word-level text processing.
TutorialKart.com