How to extract text line by line from PDF using PDFBox?

Extract Text Line by Line from PDF using PDFBox

In this tutorial, we shall learn how to extract text line by line from a PDF document using Apache PDFBox. The examples are written in Java and work for PDF files that contain selectable text.

There are two common ways to read PDF text line by line with PDFTextStripper. The first method is to call getText(), get the full extracted text, and split it by line breaks. The second method is to extend PDFTextStripper and override writeString(), so that each extracted line can be processed as PDFBox reads it.

If your PDF is a scanned document, PDFBox may not return text because the page content is stored as images. For scanned PDFs, run OCR first and then use PDFBox on the searchable PDF.

Method 1 – Extract PDF Text with PDFTextStripper.getText()

You can call the getText() method of PDFTextStripper, the same method used for plain text extraction from a PDF. It returns the complete text as a single string, which you can then split on line breaks to get the lines of the PDF document.

This method is simple and readable. It is suitable when you do not need to process each line immediately and the PDF document is small enough to read into memory as one text string.

String text = stripper.getText(document);
String[] lines = text.split("\\r?\\n");

With this approach, the program must read the whole document and extract all of its text before the first line becomes available.
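As a hedged illustration of the splitting step, the sketch below uses plain Java with no PDFBox dependency. The sample string stands in for the value returned by getText(), and the names SplitExtractedText and splitLines are made up for this example. It uses the \R linebreak matcher (available since Java 8), which treats \r\n, \n, and a lone \r as one line break each.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
public class SplitExtractedText {

    // Splits text on any line break. String.split drops trailing empty strings,
    // so a final newline does not produce an extra empty entry.
    static List<String> splitLines(String text) {
        return Arrays.asList(text.split("\\R"));
    }

    public static void main(String[] args) {
        // Stand-in for: String text = stripper.getText(document);
        String text = "First line\r\nSecond line\nThird line\n";
        List<String> lines = splitLines(text);
        for (int i = 0; i < lines.size(); i++) {
            System.out.println((i + 1) + ": " + lines.get(i));
        }
    }
}
```

Blank lines in the middle of the text are kept as empty strings, so they can still be filtered or preserved later as needed.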

If you would like to process each line as soon as it is extracted, Method 2 is a better option.

Method 2 – Extract PDF Lines with PDFTextStripper.writeString()

The class org.apache.pdfbox.text.PDFTextStripper strips text out of a PDF document. To extract text line by line, extend this class and override its writeString(String str, List<TextPosition> textPositions) method. This lets you intercept the text extraction flow and process each line as it is produced.

The first argument to writeString() is the extracted text fragment for that call. In many normal text PDFs, this behaves like a line of text. The second argument contains TextPosition objects, which can be used when you need character positions, coordinates, or layout-aware parsing.

Steps to Extract Text Line by Line from PDF in Java

The following steps show how to extract text line by line from a PDF.

1. Extend PDFTextStripper for Line-by-Line PDF Reading

Create a Java class that extends PDFTextStripper. The custom class lets you override text extraction methods and store each extracted line in a list.

public class GetLinesFromPDF extends PDFTextStripper {
  . . .
}

The class is named GetLinesFromPDF because the program collects PDF text lines rather than individual words. The complete example below uses the same name.

2. Call writeText Method to Process PDF Pages

Set the page range from the first page to the last page, and call writeText(). Page numbers in PDFTextStripper are 1-based, so the first page is 1. The call starts text extraction and internally invokes writeString() for the extracted text fragments.

PDFTextStripper stripper = new GetLinesFromPDF();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 );
stripper.setEndPage( document.getNumberOfPages() );

Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);

The writer is only a dummy writer in this approach. The extracted lines are collected inside the overridden writeString() method instead of being written directly to a text file.
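Since nothing reads the dummy writer, the buffer behind it only accumulates whatever PDFBox still writes to it, for example separators between fragments. One possible alternative, sketched here in plain Java, is a Writer that discards its input. The class name DiscardingWriter and its counter are made up for this example; the counter exists only so the behaviour can be observed. On Java 11 and later, Writer.nullWriter() offers a ready-made equivalent.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper: a Writer that drops everything written to it.
public class DiscardingWriter extends Writer {

    private long discardedCount = 0;

    @Override
    public void write(char[] buffer, int offset, int length) {
        // The characters are not stored anywhere; only their number is counted.
        discardedCount += length;
    }

    @Override
    public void flush() {
        // Nothing is buffered, so there is nothing to flush.
    }

    @Override
    public void close() {
        // No underlying resource to release.
    }

    public long getDiscardedCount() {
        return discardedCount;
    }

    public static void main(String[] args) throws IOException {
        DiscardingWriter dummy = new DiscardingWriter();
        dummy.write("this text is discarded");
        dummy.close();
        System.out.println("characters discarded: " + dummy.getDiscardedCount());
    }
}
```

Under these assumptions, an instance could be passed as the second argument of writeText() in place of the OutputStreamWriter.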

3. Override writeString to Capture Each Extracted PDF Line

writeString() receives extracted text as the first argument, which is what we need for line-by-line processing.

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    . . .
}

Inside this method, you can add the line to a list, print it, write it to a file, or apply custom validation logic immediately.

Example 1 – Extract Text Line by Line from PDF using Apache PDFBox

The following Java program loads a PDF file, reads text from all pages, stores each extracted line in a list, and prints the lines to the console.

GetLinesFromPDF.java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
 
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
 
/**
 * Example that extracts text line by line from a PDF document.
 * Written against the PDFBox 2.x API (PDDocument.load). PDFBox 3.x loads
 * documents with org.apache.pdfbox.Loader.loadPDF(File) instead.
 */
public class GetLinesFromPDF extends PDFTextStripper {

    // Lines collected by writeString() while the document is processed
    private final List<String> lines = new ArrayList<String>();

    public GetLinesFromPDF() throws IOException {
    }

    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException {
        String fileName = "apache.pdf";
        // try-with-resources closes the document even if extraction fails
        try ( PDDocument document = PDDocument.load( new File(fileName) ) ) {
            GetLinesFromPDF stripper = new GetLinesFromPDF();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );

            // the writer's output is ignored; lines are captured in writeString()
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);

            // print lines
            for ( String line : stripper.lines ) {
                System.out.println(line);
            }
        }
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        lines.add(str);
        // you may process the line right here, as soon as it is obtained
    }
}

Output

2017-8-6
Welcome to The Apache Software Foundation!
Custom Search
The Apache Way (/foundation/governance/)
 (http://apache.org/foundation/contributing.html)
Contribute (https://community.apache.org/contributors/)
ASF Sponsors (/foundation/thanks.html)
OPEN.
THE APACHE SOFTWARE FOUNDATION

To run the example with the same input, use the sample document apache.pdf. Otherwise, set fileName in the Java program to the path of your own PDF file.

Clean Empty Lines while Extracting PDF Text Line by Line

Some PDF files may produce blank lines or lines that contain only spaces. If you do not want those lines in the result, trim the extracted string before adding it to the list.

@Override
protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
    String line = str.trim();
    if (!line.isEmpty()) {
        lines.add(line);
    }
}

This small check is useful when the extracted text is used for indexing, search, comparison, or further text processing.
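One caveat with trim() is that it only strips characters up to U+0020, so a fragment made of non-breaking spaces (U+00A0) would not count as empty. The plain-Java sketch below shows one way to normalise such whitespace before the emptiness check. The names LineCleaner and cleanLine are invented for this example, and the string literals stand in for fragments passed to writeString().

```java
// Hypothetical helper for illustration; not part of PDFBox.
public class LineCleaner {

    // Replaces non-breaking spaces, collapses whitespace runs into one space,
    // and trims both ends.
    static String cleanLine(String raw) {
        return raw.replace('\u00A0', ' ')
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    public static void main(String[] args) {
        // Stand-ins for the strings received by writeString()
        String[] fragments = { "  Welcome to  PDFBox ", "\u00A0\u00A0", "\t", "OPEN." };
        for (String fragment : fragments) {
            String line = cleanLine(fragment);
            if (!line.isEmpty()) {
                System.out.println("[" + line + "]");
            }
        }
    }
}
```

Collapsing inner whitespace changes the text, so this variant suits indexing or comparison better than cases where the original spacing must be preserved.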

Choosing getText or writeString for PDF Line Extraction

Requirement	Use getText()	Use writeString()
Simple text extraction	Suitable	Suitable
Process the full PDF text after extraction	Suitable	Possible, but not necessary
Process each extracted line immediately	Not ideal	Suitable
Need access to TextPosition details	Not suitable	Suitable
Custom handling for lines, words, or coordinates	Limited	Better choice

For a quick text dump, getText() is usually enough. For line-by-line parsing, filtering, or layout-aware processing, overriding writeString() gives better control.

Why Extracted PDF Lines May Not Match the Visible PDF Layout

A PDF file does not always store text in the same order in which a reader visually sees it. The document may contain positioned text fragments, multi-column layouts, tables, headers, footers, or custom font encodings. Because of this, extracted lines may appear in a different order or may be split differently from the displayed page.

Use setSortByPosition(true) to improve visual ordering for many documents.
Test with the actual PDF layout used in your application.
Use TextPosition data if you need coordinates or layout-aware grouping.
Run OCR first if the PDF contains scanned page images instead of selectable text.
Handle tables and multi-column documents separately if row and column structure matters.
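To make the idea of layout-aware grouping more concrete, here is a minimal plain-Java sketch that rebuilds lines from positioned fragments. The Fragment class is a stand-in defined only for this example, not a PDFBox type. In real code its values could come from TextPosition, for example through getXDirAdj() and getYDirAdj(). The tolerance value and the assumption that a smaller y means higher on the page are choices made for this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GroupByBaseline {

    // Stand-in for the data a TextPosition would provide; not a PDFBox class.
    static final class Fragment {
        final String text;
        final float x;
        final float y;

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Puts fragments whose y differs from the first fragment of the current
    // line by at most `tolerance` on the same line. Lines are ordered by y,
    // fragments within a line by x.
    static List<String> groupIntoLines(List<Fragment> fragments, float tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : sorted) {
            if (!current.isEmpty() && Math.abs(f.y - current.get(0).y) > tolerance) {
                lines.add(joinLeftToRight(current));
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinLeftToRight(current));
        }
        return lines;
    }

    // Sorts one line by x and joins the fragment texts with single spaces.
    static String joinLeftToRight(List<Fragment> line) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Foundation", 120f, 100.4f));
        fragments.add(new Fragment("Apache", 60f, 100.0f));
        fragments.add(new Fragment("The", 20f, 100.2f));
        fragments.add(new Fragment("OPEN.", 20f, 130.0f));
        for (String line : groupIntoLines(fragments, 2f)) {
            System.out.println(line);
        }
    }
}
```

A single tolerance is a simplification. Multi-column pages and tables would also need the x coordinate to decide where one column or cell ends.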

PDFBox Line Extraction for OCR and Scanned PDFs

PDFBox extracts text that already exists inside the PDF. It does not automatically read text from image-only scanned pages. If a PDF was created from scanned images, first use OCR software to create searchable text, and then run this PDFBox line extraction code on the resulting PDF.

A quick check is to open the PDF in a PDF reader and try selecting the text with the mouse. If you cannot select the text, the file probably needs OCR before Java code can extract readable lines.
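A similar check can be approximated in code: if extraction returns no letters or digits, the file is likely image-only. The sketch below is plain Java. The string argument stands in for the result of getText(), and the names ExtractableTextCheck and hasExtractableText are hypothetical. It is only a heuristic, since a scanned PDF can still carry a little real text, such as a stamp or a page number.

```java
// Hypothetical helper for illustration; not part of PDFBox.
public class ExtractableTextCheck {

    // Returns true if the text contains at least one letter or digit.
    static boolean hasExtractableText(String extractedText) {
        if (extractedText == null) {
            return false;
        }
        for (int i = 0; i < extractedText.length(); ) {
            int codePoint = extractedText.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-ins for: stripper.getText(document)
        String fromTextPdf = "Welcome to The Apache Software Foundation!\n";
        String fromScannedPdf = "\r\n \r\n";
        System.out.println("text PDF has text: " + hasExtractableText(fromTextPdf));
        System.out.println("scanned PDF has text: " + hasExtractableText(fromScannedPdf));
    }
}
```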

FAQs on Extracting Text Line by Line from PDF using PDFBox

How do I extract a line from a PDF using PDFBox?

Extend PDFTextStripper and override writeString(). PDFBox calls this method while extracting text, and you can store or process the received string as a line.

Can I extract just the text from a PDF without line processing?

Yes. Use PDFTextStripper.getText(document) when you only need the complete text content. Then split the returned string by newline if you need a simple line array.

Why does PDFBox return lines in a different order?

PDF text is often stored as positioned drawing instructions, not as normal paragraphs. Use setSortByPosition(true) to improve ordering, but complex layouts may still require custom logic.

Can PDFBox extract lines from scanned PDFs?

PDFBox can extract embedded text, but it does not perform OCR by itself. For scanned PDFs, convert the file into a searchable PDF using OCR, and then extract lines with PDFBox.

Can I extract text coordinates while reading PDF lines?

Yes. The writeString() method receives a List<TextPosition>. You can use those objects to inspect character positions and build coordinate-based parsing logic.

QA Checklist for PDFBox Line-by-Line Text Extraction

Check that the sample PDF contains selectable text and is not image-only.
Confirm that PDDocument is closed after extraction.
Verify that the line order is acceptable with setSortByPosition(true).
Test whether empty lines should be preserved or skipped for your use case.
Check complex PDFs with tables, headers, footers, and multi-column text separately.
Use OCR before PDFBox extraction when the source PDF is scanned.

Conclusion: PDFBox Text Line Extraction in Java

In this PDFBox tutorial, we learned how to extract text line by line from a PDF using PDFTextStripper.getText() and by overriding PDFTextStripper.writeString(). When you need word-level text processing, refer to the tutorial on extracting words from a PDF document.

TutorialKart.com
