Apache PDFBox Tutorial - Learn to create, edit and process PDFs

Learn to create, edit, and process PDFs in Java with this Apache PDFBox tutorial. This guide explains what Apache PDFBox is used for, how to set up a Java project, and which PDF operations you can learn step by step.

Apache PDFBox Tutorial for Java PDF Processing

What Apache PDFBox Is Used For

Apache PDFBox is an open source Java library from the Apache Software Foundation for working with PDF documents. You can use it to create new PDF files, read text from existing PDFs, extract images, split and merge documents, work with forms, render pages as images, print PDFs, and apply digital signatures.

PDFBox is useful when a Java application needs programmatic PDF processing instead of manual editing. For example, a backend service may generate reports, an admin tool may merge uploaded files, or a document-processing application may extract text and image metadata from many PDFs.

The official Apache PDFBox project also provides command-line utilities, so some operations can be tested from the terminal before they are implemented inside a Java application.

Set up a Java project with the PDFBox libraries to start working with PDF files.

Apache PDFBox Setup Before Running Java Examples

Before writing PDFBox code, create a Java project and add the PDFBox library dependency. In Maven or Gradle projects, prefer using the dependency coordinates recommended by the official PDFBox getting started page for the version you plan to use. In a simple Java project, add the required PDFBox JAR files and their dependencies to the classpath.

For most examples, you will work with classes such as PDDocument, PDPage, PDPageContentStream, and PDFTextStripper. Always close documents and content streams after use. The simplest way is to wrap them in Java try-with-resources blocks.

try (PDDocument document = new PDDocument()) {
    // Create, edit, read, or save the PDF here.
}

Apache PDFBox Features Covered in This Tutorial

The following sections group the main PDFBox operations by task. Use them as a learning path if you are new to Apache PDFBox, or as a reference when you need a specific PDF operation in a Java project.

Extract Text, Words, Coordinates, and Images from PDF Files

Text extraction is one of the most common uses of Apache PDFBox. You can extract plain text, read text line by line, inspect words, and find the coordinates of characters. PDFBox can also help identify and extract images embedded inside a PDF document.

A basic text extraction program usually opens the PDF, passes the document to PDFTextStripper, and reads the returned text.

import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractPdfTextExample {
    public static void main(String[] args) throws IOException {
        File file = new File("sample.pdf");

        // PDDocument.load is the PDFBox 2.x API; PDFBox 3.x replaces it with Loader.loadPDF(file).
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println(text);
        }
    }
}

Text extraction depends on how the PDF stores text internally. A scanned PDF that contains only images usually needs OCR before useful text can be extracted.
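
The character coordinates mentioned earlier in this section can be read by subclassing PDFTextStripper. The following is a minimal sketch, assuming PDFBox 2.x to match the other examples on this page. The class name CharacterPositionStripper, the helper collectPositions, and the file name sample.pdf are illustrative choices, not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Hypothetical class: records the position of every extracted character.
public class CharacterPositionStripper extends PDFTextStripper {

    // One entry per character, in the form "<char> x=<x> y=<y>".
    private final List<String> positions = new ArrayList<>();

    public CharacterPositionStripper() throws IOException {
        super();
    }

    // PDFBox calls this method for each run of text it extracts.
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        for (TextPosition position : textPositions) {
            positions.add(position.getUnicode()
                    + " x=" + position.getXDirAdj()
                    + " y=" + position.getYDirAdj());
        }
        super.writeString(text, textPositions);
    }

    // Runs a normal extraction pass and returns the recorded positions.
    public static List<String> collectPositions(PDDocument document) throws IOException {
        CharacterPositionStripper stripper = new CharacterPositionStripper();
        stripper.getText(document);
        return stripper.positions;
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is an assumed input file.
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            collectPositions(document).forEach(System.out::println);
        }
    }
}
```

The getXDirAdj and getYDirAdj values are adjusted for text direction, so compare them against a known PDF before relying on them for layout decisions.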

Split and Merge PDF Documents with Apache PDFBox

PDFBox can divide a large PDF into multiple files or combine several PDFs into one document. These operations are common in document upload systems, report generation workflows, and file management tools.
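
As a rough sketch of both operations, assuming PDFBox 2.x, the Splitter and PDFMergerUtility classes from the org.apache.pdfbox.multipdf package can be used as shown here. The class name SplitAndMergeExample, its helper methods, and all file names are illustrative assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public class SplitAndMergeExample {

    // Splits a document into single-page documents (the Splitter default).
    // The caller must save and close each part before closing the source.
    public static List<PDDocument> splitIntoPages(PDDocument document) throws IOException {
        Splitter splitter = new Splitter();
        return splitter.split(document);
    }

    // Merges the source files, in order, into one destination file.
    public static void merge(List<File> sources, File destination) throws IOException {
        PDFMergerUtility merger = new PDFMergerUtility();
        for (File source : sources) {
            merger.addSource(source);
        }
        merger.setDestinationFileName(destination.getPath());
        merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is an assumed input with at least two pages.
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            int pageNumber = 1;
            for (PDDocument part : splitIntoPages(document)) {
                try (PDDocument page = part) {
                    page.save("page-" + pageNumber + ".pdf");
                }
                pageNumber++;
            }
        }
        merge(Arrays.asList(new File("page-1.pdf"), new File("page-2.pdf")),
                new File("merged.pdf"));
    }
}
```

The split parts share objects with the source document, which is why the sketch saves each part while the source is still open.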

Create New PDF Files and Write Text in Java

Apache PDFBox can create PDF documents from scratch. A typical program creates a PDDocument, adds one or more PDPage objects, opens a PDPageContentStream, writes text or draws content, and saves the file.

Create a new PDF file and write text to it.

import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class CreatePdfExample {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);

            try (PDPageContentStream contentStream = new PDPageContentStream(document, page)) {
                contentStream.beginText();
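                // PDType1Font.HELVETICA is the PDFBox 2.x constant; PDFBox 3.x uses
                // new PDType1Font(Standard14Fonts.FontName.HELVETICA) instead.
                contentStream.setFont(PDType1Font.HELVETICA, 14);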
                contentStream.newLineAtOffset(50, 700);
                contentStream.showText("Hello from Apache PDFBox");
                contentStream.endText();
            }

            document.save("created-with-pdfbox.pdf");
        }
    }
}

Fill PDF Forms Using Apache PDFBox

PDF forms, also known as AcroForms, contain fields such as text boxes, check boxes, radio buttons, and dropdowns. PDFBox can read form fields and set values programmatically when the PDF contains fillable form fields.

Extract data from a PDF form.
Fill a PDF form.

When working with forms, first inspect the field names in the PDF. Field names are not always the same as the visible labels shown to a user.
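
One possible way to inspect field names and then set a value is sketched here, assuming PDFBox 2.x. The class name FillFormExample, its helper methods, the file names, and the field name "customerName" are hypothetical and should be replaced with values from your own PDF.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

public class FillFormExample {

    // Returns the fully qualified name of every field, or an empty list
    // when the PDF has no AcroForm.
    public static List<String> listFieldNames(PDDocument document) {
        List<String> names = new ArrayList<>();
        PDAcroForm form = document.getDocumentCatalog().getAcroForm();
        if (form != null) {
            for (PDField field : form.getFieldTree()) {
                names.add(field.getFullyQualifiedName());
            }
        }
        return names;
    }

    // Sets a field value and reports whether the field was found.
    public static boolean setFieldValue(PDDocument document, String fieldName, String value)
            throws IOException {
        PDAcroForm form = document.getDocumentCatalog().getAcroForm();
        if (form == null) {
            return false;
        }
        PDField field = form.getField(fieldName);
        if (field == null) {
            return false;
        }
        field.setValue(value);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // "form.pdf" and "customerName" are assumed names for illustration.
        try (PDDocument document = PDDocument.load(new File("form.pdf"))) {
            listFieldNames(document).forEach(System.out::println);
            if (setFieldValue(document, "customerName", "Jane Doe")) {
                document.save("form-filled.pdf");
            }
        }
    }
}
```

Printing the field names first makes it easier to spot cases where the internal name differs from the label shown on the page.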

Print PDF Files from a Java Application

PDFBox supports printing PDF documents programmatically through Java printing APIs. This is useful for desktop applications, internal document tools, and controlled print workflows where a PDF must be sent to a printer from code.

Print a PDF file programmatically from Java code.
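
A minimal printing sketch, assuming PDFBox 2.x and a default printer configured on the machine, could look like this. The class name PrintPdfExample, the helper toPageable, and the file name sample.pdf are illustrative assumptions.

```java
import java.awt.print.PrinterException;
import java.awt.print.PrinterJob;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.printing.PDFPageable;

public class PrintPdfExample {

    // Wraps the document so the Java printing API can page through it.
    public static PDFPageable toPageable(PDDocument document) {
        return new PDFPageable(document);
    }

    public static void main(String[] args) throws IOException, PrinterException {
        // "sample.pdf" is an assumed input file.
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            PrinterJob job = PrinterJob.getPrinterJob();
            job.setPageable(toPageable(document));
            // Sends the job to the default printer without showing a dialog.
            job.print();
        }
    }
}
```

In a desktop application, calling job.printDialog() before job.print() lets the user choose a printer instead of using the default one.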

Render PDF Pages and Save Them as Images

PDFBox can render pages as images. This is useful for previews, thumbnails, page snapshots, and visual checks in document-processing workflows.

Save the pages of a PDF file as images.
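
Rendering can be sketched with PDFRenderer and the standard ImageIO class, assuming PDFBox 2.x. The class name RenderPagesExample, the helper renderPage, the 150 DPI value, and the file names are illustrative choices.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

public class RenderPagesExample {

    // Renders one page (zero-based index) to an RGB image at the given DPI.
    public static BufferedImage renderPage(PDDocument document, int pageIndex, float dpi)
            throws IOException {
        PDFRenderer renderer = new PDFRenderer(document);
        return renderer.renderImageWithDPI(pageIndex, dpi, ImageType.RGB);
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is an assumed input file; 150 DPI is an arbitrary choice.
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                BufferedImage image = renderPage(document, i, 150f);
                ImageIO.write(image, "png", new File("page-" + (i + 1) + ".png"));
            }
        }
    }
}
```

Higher DPI values give sharper images but use more memory per page, so thumbnails usually need far less than full-page previews.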

Digitally Sign PDF Files with Apache PDFBox

Apache PDFBox can be used in PDF signing workflows where a document needs a digital signature. Digital signing requires more than placing an image of a signature on a page. It usually involves certificates, private keys, signature dictionaries, and validation requirements based on the application or organization.

Digitally sign a PDF file.

Apache PDFBox Compared with Manual PDF Editors and iText

Apache PDFBox is a Java library for developers. It is different from a manual PDF editor, where a user opens a PDF and edits it visually. Use PDFBox when a Java program must create, read, convert, split, merge, or update PDFs automatically.

PDFBox and iText are both Java PDF libraries, but they are often chosen for different project needs. PDFBox is commonly used when an open source Apache-licensed library is preferred for PDF extraction, manipulation, and general processing. iText is also widely used for PDF generation and advanced PDF workflows, but its licensing model should be reviewed carefully before use in commercial applications.

Common Apache PDFBox Errors and Practical Checks

Blank extracted text: Check whether the PDF is scanned or image-only. PDFBox reads embedded text, not text inside images.
File locked or not saved: Confirm that PDDocument and PDPageContentStream are closed properly.
Text appears in the wrong position: PDF coordinates start from the bottom-left of the page, so verify the x and y values used in the content stream.
Font-related issues: Use fonts supported by your PDFBox version and embed fonts when the output must look the same on different systems.
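Large memory usage: PDFBox keeps document data in main memory by default. For large PDFs in PDFBox 2.x, consider passing a MemoryUsageSetting that buffers to temporary files when loading, and test with realistic files before using the code in production.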
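
The bottom-left origin noted in the list above is a frequent source of misplaced text. A small helper can convert a distance measured from the top of the page into a PDF y value. This is a sketch assuming PDFBox 2.x; the class name TopLeftCoordinatesExample, the helper yFromTop, and the output file name are hypothetical.

```java
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class TopLeftCoordinatesExample {

    // Converts a distance measured down from the top edge of the page
    // into a y value in PDF coordinates, which grow upward from the bottom.
    public static float yFromTop(PDPage page, float distanceFromTop) {
        PDRectangle mediaBox = page.getMediaBox();
        return mediaBox.getUpperRightY() - distanceFromTop;
    }

    public static void main(String[] args) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);

            try (PDPageContentStream contentStream = new PDPageContentStream(document, page)) {
                contentStream.beginText();
                contentStream.setFont(PDType1Font.HELVETICA, 12);
                // 72 points is one inch, so this baseline sits one inch below the top edge.
                contentStream.newLineAtOffset(50, yFromTop(page, 72));
                contentStream.showText("One inch from the top");
                contentStream.endText();
            }

            document.save("top-left-coordinates.pdf");
        }
    }
}
```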

Apache PDFBox Tutorial QA Checklist

Use this checklist when reviewing PDFBox code examples or extending this tutorial for a real Java project.

Confirm that the PDFBox dependency version matches the examples used in the project.
Use try-with-resources for PDDocument and PDPageContentStream.
Test extraction examples with both text-based PDFs and scanned PDFs.
Verify page coordinates when writing text, drawing shapes, or reading character positions.
Check licensing requirements before comparing or replacing PDFBox with another PDF library.
Test split, merge, and render operations with password-protected, large, and multi-page PDFs if your application may receive them.

Apache PDFBox FAQs

What is the use of Apache PDFBox in Java?

Apache PDFBox is used in Java applications to create, read, edit, split, merge, render, print, and sign PDF documents. It is useful when PDF processing must be automated in code.

Can Apache PDFBox edit an existing PDF?

Yes. Apache PDFBox can modify existing PDFs by adding pages, writing new content, updating form fields, changing metadata, and performing other supported operations. Editing existing page content can be more complex because PDF files store content as drawing instructions rather than simple editable paragraphs.
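
As an illustration of two of these edits, the following sketch appends a line of text to the first page of an existing PDF and updates the document title. It assumes PDFBox 2.x. The class name AppendToExistingPdfExample, the helper addNoteAndTitle, and the file names are hypothetical.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class AppendToExistingPdfExample {

    // Adds a short note near the bottom of the first page and sets the title.
    public static void addNoteAndTitle(PDDocument document, String note, String title)
            throws IOException {
        PDPage firstPage = document.getPage(0);

        // AppendMode.APPEND keeps the existing page content and draws after it.
        // The last two arguments enable compression and reset the graphics state.
        try (PDPageContentStream contentStream = new PDPageContentStream(
                document, firstPage, AppendMode.APPEND, true, true)) {
            contentStream.beginText();
            contentStream.setFont(PDType1Font.HELVETICA, 10);
            contentStream.newLineAtOffset(50, 30);
            contentStream.showText(note);
            contentStream.endText();
        }

        document.getDocumentInformation().setTitle(title);
    }

    public static void main(String[] args) throws IOException {
        // "sample.pdf" is an assumed input file; the result goes to a new file.
        try (PDDocument document = PDDocument.load(new File("sample.pdf"))) {
            addNoteAndTitle(document, "Reviewed", "Reviewed sample");
            document.save("sample-reviewed.pdf");
        }
    }
}
```

Saving to a new file name keeps the original intact, which is a cautious default while testing edits.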

Can PDFBox extract text from scanned PDFs?

PDFBox can extract text only when text is stored in the PDF. If a scanned PDF contains page images without embedded text, you need OCR processing before meaningful text can be extracted.

What is the difference between PDFBox and iText?

PDFBox is an Apache-licensed open source Java library often used for general PDF processing, extraction, and manipulation. iText is another Java PDF library commonly used for PDF generation and advanced workflows, but its license terms are different, so they should be checked before selecting it for a project.

Is Apache PDFBox a visual PDF editor?

No. Apache PDFBox is not a visual PDF editor like a desktop or online PDF editing tool. It is a developer library used to process PDFs from Java code.

Apache PDFBox Learning Path Summary

In this Apache PDFBox Tutorial, we have covered the main PDFBox operations used in Java projects: setting up the library, extracting text and images, splitting and merging files, creating new PDFs, filling forms, printing, rendering pages as images, and digitally signing documents. Start with setup and text extraction, then move to PDF creation, merge, split, and form examples as your project requires.
