Learn to create, edit, and process PDFs in Java with this Apache PDFBox tutorial. This guide explains what Apache PDFBox is used for, how to set up a Java project, and which PDF operations you can learn step by step.

Apache PDFBox Tutorial for Java PDF Processing
What Apache PDFBox Is Used For
Apache PDFBox is an open source Java library from the Apache Software Foundation for working with PDF documents. You can use it to create new PDF files, read text from existing PDFs, extract images, split and merge documents, work with forms, render pages as images, print PDFs, and apply digital signatures.
PDFBox is useful when a Java application needs programmatic PDF processing instead of manual editing. For example, a backend service may generate reports, an admin tool may merge uploaded files, or a document-processing application may extract text and image metadata from many PDFs.
The official Apache PDFBox project also provides command-line utilities, so some operations can be tested from the terminal before they are implemented inside a Java application.
Set up a Java project with the PDFBox libraries to start working with PDF files.
Apache PDFBox Setup Before Running Java Examples
Before writing PDFBox code, create a Java project and add the PDFBox library dependency. In Maven or Gradle projects, prefer using the dependency coordinates recommended by the official PDFBox getting started page for the version you plan to use. In a simple Java project, add the required PDFBox JAR files and their dependencies to the classpath.
For most examples, you will work with classes such as PDDocument, PDPage, PDPageContentStream, PDFTextStripper, and PDF-related model classes. Always close documents and content streams after use. The simplest way is to use Java try-with-resources blocks.
try (PDDocument document = new PDDocument()) {
    // Create, edit, read, or save the PDF here.
}
Apache PDFBox Features Covered in This Tutorial
The following sections group the main PDFBox operations by task. Use them as a learning path if you are new to Apache PDFBox, or as a reference when you need a specific PDF operation in a Java project.
Extract Text, Words, Coordinates, and Images from PDF Files
Text extraction is one of the most common uses of Apache PDFBox. You can extract plain text, read text line by line, inspect words, and find the coordinates of characters. PDFBox can also help identify and extract images embedded inside a PDF document.
- Extract text from PDF file.
- Extract position and size of characters in the PDF file.
- Extract words from PDF document.
- Extract text line by line from PDF document.
- Get position and size of images in the PDF.
- Extract images from PDF.
A basic text extraction program usually opens the PDF, passes the document to PDFTextStripper, and reads the returned text.
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractPdfTextExample {
    public static void main(String[] args) throws IOException {
        File file = new File("sample.pdf");
        // PDDocument.load is the PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF(file) instead.
        try (PDDocument document = PDDocument.load(file)) {
            // PDFTextStripper walks every page and returns the embedded text as one string.
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println(text);
        }
    }
}
Text extraction depends on how the PDF stores text internally. A scanned PDF that contains only images usually needs OCR before useful text can be extracted.
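One way to act on this in code is to check whether the extracted string holds anything besides whitespace, and to route such files to OCR. The following is a minimal sketch, not part of PDFBox: the class and method names are hypothetical, and the whitespace-only rule is an assumption. An image-only PDF may yield an empty string or only line breaks, but a document that mixes real text with scanned pages will not be caught by this check.

```java
/** Hypothetical helper, not part of PDFBox: flags extraction results with no visible text. */
class ExtractedTextCheck {

    // True when the text is null or holds only whitespace (spaces, tabs, line breaks).
    static boolean isEffectivelyEmpty(String extractedText) {
        return extractedText == null || extractedText.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Whitespace only, as an image-only PDF might produce.
        System.out.println(isEffectivelyEmpty("\r\n \n"));          // true
        // Real embedded text.
        System.out.println(isEffectivelyEmpty("Invoice 2024-001")); // false
    }
}
```

In practice the argument would be the string returned by the text stripper for a whole document or for a single page.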
Split and Merge PDF Documents with Apache PDFBox
PDFBox can divide a large PDF into multiple files or combine several PDFs into one document. These operations are common in document upload systems, report generation workflows, and file management tools.
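Before splitting, it can help to plan which pages go into which output file. The sketch below only does that arithmetic and does not call PDFBox: the class and method names are hypothetical, and it assumes fixed-size chunks with 1-based, inclusive page numbers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planning helper, not part of PDFBox. */
class SplitPlan {

    // Returns 1-based inclusive {firstPage, lastPage} pairs, each covering at most pagesPerFile pages.
    static List<int[]> pageRanges(int totalPages, int pagesPerFile) {
        if (totalPages < 0 || pagesPerFile < 1) {
            throw new IllegalArgumentException("totalPages must be >= 0 and pagesPerFile >= 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int first = 1; first <= totalPages; first += pagesPerFile) {
            // The last chunk may be shorter than pagesPerFile.
            int last = Math.min(first + pagesPerFile - 1, totalPages);
            ranges.add(new int[] {first, last});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-page document split every 4 pages prints 1-4, 5-8, 9-10.
        for (int[] range : pageRanges(10, 4)) {
            System.out.println(range[0] + "-" + range[1]);
        }
    }
}
```

Each range can then be handed to whichever splitting or page-copying call your PDFBox version provides.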
Create New PDF Files and Write Text in Java
Apache PDFBox can create PDF documents from scratch. A typical program creates a PDDocument, adds one or more PDPage objects, opens a PDPageContentStream, writes text or draws content, and saves the file.
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class CreatePdfExample {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = new PDDocument()) {
            PDPage page = new PDPage();
            document.addPage(page);
            // Close the content stream before saving so that the page content is written out.
            try (PDPageContentStream contentStream = new PDPageContentStream(document, page)) {
                contentStream.beginText();
                // PDFBox 2.x constant; PDFBox 3.x uses new PDType1Font(Standard14Fonts.FontName.HELVETICA).
                contentStream.setFont(PDType1Font.HELVETICA, 14);
                // Coordinates are in points, measured from the bottom-left corner of the page.
                contentStream.newLineAtOffset(50, 700);
                contentStream.showText("Hello from Apache PDFBox");
                contentStream.endText();
            }
            document.save("created-with-pdfbox.pdf");
        }
    }
}
Fill PDF Forms Using Apache PDFBox
PDF forms, also known as AcroForms, contain fields such as text boxes, check boxes, radio buttons, and dropdowns. PDFBox can read form fields and set values programmatically when the PDF contains fillable form fields.
- Extract data from PDF form.
- Fill a PDF form.
When working with forms, first inspect the field names in the PDF. Field names are not always the same as the visible labels shown to a user.
Print PDF Files from a Java Application
PDFBox supports printing PDF documents programmatically through Java printing APIs. This is useful for desktop applications, internal document tools, and controlled print workflows where a PDF must be sent to a printer from code.
- Print a PDF file programmatically.
Render PDF Pages and Save Them as Images
PDFBox can render pages as images. This is useful for previews, thumbnails, page snapshots, and visual checks in document-processing workflows.
- Save pages in PDF file as images.
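When rendering previews or thumbnails, the resolution you request determines the image size. PDF page sizes are measured in points, with 72 points per inch by default, so the pixel size at a given DPI can be estimated up front. The helper below is a hypothetical sketch, not part of PDFBox; it assumes an unrotated page, and the rounding used by PDFBox's PDFRenderer (for example in renderImageWithDPI) may differ by a pixel.

```java
/** Hypothetical helper, not part of PDFBox. */
class RenderSize {

    // PDF user space defaults to 72 units (points) per inch.
    static final double POINTS_PER_INCH = 72.0;

    // Approximate pixel length of a page edge rendered at the given DPI.
    static int pixels(double points, double dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    // DPI needed so that an edge of the given length in points maps to targetPixels.
    static double dpiForPixels(double points, int targetPixels) {
        return targetPixels * POINTS_PER_INCH / points;
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points.
        System.out.println(pixels(612, 144) + " x " + pixels(792, 144)); // 1224 x 1584
        // DPI for a thumbnail 153 pixels wide.
        System.out.println(dpiForPixels(612, 153)); // 18.0
    }
}
```

Estimating the size first is a cheap way to avoid rendering very large images when only a thumbnail is needed.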
Digitally Sign PDF Files with Apache PDFBox
Apache PDFBox can be used in PDF signing workflows where a document needs a digital signature. Digital signing requires more than placing an image of a signature on a page. It usually involves certificates, private keys, signature dictionaries, and validation requirements based on the application or organization.
- Digitally sign PDF file.
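To see why a signature image is not enough, it helps to look at the cryptographic step on its own. The sketch below uses only the standard java.security API and does not touch PDFBox or any PDF structure: it signs a byte array with a private key and verifies it with the matching public key. The class name, the throwaway RSA key pair, and the SHA256withRSA algorithm choice are assumptions for illustration. A real PDF signing workflow also has to embed the signature and certificate chain in the document.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

/** Hypothetical demo class: sign and verify raw bytes, independent of PDFBox. */
class SignatureSketch {

    // Produces a signature over the data with the signer's private key.
    static byte[] sign(byte[] data, PrivateKey key) throws GeneralSecurityException {
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(key);
        signer.update(data);
        return signer.sign();
    }

    // Checks the signature against the data with the signer's public key.
    static boolean verify(byte[] data, byte[] signature, PublicKey key) throws GeneralSecurityException {
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(key);
        verifier.update(data);
        return verifier.verify(signature);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        // Throwaway key pair for the demo; real workflows load a key and certificate from a keystore.
        KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
        generator.initialize(2048);
        KeyPair pair = generator.generateKeyPair();

        byte[] original = "document bytes".getBytes(StandardCharsets.UTF_8);
        byte[] tampered = "document bytez".getBytes(StandardCharsets.UTF_8);
        byte[] signature = sign(original, pair.getPrivate());

        System.out.println(verify(original, signature, pair.getPublic())); // true
        System.out.println(verify(tampered, signature, pair.getPublic())); // false
    }
}
```

Changing a single byte makes verification fail, which is the property that a pasted signature image cannot provide.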
Apache PDFBox Compared with Manual PDF Editors and iText
Apache PDFBox is a Java library for developers. It is different from a manual PDF editor, where a user opens a PDF and edits it visually. Use PDFBox when a Java program must create, read, convert, split, merge, or update PDFs automatically.
PDFBox and iText are both Java PDF libraries, but they are often chosen for different project needs. PDFBox is released under the Apache License 2.0 and is commonly used for PDF extraction, manipulation, and general processing. iText is also widely used for PDF generation and advanced PDF workflows, but it is distributed under the AGPL with a separate commercial license, so review its terms carefully before using it in commercial applications.
Common Apache PDFBox Errors and Practical Checks
- Blank extracted text: Check whether the PDF is scanned or image-only. PDFBox reads embedded text, not text inside images.
- File locked or not saved: Confirm that PDDocument and PDPageContentStream are closed properly.
- Text appears in the wrong position: PDF coordinates start from the bottom-left of the page, so verify the x and y values used in the content stream.
- Font-related issues: Use fonts supported by your PDFBox version and embed fonts when the output must look the same on different systems.
- Large memory usage: Process large PDFs carefully and test with realistic files before using the code in production.
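For the coordinate issue in particular, a small conversion helper can prevent mistakes when a layout is specified from the top of the page, as in most screen-based tools. This is a hypothetical sketch, not part of PDFBox. It assumes an unrotated page whose origin is at the bottom-left corner, and that the resulting y value is the text baseline.

```java
/** Hypothetical helper, not part of PDFBox. */
class PdfCoordinates {

    // Converts a distance measured down from the top edge into a PDF y value measured up from the bottom edge.
    static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        float letterHeight = 792f; // US Letter page height in points
        // Text placed 92 points below the top edge sits at y = 700.
        System.out.println(yFromTop(letterHeight, 92f)); // 700.0
    }
}
```

In a real program the page height would come from the page's media box rather than a constant.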
Apache PDFBox Tutorial QA Checklist
Use this checklist when reviewing PDFBox code examples or extending this tutorial for a real Java project.
- Confirm that the PDFBox dependency version matches the examples used in the project.
- Use try-with-resources for PDDocument and PDPageContentStream.
- Test extraction examples with both text-based PDFs and scanned PDFs.
- Verify page coordinates when writing text, drawing shapes, or reading character positions.
- Check licensing requirements before comparing or replacing PDFBox with another PDF library.
- Test split, merge, and render operations with password-protected, large, and multi-page PDFs if your application may receive them.
Apache PDFBox FAQs
What is the use of Apache PDFBox in Java?
Apache PDFBox is used in Java applications to create, read, edit, split, merge, render, print, and sign PDF documents. It is useful when PDF processing must be automated in code.
Can Apache PDFBox edit an existing PDF?
Yes. Apache PDFBox can modify existing PDFs by adding pages, writing new content, updating form fields, changing metadata, and performing other supported operations. Editing existing page content can be more complex because PDF files store content as drawing instructions rather than simple editable paragraphs.
Can PDFBox extract text from scanned PDFs?
PDFBox can extract text only when text is stored in the PDF. If a scanned PDF contains page images without embedded text, you need OCR processing before meaningful text can be extracted.
What is the difference between PDFBox and iText?
PDFBox is an open source Java library released under the Apache License 2.0 and often used for general PDF processing, extraction, and manipulation. iText is another Java PDF library commonly used for PDF generation and advanced workflows, but it is distributed under the AGPL with a separate commercial license, so check its terms before selecting it for a project.
Is Apache PDFBox a visual PDF editor?
No. Apache PDFBox is not a visual PDF editor like a desktop or online PDF editing tool. It is a developer library used to process PDFs from Java code.
Apache PDFBox Learning Path Summary
In this Apache PDFBox Tutorial, we have covered the main PDFBox operations used in Java projects: setting up the library, extracting text and images, splitting and merging files, creating new PDFs, filling forms, printing, rendering pages as images, and digitally signing documents. Start with setup and text extraction, then move to PDF creation, merge, split, and form examples as your project requires.
TutorialKart.com