Extract Images from PDF using PDFBox in Java
We can extract embedded images from a PDF using Apache PDFBox in a Java program.
In this tutorial, we shall learn to extract images from PDF using PDFBox and save the images to a local folder. The example uses PDFStreamEngine to inspect the PDF page content stream, detect image drawing operations, read each image as a BufferedImage, and write it as a PNG file.
This method is useful when you want the actual images embedded inside the PDF. It is different from converting each PDF page into a full-page image. If you need page screenshots or thumbnails, you should render PDF pages as images instead of extracting embedded image objects.
Before You Extract Images from PDF using PDFBox
Make sure that your Java project has Apache PDFBox added to the classpath. You also need a PDF file that contains embedded images. A scanned PDF may contain full-page images, while a text-based PDF may contain smaller logos, diagrams, or photos as separate image objects.
- Use PDDocument to open the PDF file.
- Use PDFStreamEngine to process each page content stream.
- Look for the Do operator, which can draw image or form XObjects.
- Check whether the object is a PDImageXObject.
- Use ImageIO.write() to save the extracted image to disk.
Embedded Image Extraction vs PDF Page to Image Conversion
When you extract images from a PDF using PDFBox, you get the image resources stored inside the PDF. The output may be a logo, photo, scanned page image, or any other image object used by the page. When you convert a PDF page into an image, you get one complete image of the whole page, including text, shapes, and images. Choose the method based on the output you need.
| Task | What You Get | PDFBox Approach |
|---|---|---|
| Extract embedded images | Individual image objects from the PDF | Process page resources and detect PDImageXObject |
| Turn PDF pages into images | One image per PDF page | Render pages with PDF rendering APIs |
| Extract text and images together | Text content plus image files | Use text extraction and image extraction separately |
Steps to Extract Images from PDF using PDFBox
The following is a step-by-step process to extract images from a PDF using PDFBox.
1. Extend PDFStreamEngine to Read PDF Page Operations
Create a Java class that extends PDFStreamEngine. This allows your class to process the content stream operations used on each PDF page.
public class GetImageLocationsAndSize extends PDFStreamEngine
2. Call processPage() for Every PDF Page
For each page in the PDF document, call processPage(page). This sends the page content stream to your PDFStreamEngine subclass.
for( PDPage page : document.getPages() ) {
    pageNum++;
    printer.processPage(page);
}
3. Override processOperator() to Detect PDF Drawing Commands
processPage() calls processOperator() for each operator in the page content stream. Override processOperator() and check the operator name. Image and form XObjects are drawn using the Do operator.
@Override
protected void processOperator( Operator operator, List<COSBase> operands) throws IOException{
    . . .
}
4. Check Whether the PDF XObject Is an Image
Use the name passed as the Do operand to look up the XObject in the page resources, then check whether it is an image. In PDFBox, embedded image resources are represented as PDImageXObject.
if( xobject instanceof PDImageXObject){
    . . .
}
5. Save the Extracted PDF Image to a Local File
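If the object is an image, get it as a BufferedImage and save it to a local file. PDImageXObject.getImage() returns the decoded image as a BufferedImage in an (A)RGB color space, so it may or may not carry an alpha channel depending on the source image.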
BufferedImage bImage = image.getImage();
ImageIO.write(bImage,"PNG",new File("image_name.png"));
The example saves every extracted image as PNG. You can change the file extension and the image format argument in ImageIO.write() if your application needs another supported output format.
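One detail worth guarding against when changing the format: ImageIO.write() returns false, rather than throwing, when no registered writer can handle the given image and format name. The following is a minimal sketch of how that could be checked. The class name CheckedImageWriter and both helper methods are hypothetical and are not part of PDFBox or the original example.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.imageio.ImageTypeSpecifier;

// Hypothetical helper class, shown only as an illustration.
class CheckedImageWriter {

    // Returns true if some registered ImageIO writer claims it can encode
    // this kind of image in the given format (for example "png" or "jpg").
    static boolean canWrite(BufferedImage image, String format) {
        ImageTypeSpecifier type = ImageTypeSpecifier.createFromRenderedImage(image);
        return ImageIO.getImageWriters(type, format).hasNext();
    }

    // Writes the image and turns a silent "false" result into an exception,
    // so that a missing output file does not go unnoticed.
    static void writeOrFail(BufferedImage image, String format, File target) throws IOException {
        if (!ImageIO.write(image, format, target)) {
            throw new IOException("No ImageIO writer accepted format '" + format + "' for " + target);
        }
    }
}
```

In the extraction code, a call such as CheckedImageWriter.writeOrFail(bImage, "PNG", new File("image_name.png")) could stand in for the direct ImageIO.write() call.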
Example 1 – Extract Images from PDF using PDFBox
In this example, we take a PDF and extract all of its images by overriding the processOperator() method of PDFStreamEngine.
SaveImagesInPdf.java
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.PDFStreamEngine;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.List;
import javax.imageio.ImageIO;
/**
 * This is an example of how to extract images from a PDF.
 */
public class SaveImagesInPdf extends PDFStreamEngine
{
    // Counter used to number the saved image files.
    private int imageNumber = 1;
    /**
     * @param args The command line arguments.
     *
     * @throws IOException If there is an error parsing the document.
     */
    public static void main( String[] args ) throws IOException
    {
        String fileName = "apache.pdf";
        // try-with-resources closes the document even if processing fails.
        try( PDDocument document = PDDocument.load( new File(fileName) ) )
        {
            SaveImagesInPdf printer = new SaveImagesInPdf();
            int pageNum = 0;
            for( PDPage page : document.getPages() )
            {
                pageNum++;
                System.out.println( "Processing page: " + pageNum );
                printer.processPage(page);
            }
        }
    }
    /**
     * @param operator The operation to perform.
     * @param operands The list of arguments.
     *
     * @throws IOException If there is an error processing the operation.
     */
    @Override
    protected void processOperator( Operator operator, List<COSBase> operands) throws IOException
    {
        String operation = operator.getName();
        if( "Do".equals(operation) )
        {
            // The operand of Do is the name of an XObject in the current resources.
            COSName objectName = (COSName) operands.get( 0 );
            PDXObject xobject = getResources().getXObject( objectName );
            if( xobject instanceof PDImageXObject)
            {
                PDImageXObject image = (PDImageXObject)xobject;
                // Save the image to a local file.
                BufferedImage bImage = image.getImage();
                ImageIO.write(bImage,"PNG",new File("image_"+imageNumber+".png"));
                System.out.println("Image saved.");
                imageNumber++;
            }
            else if(xobject instanceof PDFormXObject)
            {
                // A form XObject has its own content stream, which may draw images.
                PDFormXObject form = (PDFormXObject)xobject;
                showForm(form);
            }
        }
        else
        {
            super.processOperator( operator, operands);
        }
    }
}
Output
Processing page: 1
Image saved.
Image saved.
Image saved.
Processing page: 2
Image saved.
Image saved.
Processing page: 3
Processing page: 4
Download the PDF document apache.pdf if you would like to use the same PDF file. Otherwise, set fileName in the Java program to the path of your own PDF file.
How the PDFBox Image Extraction Example Works
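The program opens the PDF with PDDocument.load() and loops through each PDPage. For every page, processPage(page) reads the content stream and calls processOperator() for each operator in it. Note that PDDocument.load() is the PDFBox 2.x API; in PDFBox 3.x, documents are opened with Loader.loadPDF() instead.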
Inside processOperator(), the code checks for the Do operation. The operand gives the name of the XObject resource. The program then reads that resource using getResources().getXObject(objectName). If the resource is a PDImageXObject, the code extracts it as a BufferedImage and saves it with a numbered file name such as image_1.png, image_2.png, and so on.
The example also checks for PDFormXObject. This is important because some images are placed inside form XObjects. Calling showForm(form) allows PDFBox to process the form content so that images nested inside forms can also be found.
Save Extracted PDF Images with Safer File Names
In a real application, save extracted images into a dedicated output directory and include the page number in the file name. This makes it easier to trace where each image came from, especially when a PDF has many pages. In the example above, pageNum is a local variable of main(), so keep the current page number in a field of the class if processOperator() needs to read it.
File outputDirectory = new File("pdf-images");
if (!outputDirectory.exists()) {
    outputDirectory.mkdirs();
}
File outputFile = new File(outputDirectory, "page_" + pageNum + "_image_" + imageNumber + ".png");
ImageIO.write(bImage, "PNG", outputFile);
This small change keeps extracted images out of the project root folder. To avoid accidental overwriting when you process more than one PDF file, also use a separate output directory per PDF or add the source file name to the image name.
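As a rough sketch of a naming scheme that also includes the source PDF name, the helper below builds file names of the form name_page_NNN_image_NNN.png and creates the output directory on demand. The class ImageOutputPaths, its methods, and the exact naming pattern are assumptions made for illustration; adapt them to your own conventions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

// Hypothetical helper class, shown only as an illustration.
class ImageOutputPaths {

    // Builds a file name such as "apache_page_002_image_005.png".
    static String fileNameFor(String pdfFileName, int pageNum, int imageNumber) {
        // Drop a trailing ".pdf" (case-insensitive) from the source name.
        String base = pdfFileName.replaceFirst("(?i)\\.pdf$", "");
        // Replace characters that are awkward in file names with underscores.
        base = base.replaceAll("[^A-Za-z0-9._-]", "_");
        return String.format(Locale.ROOT, "%s_page_%03d_image_%03d.png", base, pageNum, imageNumber);
    }

    // Creates the output directory if needed and returns the target path.
    static Path resolve(Path outputDirectory, String pdfFileName, int pageNum, int imageNumber)
            throws IOException {
        Files.createDirectories(outputDirectory);
        return outputDirectory.resolve(fileNameFor(pdfFileName, pageNum, imageNumber));
    }
}
```

The returned Path can be passed to ImageIO.write() through toFile(). Zero-padded numbers keep the files in page order when a directory listing is sorted by name.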
Common Issues While Extracting Images from PDF using PDFBox
- No images are saved: The PDF may not contain embedded image objects, or the visible content may be vector graphics instead of raster images.
- Only one full-page image is extracted: The PDF may be a scanned document where each page is stored as a single image.
- Duplicate-looking images are saved: Some PDFs reuse images or store masks and image resources separately.
- Images inside forms are missed: Make sure form XObjects are processed with showForm(form).
- Output file is overwritten: Use unique file names that include page number, image number, or source PDF name.
- Large PDFs use more memory: Test with realistic PDF files and process documents carefully in batch jobs.
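For the duplicate-looking images mentioned in the list above, one possible approach is to fingerprint the decoded pixels and skip any image whose fingerprint has already been seen. The sketch below uses only the JDK; the class ImageDeduplicator is hypothetical, and hashing decoded pixels is just one option (it treats two images as equal only when their dimensions and every pixel match).

```java
import java.awt.image.BufferedImage;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, shown only as an illustration.
class ImageDeduplicator {

    private final Set<String> seen = new HashSet<>();

    // Returns true the first time an image with these dimensions and pixels
    // is offered, and false for every later identical image.
    boolean isNew(BufferedImage image) {
        return seen.add(fingerprint(image));
    }

    // SHA-256 over width, height, and the ARGB value of every pixel.
    static String fingerprint(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        int[] pixels = image.getRGB(0, 0, width, height, null, 0, width);
        ByteBuffer buffer = ByteBuffer.allocate(8 + pixels.length * 4);
        buffer.putInt(width).putInt(height);
        for (int pixel : pixels) {
            buffer.putInt(pixel);
        }
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            return Base64.getEncoder().encodeToString(sha.digest(buffer.array()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }
}
```

In processOperator(), the save step could then be wrapped in a check such as if (deduplicator.isNew(bImage)) { ... }. This copies all pixels into memory once more, so it may not suit very large scanned pages.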
PDFBox Extract Images QA Checklist
Use this checklist when reviewing Java code that extracts images from PDF files using PDFBox.
- Confirm that the PDFBox version in the project matches the imports used in the code.
- Close PDDocument after processing to release file handles and memory.
- Test with PDFs that contain normal embedded images, scanned pages, and images inside form XObjects.
- Use a dedicated output directory for extracted image files.
- Include page number or source file name in the saved image name.
- Check whether your use case needs embedded images or full-page rendered images.
- Handle password-protected or invalid PDF files before running batch extraction.
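For the last checklist item, a cheap pre-check before a batch run is to confirm that each input file starts with the %PDF- header. The sketch below is JDK-only and assumes Java 11 or later for readNBytes(); the class PdfPrecheck is hypothetical. It only filters out files that are obviously not PDFs. Encrypted, truncated, or otherwise damaged PDFs still have to be handled by catching the exceptions PDFBox throws while loading, and a PDF with stray bytes before its header would be rejected by this strict check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper class, shown only as an illustration.
class PdfPrecheck {

    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given leading bytes start with "%PDF-".
    static boolean hasPdfHeader(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    // Reads only the first few bytes of the file and checks the header.
    static boolean hasPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasPdfHeader(in.readNBytes(MAGIC.length));
        }
    }
}
```

A batch job could skip and log any file for which hasPdfHeader() returns false, then open the remaining files inside a try block so that one bad document does not stop the run.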
FAQs on Extracting Images from PDF using PDFBox
How do I extract all images from a PDF using PDFBox?
Open the PDF with PDDocument, process each page with a PDFStreamEngine subclass, detect the Do operator, check whether the XObject is a PDImageXObject, and save the image using ImageIO.write().
Can PDFBox extract images from every PDF file?
PDFBox can extract images that are stored as image objects in the PDF. If the visible content is vector artwork, text, or a rendered appearance rather than an embedded raster image, there may be no separate image file to extract.
Why does PDFBox extract a whole page as one image?
This usually happens with scanned PDFs. In many scanned documents, each page is stored as one large image. PDFBox extracts that image because it is the actual image object inside the PDF.
Can I save extracted PDF images as JPG instead of PNG?
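Yes. Change the format argument in ImageIO.write() and use a matching file extension, for example "JPG" with a .jpg file name. Keep in mind that JPEG does not support transparency: if the extracted image has an alpha channel, ImageIO.write() may return false without writing a file (the exact behavior depends on the JDK), so convert the image to an opaque RGB image first and check the return value.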
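A minimal sketch of preparing an image for JPG output is shown below. It paints the image onto a white RGB canvas so that no alpha channel remains. The class JpegReady and the choice of a white background are assumptions for illustration; pick whatever background suits your images.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper class, shown only as an illustration.
class JpegReady {

    // Returns an opaque RGB copy of the source image. Transparent areas
    // end up white because the canvas is filled before the image is drawn.
    static BufferedImage toOpaqueRgb(BufferedImage source) {
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }
}
```

The result can then be written with ImageIO.write(JpegReady.toOpaqueRgb(bImage), "JPG", new File("image_1.jpg")). JPEG is lossy, so PNG usually remains the safer choice for logos, diagrams, and scanned text.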
Is extracting PDF images the same as converting PDF pages to images?
No. Extracting images saves the embedded image resources from the PDF. Converting PDF pages to images creates a rendered image of each full page, including text, shapes, and images together.
Conclusion
In this Apache PDFBox tutorial, we have learned to extract images from a PDF using the PDFStreamEngine class, obtain each image as a BufferedImage, and save it to a local file. We also covered when to use embedded image extraction, how the Do operator and PDImageXObject are used, and what to check when a PDF does not produce the expected image files.
TutorialKart.com