Extract Images from PDF using PDFBox in Java

We can extract embedded images from a PDF using Apache PDFBox in a Java program.

In this tutorial, we shall learn to extract images from PDF using PDFBox and save the images to a local folder. The example uses PDFStreamEngine to inspect the PDF page content stream, detect image drawing operations, read each image as a BufferedImage, and write it as a PNG file.

This method is useful when you want the actual images embedded inside the PDF. It is different from converting each PDF page into a full-page image. If you need page screenshots or thumbnails, you should render PDF pages as images instead of extracting embedded image objects.

Before You Extract Images from PDF using PDFBox

Make sure that your Java project has Apache PDFBox added to the classpath. You also need a PDF file that contains embedded images. A scanned PDF may contain full-page images, while a text-based PDF may contain smaller logos, diagrams, or photos as separate image objects.

  • Use PDDocument to open the PDF file.
  • Use PDFStreamEngine to process each page content stream.
  • Look for the Do operator, which can draw image or form XObjects.
  • Check whether the object is a PDImageXObject.
  • Use ImageIO.write() to save the extracted image to disk.

Embedded Image Extraction vs PDF Page to Image Conversion

When you extract images from a PDF using PDFBox, you get the image resources stored inside the PDF. The output may be a logo, photo, scanned page image, or any other image object used by the page. When you convert a PDF page into an image, you get one complete image of the whole page, including text, shapes, and images. Choose the method based on the output you need.

Task | What You Get | PDFBox Approach
Extract embedded images | Individual image objects from the PDF | Process page resources and detect PDImageXObject
Turn PDF pages into images | One image per PDF page | Render pages with PDF rendering APIs
Extract text and images together | Text content plus image files | Use text extraction and image extraction separately

Steps to Extract Images from PDF using PDFBox

Follow these steps to extract images from a PDF using PDFBox.

1. Extend PDFStreamEngine to Read PDF Page Operations

Create a Java class that extends PDFStreamEngine. This allows your class to process the content stream operators used on each PDF page.

public class SaveImagesInPdf extends PDFStreamEngine

2. Call processPage() for Every PDF Page

For each page in the PDF document, call processPage(page). This sends the page content stream to your PDFStreamEngine subclass.

for( PDPage page : document.getPages() ) {
	pageNum++;
	printer.processPage(page);
}

3. Override processOperator() to Detect PDF Drawing Commands

While processPage() reads the page content stream, it calls processOperator() for each operator it encounters. Override processOperator() and check the operator name. Image and form XObjects are drawn using the Do operator.

@Override
protected void processOperator( Operator operator, List<COSBase> operands) throws IOException{
	. . .
}

4. Check Whether the PDF XObject Is an Image

The operand of the Do operator is the name of an XObject resource. Look up that resource and check whether it is an image object. In PDFBox, embedded image resources are represented as PDImageXObject.

if( xobject instanceof PDImageXObject){
	. . .
}

5. Save the Extracted PDF Image to a Local File

If the object is an image object, get its BufferedImage and save it to a local file. PDImageXObject.getImage() returns the decoded image as a BufferedImage in an RGB or ARGB color space.

BufferedImage bImage = image.getImage();
ImageIO.write(bImage,"PNG",new File("image_name.png"));

The example saves every extracted image as PNG. You can change the file extension and the image format argument in ImageIO.write() if your application needs another supported output format.
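One detail worth checking when you change the format: ImageIO.write() returns false, rather than throwing an exception, when no installed writer can encode the image in the requested format. The following is a minimal sketch that uses only the JDK. The class name ImageFormatWriter and the method encode() are hypothetical names chosen for illustration, and the sample image is created in memory instead of being read from a PDF.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
class ImageFormatWriter {

    /** Encodes the image in the given format and fails loudly if no writer accepts it. */
    static byte[] encode(BufferedImage image, String format) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // ImageIO.write() signals "no suitable writer" through its return value.
        boolean written = ImageIO.write(image, format, out);
        if (!written) {
            throw new IOException("No ImageIO writer could encode this image as " + format);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the BufferedImage returned by PDImageXObject.getImage().
        BufferedImage sample = new BufferedImage(4, 4, BufferedImage.TYPE_INT_ARGB);
        System.out.println("PNG bytes: " + encode(sample, "PNG").length);
    }
}
```

In the extraction example, the same check could be applied to the result of ImageIO.write(bImage, "PNG", file) before printing "Image saved."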

Example 1 – Extract Images from PDF using PDFBox

In this example, we take a PDF and extract all of its images by overriding the processOperator() method of PDFStreamEngine.

SaveImagesInPdf.java

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.PDFStreamEngine;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.List;

import javax.imageio.ImageIO;

/**
 * This is an example of how to extract images from a PDF.
 */
public class SaveImagesInPdf extends PDFStreamEngine
{
	/**
	 * Default constructor.
	 *
	 * @throws IOException If there is an error loading text stripper properties.
	 */
	public SaveImagesInPdf() throws IOException
	{
	}

	public int imageNumber = 1;

	/**
	 * @param args The command line arguments.
	 *
	 * @throws IOException If there is an error parsing the document.
	 */
	public static void main( String[] args ) throws IOException
	{
		PDDocument document = null;
		String fileName = "apache.pdf";
		try
		{
			document = PDDocument.load( new File(fileName) );
			SaveImagesInPdf printer = new SaveImagesInPdf();
			int pageNum = 0;
			for( PDPage page : document.getPages() )
			{
				pageNum++;
				System.out.println( "Processing page: " + pageNum );
				printer.processPage(page);
			}
		}
		finally
		{
			if( document != null )
			{
				document.close();
			}
		}
	}

	/**
	 * @param operator The operation to perform.
	 * @param operands The list of arguments.
	 *
	 * @throws IOException If there is an error processing the operation.
	 */
	@Override
	protected void processOperator( Operator operator, List<COSBase> operands) throws IOException
	{
		String operation = operator.getName();
		if( "Do".equals(operation) )
		{
			COSName objectName = (COSName) operands.get( 0 );
			PDXObject xobject = getResources().getXObject( objectName );
			if( xobject instanceof PDImageXObject)
			{
				PDImageXObject image = (PDImageXObject)xobject;

				// save image to a local file
				BufferedImage bImage = image.getImage();
				ImageIO.write(bImage,"PNG",new File("image_"+imageNumber+".png"));
				System.out.println("Image saved.");
				imageNumber++;

			}
			else if(xobject instanceof PDFormXObject)
			{
				PDFormXObject form = (PDFormXObject)xobject;
				showForm(form);
			}
		}
		else
		{
			super.processOperator( operator, operands);
		}
	}

}

Output

Processing page: 1
Image saved.
Image saved.
Image saved.
Processing page: 2
Image saved.
Image saved.
Processing page: 3
Processing page: 4

You can download apache.pdf if you would like to use the same PDF file. Otherwise, set fileName in the Java program to the path of your own PDF file.

How the PDFBox Image Extraction Example Works

The program opens the PDF with PDDocument.load() and loops through each PDPage. PDDocument.load() is the PDFBox 2.x API; PDFBox 3.x replaces it with Loader.loadPDF(). For every page, processPage(page) reads the content stream and calls processOperator() for each PDF drawing operation.

Inside processOperator(), the code checks for the Do operation. The operand gives the name of the XObject resource. The program then reads that resource using getResources().getXObject(objectName). If the resource is a PDImageXObject, the code extracts it as a BufferedImage and saves it with a numbered file name such as image_1.png, image_2.png, and so on.
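To make the Do operation more concrete, here is a toy sketch of what it looks like in a decoded content stream: the operand (a resource name such as /Im1) comes first and the operator follows it. This is not how PDFBox parses content streams. The class DoOperatorScan, the method findDoOperands(), and the sample stream are all made up for illustration, and the whitespace-only tokenizer ignores strings, comments, and inline images that a real parser must handle.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only; PDFBox uses a full PDF stream parser instead.
class DoOperatorScan {

    /** Returns the resource names that appear directly before a Do operator. */
    static List<String> findDoOperands(String contentStream) {
        List<String> names = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // PDF syntax is postfix: operands first, then the operator ("/Im1 Do").
            if ("Do".equals(tokens[i]) && tokens[i - 1].startsWith("/")) {
                names.add(tokens[i - 1].substring(1)); // drop the leading slash
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical page content: place an image, then a form XObject.
        String stream = "q 200 0 0 100 50 600 cm /Im1 Do Q q 1 0 0 1 0 0 cm /Fm0 Do Q";
        System.out.println(findDoOperands(stream)); // prints [Im1, Fm0]
    }
}
```

The names found this way correspond to the objectName that the example passes to getResources().getXObject(objectName). Whether a name refers to an image or a form is only known after that lookup.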

The example also checks for PDFormXObject. This is important because some images are placed inside form XObjects. Calling showForm(form) allows PDFBox to process the form content so that images nested inside forms can also be found.

Save Extracted PDF Images with Safer File Names

In a real application, save extracted images into a specific output directory and include the page number in the file name. This makes it easier to trace where each image came from, especially when a PDF has many pages.

File outputDirectory = new File("pdf-images");
if (!outputDirectory.exists()) {
    outputDirectory.mkdirs();
}

File outputFile = new File(outputDirectory, "page_" + pageNum + "_image_" + imageNumber + ".png");
ImageIO.write(bImage, "PNG", outputFile);

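This small change avoids writing all images into the project root folder. To prevent accidental overwriting when you process more than one PDF file, also include the source PDF name in the file name or use one output directory per PDF. Note that in Example 1 pageNum is a local variable in main(), so it must be stored in a field of the class (as imageNumber is) before processOperator() can use it in the file name.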
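The naming scheme can be wrapped in a small helper so that every saved file also carries the source PDF name. This is a sketch under assumptions: ImageFileNames, buildFileName(), and resolve() are hypothetical names, and the "name_page_N_image_M.png" pattern is just one reasonable convention.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper for building unique output file names.
class ImageFileNames {

    /** Builds a name such as "report_page_2_image_5.png" from the source PDF name. */
    static String buildFileName(String pdfFileName, int pageNum, int imageNumber) {
        String base = new File(pdfFileName).getName(); // strip any directory part
        int dot = base.lastIndexOf('.');
        if (dot > 0) {
            base = base.substring(0, dot); // drop the extension, e.g. ".pdf"
        }
        return base + "_page_" + pageNum + "_image_" + imageNumber + ".png";
    }

    /** Creates the output directory if needed and returns the target file inside it. */
    static File resolve(File outputDirectory, String pdfFileName, int pageNum, int imageNumber)
            throws IOException {
        Files.createDirectories(outputDirectory.toPath());
        return new File(outputDirectory, buildFileName(pdfFileName, pageNum, imageNumber));
    }

    public static void main(String[] args) throws IOException {
        File target = resolve(new File("pdf-images"), "apache.pdf", 1, 3);
        System.out.println(target.getPath());
    }
}
```

The returned File can be passed directly to ImageIO.write(). Files.createDirectories() does nothing if the directory already exists, so the helper is safe to call for every image.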

Common Issues While Extracting Images from PDF using PDFBox

  • No images are saved: The PDF may not contain embedded image objects, or the visible content may be vector graphics instead of raster images.
  • Only one full-page image is extracted: The PDF may be a scanned document where each page is stored as a single image.
  • Duplicate-looking images are saved: Some PDFs reuse images or store masks and image resources separately.
  • Images inside forms are missed: Make sure form XObjects are processed with showForm(form).
  • Output file is overwritten: Use unique file names that include page number, image number, or source PDF name.
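  • Large PDFs use more memory: getImage() decodes each image into an uncompressed BufferedImage, so high-resolution scans can need far more memory than the PDF file size suggests. Test with realistic PDF files, and in batch jobs process one document at a time and close it before opening the next.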
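For the duplicate-image issue, one possible approach is to fingerprint the decoded pixels and skip images that were already saved. The sketch below uses only the JDK. ImageDeduplicator, isNew(), and fingerprint() are hypothetical names, and hashing pixels is an assumption of this sketch rather than something PDFBox does for you. It only catches pixel-identical images, and it copies all pixels into memory, so it may be too expensive for very large scans.

```java
import java.awt.image.BufferedImage;
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper that remembers which images were already seen.
class ImageDeduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true only the first time an image with exactly these pixels is offered. */
    boolean isNew(BufferedImage image) {
        return seen.add(fingerprint(image));
    }

    /** SHA-256 over the image dimensions and the ARGB value of every pixel. */
    static String fingerprint(BufferedImage image) {
        int w = image.getWidth();
        int h = image.getHeight();
        int[] pixels = image.getRGB(0, 0, w, h, null, 0, w);
        ByteBuffer buffer = ByteBuffer.allocate(8 + pixels.length * 4);
        buffer.putInt(w).putInt(h);
        for (int pixel : pixels) {
            buffer.putInt(pixel);
        }
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return new BigInteger(1, digest.digest(buffer.array())).toString(16);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a standard algorithm that Java platforms are required to provide.
            throw new IllegalStateException(e);
        }
    }
}
```

In processOperator(), the call to ImageIO.write() could then be guarded with a check such as if (deduplicator.isNew(bImage)), where deduplicator is a field holding one ImageDeduplicator per document.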

PDFBox Extract Images QA Checklist

Use this checklist when reviewing Java code that extracts images from PDF files using PDFBox.

  • Confirm that the PDFBox version in the project matches the imports used in the code.
  • Close PDDocument after processing to release file handles and memory.
  • Test with PDFs that contain normal embedded images, scanned pages, and images inside form XObjects.
  • Use a dedicated output directory for extracted image files.
  • Include page number or source file name in the saved image name.
  • Check whether your use case needs embedded images or full-page rendered images.
  • Handle password-protected or invalid PDF files before running batch extraction.
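For the last checklist item, a cheap pre-filter in a batch job is to confirm that each file starts with the "%PDF-" header before handing it to PDFBox. The following sketch uses only the JDK (Java 11 or later for readNBytes). PdfPreCheck and hasPdfHeader() are hypothetical names. The check is deliberately strict and shallow: it rejects files with junk before the header that some readers tolerate, and it says nothing about encryption or internal damage, which only show up when the document is actually loaded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical pre-check, not a replacement for loading the file with PDFBox.
class PdfPreCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the "%PDF-" marker. */
    static boolean hasPdfHeader(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads only the first few bytes of the file and checks them. */
    static boolean hasPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasPdfHeader(in.readNBytes(MAGIC.length));
        }
    }
}
```

Files that fail this check can be logged and skipped, while files that pass should still be loaded inside a try block so that one unreadable or protected document does not stop the whole batch.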

FAQs on Extracting Images from PDF using PDFBox

How do I extract all images from a PDF using PDFBox?

Open the PDF with PDDocument, process each page with a PDFStreamEngine subclass, detect the Do operator, check whether the XObject is a PDImageXObject, and save the image using ImageIO.write().

Can PDFBox extract images from every PDF file?

PDFBox can extract images that are stored as image objects in the PDF. If the visible content is vector artwork, text, or a rendered appearance rather than an embedded raster image, there may be no separate image file to extract.

Why does PDFBox extract a whole page as one image?

This usually happens with scanned PDFs. In many scanned documents, each page is stored as one large image. PDFBox extracts that image because it is the actual image object inside the PDF.

Can I save extracted PDF images as JPG instead of PNG?

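Yes. Change the format argument in ImageIO.write() and use a matching file extension, for example "JPG" with a .jpg file name. JPEG is lossy and has no alpha channel, so it suits photos better than logos or images with transparency. An image with an alpha channel may need to be converted to plain RGB before it can be written as JPEG.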
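Because JPEG cannot store transparency, one way to prepare an extracted image is to paint it onto an opaque RGB canvas first. This is a minimal JDK-only sketch. JpegExport, toOpaqueRgb(), and encodeJpeg() are hypothetical names, and the white background is an assumption that you may want to change.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for writing images that may contain transparency as JPEG.
class JpegExport {

    /** Copies the image onto a white, opaque TYPE_INT_RGB canvas of the same size. */
    static BufferedImage toOpaqueRgb(BufferedImage source) {
        BufferedImage rgb = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.setColor(Color.WHITE); // transparent areas end up white
            g.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    /** Converts to RGB and encodes as JPEG, checking the ImageIO.write() result. */
    static byte[] encodeJpeg(BufferedImage source) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(toOpaqueRgb(source), "jpg", out)) {
            throw new IOException("No JPEG writer available");
        }
        return out.toByteArray();
    }
}
```

In the extraction example, you could pass toOpaqueRgb(bImage) to ImageIO.write() with "jpg" and a .jpg file name instead of writing bImage directly.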

Is extracting PDF images the same as converting PDF pages to images?

No. Extracting images saves the embedded image resources from the PDF. Converting PDF pages to images creates a rendered image of each full page, including text, shapes, and images together.

Conclusion

In this Apache PDFBox tutorial, we learned to extract images from a PDF using the PDFStreamEngine class and to save each extracted BufferedImage to a local file. We also covered when to use embedded image extraction, how the Do operator and PDImageXObject are used, and what to check when a PDF does not produce the expected image files.