Detect whether a PDF is scanned or contains selectable text

Before running text extraction, check whether the PDF was digitally created or is a scan of a physical document.

rust
use pdfluent::PdfDocument;

fn main() -> pdfluent::Result<()> {
    let doc = PdfDocument::open("document.pdf")?;

    for (i, page) in doc.pages().enumerate() {
        let has_text = page.has_selectable_text();
        let has_images = page.has_raster_images();

        println!(
            "Page {}: text={} images={}  => {}",
            i + 1,
            has_text,
            has_images,
            classify(has_text, has_images)
        );
    }

    Ok(())
}

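// Classify a page from its two flags. A scan with a hidden OCR text layer
// also contains text operators, so it is expected to match the "digital"
// arm; step 5 shows how to detect that case.
fn classify(has_text: bool, has_images: bool) -> &'static str {
    match (has_text, has_images) {
        (true, _)      => "digital",
        (false, true)  => "scanned",
        (false, false) => "blank or vector-only",
    }
}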

Install: cargo add pdfluent

Step by step

1

Add PDFluent to Cargo.toml

No additional features are required. Page inspection is part of the base crate.

toml
# Cargo.toml
[dependencies]
pdfluent = "0.9"
2

Open the document and iterate pages

Use doc.pages() to get an iterator over all pages. Each Page gives you access to content stream analysis.

rust
use pdfluent::PdfDocument;

let doc = PdfDocument::open("document.pdf")?;

for (i, page) in doc.pages().enumerate() {
    println!("Page {}: {:?}", i + 1, page.content_type());
}
3

Check for selectable text and raster images

has_selectable_text() returns true if the page's content stream contains any text operators. has_raster_images() returns true if the page contains image XObjects.

rust
for page in doc.pages() {
    let has_text   = page.has_selectable_text();
    let has_images = page.has_raster_images();

    if !has_text && has_images {
        println!("This page appears to be a scan.");
    }
}
4

Get a document-level scan score

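Compute the share of pages without selectable text. A ratio above 80% strongly suggests that most of the document is scanned; a lower, non-zero ratio points to a mix of scanned and digital pages. Blank and vector-only pages also count as having no text, so treat the ratio as a heuristic.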

rust
let total = doc.page_count();
let no_text = doc.pages()
    .filter(|p| !p.has_selectable_text())
    .count();

// Skip the ratio for an empty document to avoid dividing by zero (NaN).
if total > 0 {
    let scan_ratio = no_text as f32 / total as f32;
    println!("Scan ratio: {:.0}%", scan_ratio * 100.0);

    if scan_ratio > 0.8 {
        println!("Likely a scanned document. Consider running OCR.");
    }
}
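
The ratio in this step counts every page without text, including blank and vector-only pages. A minimal sketch of a stricter score, assuming the per-page flags have already been collected into a plain struct: it counts only pages that have an image and no text, and returns None for an empty document. PageFlags, DocumentKind, scan_ratio, and document_kind are hypothetical names for illustration and not part of PDFluent; only the 0.8 threshold comes from the step above.

```rust
/// Flags collected per page from `has_selectable_text()` and
/// `has_raster_images()`. Hypothetical helper type, not a PDFluent type.
#[derive(Debug, Clone, Copy)]
struct PageFlags {
    has_text: bool,
    has_images: bool,
}

/// Document-level verdict derived from the scan ratio (hypothetical).
#[derive(Debug, PartialEq)]
enum DocumentKind {
    Digital,
    Mixed,
    Scanned,
}

/// Fraction of pages that look like scans (image present, no text).
/// Returns `None` for an empty document instead of dividing by zero.
fn scan_ratio(pages: &[PageFlags]) -> Option<f32> {
    if pages.is_empty() {
        return None;
    }
    let scanned = pages
        .iter()
        .filter(|p| !p.has_text && p.has_images)
        .count();
    Some(scanned as f32 / pages.len() as f32)
}

/// Applies the 80% threshold; any smaller non-zero ratio counts as mixed.
fn document_kind(ratio: f32) -> DocumentKind {
    if ratio > 0.8 {
        DocumentKind::Scanned
    } else if ratio > 0.0 {
        DocumentKind::Mixed
    } else {
        DocumentKind::Digital
    }
}
```

Keeping the scoring separate from the PDF access makes the thresholds easy to unit test and to tune for a particular document collection.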
5

Check whether text is hidden (OCR layer)

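Some scanned PDFs carry a hidden text layer added by OCR software, which makes the page searchable while only the scanned image is visible. Because that layer consists of text operators, such pages can also report selectable text, so check this flag before treating a page as digital. Use has_invisible_text() to detect this.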

rust
for (i, page) in doc.pages().enumerate() {
    if page.has_invisible_text() {
        println!(
            "Page {} has an OCR text layer (invisible text).",
            i + 1
        );
    }
}
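
The three flags can be combined into one per-page verdict. The following is a sketch under stated assumptions, not PDFluent's own classification: PageKind and classify_page are hypothetical names, the inputs are the values returned by has_selectable_text(), has_invisible_text(), and has_raster_images(), and the rule "invisible text over an image means an OCR'd scan" is a heuristic that a digital page with hidden text and a picture would also match.

```rust
/// Per-page verdict. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum PageKind {
    Digital,
    ScannedWithOcrLayer,
    Scanned,
    BlankOrVectorOnly,
}

/// Combines the three page flags into a single verdict.
/// Arms are checked top to bottom, so the OCR-layer case wins
/// even when the page also reports selectable text.
fn classify_page(has_text: bool, has_invisible_text: bool, has_images: bool) -> PageKind {
    match (has_text, has_invisible_text, has_images) {
        // Invisible text on top of a page image: likely a scan that went through OCR.
        (_, true, true) => PageKind::ScannedWithOcrLayer,
        // Remaining pages with text are treated as digitally created.
        (true, _, _) => PageKind::Digital,
        // An image and no text of any kind: a plain scan.
        (false, false, true) => PageKind::Scanned,
        // No selectable text and no image.
        (false, _, false) => PageKind::BlankOrVectorOnly,
    }
}
```

Pages classified as ScannedWithOcrLayer already contain extractable text, so a pipeline might extract it directly and re-run OCR only when the existing layer is of poor quality.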

Notes and tips

  • A page with a background image and no text operators is the most common scan pattern. This check has a low false-positive rate.
  • PDFs created by scanning software like Adobe Scan often include a hidden OCR text layer. has_invisible_text() detects this.
  • Vector PDFs with no images and no text (diagrams, flowcharts) return false for both flags. Use has_vector_content() for those.
  • Text that is covered by a white rectangle may still be detected as selectable text. Pixel-level analysis requires rasterizing the page.
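
The "blank or vector-only" outcome can be split once the vector flag is known. A small sketch, assuming has_vector holds the result of has_vector_content() for a page that has no selectable text; TextlessPage and classify_textless are hypothetical names, not part of PDFluent.

```rust
/// Verdict for a page without selectable text. Hypothetical type.
#[derive(Debug, PartialEq)]
enum TextlessPage {
    /// Raster image present: candidate for OCR.
    Scanned,
    /// Drawings only, such as diagrams or flowcharts.
    VectorOnly,
    /// Nothing drawn on the page.
    Blank,
}

/// Refines a text-free page using the image and vector flags.
/// An image takes priority, because a scan may also contain stray vector marks.
fn classify_textless(has_images: bool, has_vector: bool) -> TextlessPage {
    if has_images {
        TextlessPage::Scanned
    } else if has_vector {
        TextlessPage::VectorOnly
    } else {
        TextlessPage::Blank
    }
}
```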

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model rules out buffer overflows and use-after-free in safe code, which removes the usual sources of segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions