Make scanned PDFs searchable.

Use the built-in Rust OCR engine, or connect to Google Cloud Vision, AWS Textract, or Azure AI Document Intelligence. PDFluent handles the PDF side either way.

Code example

rust
use pdfluent::{Sdk, ocr::{OcrLayerOptions, HocrResult}};
use std::process::Command;

fn main() -> pdfluent::Result<()> {
    let sdk = Sdk::new()?;
    let doc = sdk.open("scanned_contract.pdf")?;

    // Detect which pages are image-only (no text layer)
    let scanned_pages: Vec<u32> = doc.pages()
        .filter(|p| p.is_image_only())
        .map(|p| p.index())
        .collect();

    println!("{} of {} pages are scanned", scanned_pages.len(), doc.page_count());

    let mut builder = doc.add_ocr_layer();

    for page_index in &scanned_pages {
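        // Render the page to a PNG for OCR (default render options here;
        // 300 DPI is the usual target for OCR input)
        let img = doc.render_page(*page_index, Default::default())?;
        img.save(format!("/tmp/page_{}.png", page_index))?;

        // Run Tesseract and get hOCR output. The paths are passed with
        // .arg() because one array cannot mix &String and &str elements.
        let status = Command::new("tesseract")
            .arg(format!("/tmp/page_{}.png", page_index))
            .arg(format!("/tmp/page_{}", page_index))
            .args(["-l", "eng", "hocr"])
            .status()?;

        // Stop on a failed run instead of reading a missing or stale .hocr file
        if !status.success() {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                format!("tesseract failed on page {}", page_index),
            ).into());
        }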

        let hocr = std::fs::read_to_string(
            format!("/tmp/page_{}.hocr", page_index)
        )?;

        // Write invisible text overlay back into the page
        let result = HocrResult::parse(&hocr)?;
        builder.add_page(*page_index, result);
    }

    let opts = OcrLayerOptions::builder()
        .text_rendering_mode(pdfluent::ocr::TextRenderingMode::Invisible)
        .conform_to_pdfa2b(true)
        .build();

    let searchable = builder.finish(opts)?;
    searchable.save("scanned_contract_searchable.pdf")?;

    println!("Saved searchable PDF with {} OCR pages", scanned_pages.len());
    Ok(())
}

Run cargo add pdfluent to get started.

What it does

Built-in OCR engine (ocrs)

PDFluent ships with the ocrs engine: a pure-Rust OCR implementation that runs fully offline with no external process. It adds roughly 4 MB to the binary and is WASM-compatible. Good for on-premise deployments, air-gapped environments, and browser-side OCR.

Cloud adapter interface

Swap the engine without changing your PDFluent code. Adapter crates are available for Google Cloud Vision, AWS Textract, and Azure AI Document Intelligence. Each implements the OcrEngine trait, so the API stays the same while accuracy and cost tradeoffs vary per adapter.
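As a rough illustration of why a shared trait makes engines interchangeable, here is a minimal sketch in plain Rust. The `Engine` trait, the `Word` type, and both stub engines are hypothetical and written only for this example; PDFluent's actual OcrEngine trait may have a different shape.

```rust
/// One recognized word with its pixel-space bounding box (x0, y0, x1, y1).
#[derive(Debug, Clone, PartialEq)]
struct Word {
    text: String,
    bbox: (u32, u32, u32, u32),
}

/// Hypothetical engine abstraction, for illustration only.
trait Engine {
    fn name(&self) -> &str;
    fn recognize(&self, image: &[u8]) -> Result<Vec<Word>, String>;
}

/// Stand-in for a local engine: always returns one fixed word.
struct LocalStub;

impl Engine for LocalStub {
    fn name(&self) -> &str {
        "local-stub"
    }
    fn recognize(&self, _image: &[u8]) -> Result<Vec<Word>, String> {
        Ok(vec![Word { text: "Invoice".to_string(), bbox: (10, 20, 110, 50) }])
    }
}

/// Stand-in for a remote engine that is currently unreachable.
struct RemoteStub;

impl Engine for RemoteStub {
    fn name(&self) -> &str {
        "remote-stub"
    }
    fn recognize(&self, _image: &[u8]) -> Result<Vec<Word>, String> {
        Err("network unavailable".to_string())
    }
}

/// Try each engine in order and return the first successful result,
/// together with the name of the engine that produced it.
fn recognize_with_fallback(
    engines: &[&dyn Engine],
    image: &[u8],
) -> Result<(String, Vec<Word>), String> {
    let mut last_err = "no engines configured".to_string();
    for engine in engines {
        match engine.recognize(image) {
            Ok(words) => return Ok((engine.name().to_string(), words)),
            Err(e) => last_err = format!("{}: {}", engine.name(), e),
        }
    }
    Err(last_err)
}
```

Because the calling code depends only on the trait, a cloud-first setup with a local fallback is just a different ordering of the `engines` slice.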

WASM-compatible offline OCR

The built-in ocrs engine compiles to WASM. Run OCR in the browser, in Cloudflare Workers, or on embedded targets without a server round-trip or an external OCR binary dependency.

Scanned page detection

Identify pages that contain no text layer and consist entirely of raster images. The is_image_only() check examines the page content stream for text operators, so pages that already have a searchable layer can be skipped and no OCR work is repeated.
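To make the idea of scanning a content stream for text operators concrete, here is a simplified sketch. It is not PDFluent's implementation: the function name is made up for this example, the stream is assumed to be already decompressed, and inline images (BI ... EI) and Form XObjects are not handled.

```rust
/// Returns true if a decoded PDF content stream contains a text-showing
/// operator (Tj, TJ, ' or "). Simplified tokenizer, for illustration only.
fn has_text_operators(content: &[u8]) -> bool {
    let mut i = 0;
    while i < content.len() {
        match content[i] {
            // Literal string: skip to the matching ')' honoring escapes and nesting.
            b'(' => {
                let mut depth = 1;
                i += 1;
                while i < content.len() && depth > 0 {
                    match content[i] {
                        b'\\' => i += 1, // skip the escaped byte
                        b'(' => depth += 1,
                        b')' => depth -= 1,
                        _ => {}
                    }
                    i += 1;
                }
            }
            // Comment: skip to the end of the line.
            b'%' => {
                while i < content.len() && content[i] != b'\n' && content[i] != b'\r' {
                    i += 1;
                }
            }
            // Name object such as /F1 or /Im0: skip it so it is not read as an operator.
            b'/' => {
                i += 1;
                while i < content.len()
                    && !content[i].is_ascii_whitespace()
                    && !b"()<>[]{}/%".contains(&content[i])
                {
                    i += 1;
                }
            }
            // Operator: a run of letters, plus the ', " and * characters.
            c if c.is_ascii_alphabetic() || c == b'\'' || c == b'"' => {
                let start = i;
                while i < content.len()
                    && (content[i].is_ascii_alphabetic()
                        || matches!(content[i], b'\'' | b'"' | b'*'))
                {
                    i += 1;
                }
                if matches!(&content[start..i], b"Tj" | b"TJ" | b"'" | b"\"") {
                    return true;
                }
            }
            _ => i += 1,
        }
    }
    false
}
```

A page whose stream only places an image (for example `q 612 0 0 792 0 0 cm /Im0 Do Q`) yields `false`, which is the signal that the page is a candidate for OCR.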

Page image extraction

Render any page to a PNG or TIFF at a configurable DPI for input to an OCR engine. 300 DPI is the resolution commonly recommended for OCR input. The render call is stateless and can be parallelized across pages using Rayon.
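The DPI setting determines the raster size, which matters when many pages are rendered in parallel. The arithmetic below is a small sketch that assumes the default PDF user space of 72 units per inch; the function names are illustrative and not part of PDFluent.

```rust
/// Convert a length in PDF points (1/72 inch) to pixels at the given DPI.
fn pixels_at_dpi(points: f64, dpi: f64) -> u32 {
    (points * dpi / 72.0).round() as u32
}

/// Pixel dimensions of a page rendered at the given DPI.
fn raster_size(width_pt: f64, height_pt: f64, dpi: f64) -> (u32, u32) {
    (pixels_at_dpi(width_pt, dpi), pixels_at_dpi(height_pt, dpi))
}

/// Size in bytes of an uncompressed 8-bit RGB buffer of that raster.
fn rgb_buffer_bytes(width_px: u32, height_px: u32) -> u64 {
    u64::from(width_px) * u64::from(height_px) * 3
}
```

A US Letter page (612 x 792 pt) at 300 DPI comes out at 2550 x 3300 pixels, or about 25 MB as uncompressed RGB, so the number of pages rendered concurrently is worth bounding.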

Invisible text layer overlay

Write OCR results back as an invisible text layer (text rendering mode 3) precisely positioned over each word bounding box. The text is not visible when printing or viewing but is fully selectable, searchable, and copyable.
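In PDF content streams, the rendering mode is set with the `Tr` operator, and mode 3 means the glyphs are neither filled nor stroked. The sketch below builds the operators for one word by hand. It is illustrative only: the helper names are made up, the font resource name is assumed to exist in the page's /Font dictionary, and the text is assumed to be ASCII encodable by that font.

```rust
/// Escape the characters that are special inside a PDF literal string.
fn escape_pdf_string(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if matches!(ch, '(' | ')' | '\\') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Build content-stream operators that place one word invisibly.
/// `font` is a resource name such as "F1" (assumed to be defined on the page).
fn invisible_word(font: &str, size_pt: f64, x_pt: f64, y_pt: f64, word: &str) -> String {
    format!(
        // Tf selects the font, "3 Tr" selects invisible rendering,
        // Tm positions the baseline origin, Tj shows the string.
        "BT /{} {} Tf 3 Tr 1 0 0 1 {} {} Tm ({}) Tj ET",
        font,
        size_pt,
        x_pt,
        y_pt,
        escape_pdf_string(word)
    )
}
```

A production overlay would also scale each word horizontally to match its bounding box width, which this sketch leaves out.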

hOCR and ALTO XML ingestion

Parse hOCR output from Tesseract (and compatible engines) and ALTO XML output from ABBYY, Tesseract, and Transkribus. Both formats provide word-level bounding boxes that PDFluent maps to PDF coordinate space for the text overlay.
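The coordinate mapping can be sketched as follows. hOCR bounding boxes are pixel coordinates with the origin at the top left of the image, while PDF user space has its origin at the bottom left and is measured in points. The types and function names here are illustrative, not PDFluent's, and the sketch assumes an unrotated page whose image covers the full page.

```rust
/// Pixel-space box as written by hOCR: origin top-left, y grows downward.
#[derive(Debug, PartialEq)]
struct PixelBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

/// Extract the four numbers after "bbox" from an hOCR title attribute value,
/// e.g. "bbox 300 600 900 675; x_wconf 96".
fn parse_bbox(title: &str) -> Option<PixelBox> {
    let prop = title
        .split(';')
        .map(str::trim)
        .find(|p| p.starts_with("bbox "))?;
    let mut nums = prop
        .split_whitespace()
        .skip(1)
        .map(|n| n.parse::<f64>().ok());
    Some(PixelBox {
        x0: nums.next()??,
        y0: nums.next()??,
        x1: nums.next()??,
        y1: nums.next()??,
    })
}

/// Convert to PDF points with a bottom-left origin.
/// Returns (lower-left x, lower-left y, upper-right x, upper-right y).
fn to_pdf_rect(b: &PixelBox, dpi: f64, page_height_pt: f64) -> (f64, f64, f64, f64) {
    let pt = |px: f64| px * 72.0 / dpi;
    (
        pt(b.x0),
        page_height_pt - pt(b.y1), // bottom edge of the box in image space
        pt(b.x1),
        page_height_pt - pt(b.y0), // top edge of the box in image space
    )
}
```

The y axis flip is the step most often missed: the bottom of the word in image space (`y1`) becomes the lower edge of the PDF rectangle.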

PDF/A-2b output after OCR

After adding the OCR text layer, optionally convert the output to PDF/A-2b for long-term archival. Many government and legal archiving workflows require PDF/A output for scanned documents.

Deployment options

Server-side (Linux/macOS/Windows)
AWS Lambda (built-in engine or Textract adapter)
WASM / browser (built-in engine)
Docker
Kubernetes
Air-gapped / on-premise (built-in engine)
Self-hosted pipeline

Frequently asked questions