Built-in Rust OCR engine or connect to Google Cloud Vision, AWS Textract, or Azure AI — your choice. PDFluent handles the PDF side either way.
use pdfluent::{Sdk, ocr::{OcrLayerOptions, HocrResult}};
use std::process::Command;
fn main() -> pdfluent::Result<()> {
let sdk = Sdk::new()?;
let doc = sdk.open("scanned_contract.pdf")?;
// Detect which pages are image-only (no text layer)
let scanned_pages: Vec<u32> = doc.pages()
.filter(|p| p.is_image_only())
.map(|p| p.index())
.collect();
println!("{} of {} pages are scanned", scanned_pages.len(), doc.page_count());
let mut builder = doc.add_ocr_layer();
for page_index in &scanned_pages {
// Extract the page as a 300 DPI PNG for OCR
let img = doc.render_page(*page_index, Default::default())?;
img.save(format!("/tmp/page_{}.png", page_index))?;
// Run Tesseract and get hOCR output
Command::new("tesseract")
.args([
&format!("/tmp/page_{}.png", page_index),
&format!("/tmp/page_{}", page_index),
"-l", "eng", "hocr",
])
.status()?;
let hocr = std::fs::read_to_string(
format!("/tmp/page_{}.hocr", page_index)
)?;
// Write invisible text overlay back into the page
let result = HocrResult::parse(&hocr)?;
builder.add_page(*page_index, result);
}
let opts = OcrLayerOptions::builder()
.text_rendering_mode(pdfluent::ocr::TextRenderingMode::Invisible)
.conform_to_pdfa2b(true)
.build();
let searchable = builder.finish(opts)?;
searchable.save("scanned_contract_searchable.pdf")?;
println!("Saved searchable PDF with {} OCR pages", scanned_pages.len());
Ok(())
}Run cargo add pdfluent to get started.
PDFluent ships with the ocrs engine: a pure-Rust OCR implementation that runs fully offline with no external process. It adds roughly 4 MB to the binary and is WASM-compatible. Good for on-premise deployments, air-gapped environments, and browser-side OCR.
Swap the engine without changing your PDFluent code. Adapter crates are available for Google Cloud Vision, AWS Textract, and Azure AI Document Intelligence. Each implements the OcrEngine trait — one API, different accuracy and cost tradeoffs per adapter.
The built-in ocrs engine compiles to WASM. Run OCR in the browser, in Cloudflare Workers, or on embedded targets without a server round-trip or an external OCR binary dependency.
Identify pages that contain no text layer and consist entirely of raster images. The is_image_only() check examines the page content stream for text operators, skipping pages that already have a searchable layer to avoid redundant OCR work.
Render any page to a PNG or TIFF at a configurable DPI for input to an OCR engine. 300 DPI is the standard for OCR input quality. The render call is stateless and can be parallelized across pages using Rayon.
Write OCR results back as an invisible text layer (text rendering mode 3) precisely positioned over each word bounding box. The text is not visible when printing or viewing but is fully selectable, searchable, and copyable.
Parse hOCR output from Tesseract (and compatible engines) and ALTO XML output from ABBYY, Tesseract, and Transkribus. Both formats provide word-level bounding boxes that PDFluent maps to PDF coordinate space for the text overlay.
After adding the OCR text layer, optionally convert the output to PDF/A-2b for long-term archival. This is the required output format for scanned document workflows in many government and legal contexts.