Does this work for multi-page PDFs?

Yes. The example iterates over all scanned pages, calls Textract per page, and accumulates results before writing the final PDF. For very large documents, use the async StartDocumentAnalysis API with an S3 source instead of sending individual pages inline.

Can I use Textract table detection to extract table data?

Yes. Call StartDocumentAnalysis with FeatureType::Tables. Textract returns TABLE, CELL, and WORD blocks with parent-child relationships. You can use the WORD blocks for the PDFluent text overlay and separately process the TABLE/CELL blocks to extract structured table data from the document.
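The parent-child structure described above can be sketched without the SDK. The `SimpleBlock` type, the `block` helper, and the sample data below are hypothetical stand-ins for Textract's real `Block` type, kept minimal so the example runs on its own. It assumes each CELL carries 1-based row and column indexes and lists its WORD children by id, and it ignores merged cells.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Kind {
    Table,
    Cell,
    Word,
}

/// Simplified stand-in for a Textract block (hypothetical, not the SDK type).
#[derive(Debug, Clone)]
struct SimpleBlock {
    id: String,
    kind: Kind,
    text: Option<String>,  // set on WORD blocks
    row: usize,            // 1-based row index, set on CELL blocks
    col: usize,            // 1-based column index, set on CELL blocks
    children: Vec<String>, // ids taken from the CHILD relationship
}

/// Small constructor to keep the sample data readable.
fn block(id: &str, kind: Kind, text: Option<&str>, row: usize, col: usize, children: &[&str]) -> SimpleBlock {
    SimpleBlock {
        id: id.to_string(),
        kind,
        text: text.map(str::to_string),
        row,
        col,
        children: children.iter().map(|c| c.to_string()).collect(),
    }
}

/// Build a row-major grid of cell text for one TABLE block.
fn table_to_grid(table: &SimpleBlock, blocks: &[SimpleBlock]) -> Vec<Vec<String>> {
    let by_id: HashMap<&str, &SimpleBlock> = blocks.iter().map(|b| (b.id.as_str(), b)).collect();

    // Resolve the table's children and keep only CELL blocks.
    let cells: Vec<&SimpleBlock> = table
        .children
        .iter()
        .filter_map(|id| by_id.get(id.as_str()).copied())
        .filter(|b| b.kind == Kind::Cell)
        .collect();

    let rows = cells.iter().map(|c| c.row).max().unwrap_or(0);
    let cols = cells.iter().map(|c| c.col).max().unwrap_or(0);
    let mut grid = vec![vec![String::new(); cols]; rows];

    for cell in cells {
        if cell.row == 0 || cell.col == 0 {
            continue; // skip cells without a usable position
        }
        // Join the cell's WORD children in the order they are listed.
        let words: Vec<&str> = cell
            .children
            .iter()
            .filter_map(|id| by_id.get(id.as_str()).copied())
            .filter(|b| b.kind == Kind::Word)
            .filter_map(|b| b.text.as_deref())
            .collect();
        grid[cell.row - 1][cell.col - 1] = words.join(" ");
    }
    grid
}

/// A 2x2 table made up for illustration.
fn sample_blocks() -> Vec<SimpleBlock> {
    vec![
        block("t1", Kind::Table, None, 0, 0, &["c1", "c2", "c3", "c4"]),
        block("c1", Kind::Cell, None, 1, 1, &["w1"]),
        block("c2", Kind::Cell, None, 1, 2, &["w2"]),
        block("c3", Kind::Cell, None, 2, 1, &["w3", "w4"]),
        block("c4", Kind::Cell, None, 2, 2, &["w5"]),
        block("w1", Kind::Word, Some("Item"), 0, 0, &[]),
        block("w2", Kind::Word, Some("Qty"), 0, 0, &[]),
        block("w3", Kind::Word, Some("Blue"), 0, 0, &[]),
        block("w4", Kind::Word, Some("widget"), 0, 0, &[]),
        block("w5", Kind::Word, Some("2"), 0, 0, &[]),
    ]
}

fn main() {
    let blocks = sample_blocks();
    for row in table_to_grid(&blocks[0], &blocks) {
        println!("{}", row.join(" | "));
    }
}
```

With the real SDK you would fill the same fields from each block's type, text, row and column indexes, and CHILD relationship ids.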

What IAM permissions does my role need?

For DetectDocumentText: textract:DetectDocumentText. For the async job API with S3: textract:StartDocumentAnalysis, textract:GetDocumentAnalysis, and s3:GetObject on the bucket containing your PDFs.

How do I run this on AWS Lambda?

Build the Rust binary as a Lambda handler and deploy it on a custom runtime. The Lambda execution role needs the Textract IAM permissions listed above. PDFluent has no external dependencies, so no layers are needed. Target arm64 for the best cost efficiency on Lambda.

Make scanned PDFs searchable with AWS Textract

Extract each page as an image with PDFluent, send it to AWS Textract for OCR, then write back an invisible text layer. Works for printed text, tables, and forms.

rust

use aws_sdk_textract::Client;
use pdfluent::{Sdk, ocr::{OcrLayerOptions, OcrWord}};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let sdk = Sdk::new()?;
    let doc = sdk.open("scanned_invoice.pdf")?;

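    // Resolve region and credentials from the environment (load_from_env is deprecated in aws-config 1.x)
    let config = aws_config::load_defaults(aws_config::BehaviorVersion::latest()).await;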
    let textract = Client::new(&config);

    let mut builder = doc.add_ocr_layer();

    for page in doc.pages().filter(|p| p.is_image_only()) {
        // Render page to PNG bytes at 300 DPI
        let png_bytes = doc.render_page_to_bytes(page.index(), 300)?;

        // Call Textract synchronous DetectDocumentText
        let resp = textract
            .detect_document_text()
            .document(
                aws_sdk_textract::types::Document::builder()
                    .bytes(aws_sdk_textract::primitives::Blob::new(png_bytes))
                    .build(),
            )
            .send()
            .await?;

        // Convert Textract blocks to PDFluent OcrWord list
        let words: Vec<OcrWord> = resp
            .blocks()
            .iter()
            .filter(|b| b.block_type() == Some(&aws_sdk_textract::types::BlockType::Word))
            .filter_map(|b| {
                let bbox = b.geometry()?.bounding_box()?;
                let text = b.text()?.to_string();
                Some(OcrWord {
                    text,
                    // Textract returns fractions of page width/height
                    left: bbox.left() as f64,
                    top: bbox.top() as f64,
                    width: bbox.width() as f64,
                    height: bbox.height() as f64,
                    confidence: b.confidence().map(|c| c as f64),
                })
            })
            .collect();

        builder.add_page_words(page.index(), words);
    }

    let opts = OcrLayerOptions::builder()
        .text_rendering_mode(pdfluent::ocr::TextRenderingMode::Invisible)
        .build();

    let searchable = builder.finish(opts)?;
    searchable.save("invoice_searchable.pdf")?;

    println!("Done.");
    Ok(())
}

Install: cargo add pdfluent

Step by step

Add dependencies

You need PDFluent, the AWS SDK for Rust, tokio for async, and anyhow for error handling.

toml

# Cargo.toml
[dependencies]
pdfluent = "0.9"
aws-config = { version = "1", features = ["behavior-version-latest"] }
aws-sdk-textract = "1"
tokio = { version = "1", features = ["full"] }
anyhow = "1"

Configure AWS credentials

Textract uses standard AWS credential resolution. Set environment variables or use an IAM role if running on EC2 or Lambda.

bash

export AWS_ACCESS_KEY_ID=your_key_id
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1

Open the PDF and identify scanned pages

PDFluent detects pages that have no text layer. Only those pages need OCR — pages with existing text are passed through unchanged.

rust

let sdk = Sdk::new()?;
let doc = sdk.open("scanned_invoice.pdf")?;

let scanned: Vec<u32> = doc.pages()
    .filter(|p| p.is_image_only())
    .map(|p| p.index())
    .collect();

println!("{} pages need OCR", scanned.len());

Render each page and call Textract DetectDocumentText

Render the page to PNG bytes at 300 DPI and send them inline to Textract. The synchronous DetectDocumentText call accepts a single image of up to 10 MB; for larger documents, use the asynchronous StartDocumentTextDetection API, which reads its input from S3.

rust

let png_bytes = doc.render_page_to_bytes(page_index, 300)?;

let resp = textract
    .detect_document_text()
    .document(
        aws_sdk_textract::types::Document::builder()
            .bytes(aws_sdk_textract::primitives::Blob::new(png_bytes))
            .build(),
    )
    .send()
    .await?;
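Before sending a page inline, it can help to estimate how large the rendered image will be. The sketch below is standalone arithmetic, not part of PDFluent or the AWS SDK: `pixel_size`, `pick_dpi`, and `raw_rgb_len` are hypothetical helpers, and the 10 MB constant is the limit quoted in this guide. It uses uncompressed RGB size as a deliberately pessimistic stand-in for the encoded PNG size, which is usually much smaller.

```rust
/// Synchronous size limit quoted in this guide (10 MB).
const SYNC_LIMIT_BYTES: usize = 10 * 1024 * 1024;

/// Pixel dimensions of a page rendered at `dpi`, given its size in PDF points (1/72 inch).
fn pixel_size(width_pt: f64, height_pt: f64, dpi: u32) -> (u32, u32) {
    let scale = dpi as f64 / 72.0;
    (
        (width_pt * scale).round() as u32,
        (height_pt * scale).round() as u32,
    )
}

/// Return the first candidate DPI whose estimated image size fits under `limit`.
/// `candidates` should be ordered from highest to lowest quality.
fn pick_dpi<F>(candidates: &[u32], limit: usize, mut estimated_len: F) -> Option<u32>
where
    F: FnMut(u32) -> usize,
{
    candidates
        .iter()
        .copied()
        .find(|&dpi| estimated_len(dpi) <= limit)
}

/// Pessimistic size estimate: 3 bytes per pixel, no compression (US Letter page).
fn raw_rgb_len(dpi: u32) -> usize {
    let (w, h) = pixel_size(612.0, 792.0, dpi);
    w as usize * h as usize * 3
}

fn main() {
    let (w, h) = pixel_size(612.0, 792.0, 300);
    println!("US Letter at 300 DPI: {w} x {h} px");

    match pick_dpi(&[300, 200, 150], SYNC_LIMIT_BYTES, raw_rgb_len) {
        Some(dpi) => println!("worst-case fit under the limit: {dpi} DPI"),
        None => println!("no candidate fits; use the S3-backed async API"),
    }
}
```

In practice you would replace `raw_rgb_len` with the actual length of the rendered bytes and only drop to a lower DPI when the real PNG exceeds the limit.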

Use the async job API for large documents

For multi-page PDFs over 10 MB, or when you want table and form field detection, use StartDocumentAnalysis. It processes the document asynchronously and returns a JobId you poll until the status is SUCCEEDED.

rust

// Start async job (supports TABLES and FORMS feature types)
let start_resp = textract
    .start_document_analysis()
    .document_location(
        aws_sdk_textract::types::DocumentLocation::builder()
            .s3_object(
                aws_sdk_textract::types::S3Object::builder()
                    .bucket("my-bucket")
                    .name("scanned_invoice.pdf")
                    .build(),
            )
            .build(),
    )
    .feature_types(aws_sdk_textract::types::FeatureType::Tables)
    .feature_types(aws_sdk_textract::types::FeatureType::Forms)
    .send()
    .await?;

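// Fail with a clear error instead of panicking if no JobId came back
let job_id = start_resp
    .job_id()
    .ok_or_else(|| anyhow::anyhow!("Textract did not return a JobId"))?;

// Poll until the job finishes; on success this yields the first page of results
let analysis = loop {
    let status_resp = textract
        .get_document_analysis()
        .job_id(job_id)
        .send()
        .await?;

    match status_resp.job_status() {
        Some(aws_sdk_textract::types::JobStatus::Succeeded) => break status_resp,
        Some(aws_sdk_textract::types::JobStatus::Failed) => anyhow::bail!(
            "Textract job failed: {}",
            status_resp.status_message().unwrap_or("no status message")
        ),
        // Still in progress: wait before asking again
        _ => tokio::time::sleep(std::time::Duration::from_secs(2)).await,
    }
};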
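A finished job's results can span several responses: GetDocumentAnalysis returns a NextToken whenever more blocks remain, and you pass it back on the next call. The sketch below shows that loop in isolation. `Page`, `collect_all`, and `fake_fetch` are hypothetical names, and the fetch is synchronous here so the example runs without AWS; with the SDK, the closure body would be an awaited `get_document_analysis()` call carrying the token.

```rust
/// One page of results plus an optional continuation token.
struct Page<T> {
    items: Vec<T>,
    next_token: Option<String>,
}

/// Call `fetch` repeatedly, feeding each returned token into the next call,
/// until a page arrives without a token.
fn collect_all<T, E, F>(mut fetch: F) -> Result<Vec<T>, E>
where
    F: FnMut(Option<&str>) -> Result<Page<T>, E>,
{
    let mut all = Vec::new();
    let mut token: Option<String> = None;
    loop {
        let page = fetch(token.as_deref())?;
        all.extend(page.items);
        match page.next_token {
            Some(next) => token = Some(next),
            None => return Ok(all),
        }
    }
}

/// Stand-in for the remote call: three pages linked by tokens "a" and "b".
fn fake_fetch(token: Option<&str>) -> Result<Page<u32>, String> {
    match token {
        None => Ok(Page { items: vec![1, 2], next_token: Some("a".to_string()) }),
        Some("a") => Ok(Page { items: vec![3], next_token: Some("b".to_string()) }),
        Some("b") => Ok(Page { items: vec![4, 5], next_token: None }),
        Some(other) => Err(format!("unknown token: {other}")),
    }
}

fn main() {
    let all = collect_all(fake_fetch).expect("pagination failed");
    println!("collected {} items", all.len());
}
```

Collecting every page before building the text layer matters for long documents, because WORD blocks for later pages only appear in later responses.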

Convert Textract WORD blocks to PDFluent OcrWord list

Textract returns bounding boxes as fractions of the page (0.0–1.0). PDFluent accepts this format directly in OcrWord. Filter for BlockType::Word to get word-level entries.

rust

let words: Vec<OcrWord> = resp
    .blocks()
    .iter()
    .filter(|b| b.block_type() == Some(&aws_sdk_textract::types::BlockType::Word))
    .filter_map(|b| {
        let bbox = b.geometry()?.bounding_box()?;
        Some(OcrWord {
            text: b.text()?.to_string(),
            left: bbox.left() as f64,
            top: bbox.top() as f64,
            width: bbox.width() as f64,
            height: bbox.height() as f64,
            confidence: b.confidence().map(|c| c as f64),
        })
    })
    .collect();

builder.add_page_words(page_index, words);
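To make the normalized format concrete, here is a rough sketch of what those fractions mean in PDF points. PDFluent does not need this conversion; `RectPt` and `to_pdf_points` are hypothetical names used only for illustration. The sketch assumes an unrotated page whose coordinate origin is the bottom-left corner, while Textract measures `top` from the top edge of the image.

```rust
/// A rectangle in PDF points, origin at the bottom-left of the page.
#[derive(Debug, PartialEq)]
struct RectPt {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Convert a normalized bounding box (fractions, top-left origin) to PDF points.
fn to_pdf_points(left: f64, top: f64, width: f64, height: f64, page_w: f64, page_h: f64) -> RectPt {
    RectPt {
        x: left * page_w,
        // Flip the vertical axis: the box's bottom edge sits at (top + height) from the top.
        y: (1.0 - top - height) * page_h,
        width: width * page_w,
        height: height * page_h,
    }
}

fn main() {
    // A word starting a quarter of the way across and halfway down a 600 x 800 pt page.
    let rect = to_pdf_points(0.25, 0.5, 0.5, 0.125, 600.0, 800.0);
    println!("{rect:?}");
}
```

Because the values are fractions, the same word box stays valid whether the page was rendered at 150 or 300 DPI.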

Finish and save the searchable PDF

Call builder.finish() with the layer options to produce a new PDF with the invisible text overlay applied to all processed pages.

rust

let opts = OcrLayerOptions::builder()
    .text_rendering_mode(pdfluent::ocr::TextRenderingMode::Invisible)
    .conform_to_pdfa2b(true) // optional: archive-safe output
    .build();

let searchable = builder.finish(opts)?;
searchable.save("invoice_searchable.pdf")?;

Notes and tips

Textract DetectDocumentText supports PNG and JPEG. PDFluent renders to PNG by default.
The synchronous API accepts documents up to 10 MB. For larger pages, reduce DPI (150 DPI is usually sufficient for printed text) or use the S3-backed async API.
Textract bounding boxes are normalized fractions of image width/height. PDFluent's add_page_words() expects the same format, so no coordinate conversion is needed.
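Textract charges per page. As of 2024: $0.0015 per page for DetectDocumentText and $0.015 per page for AnalyzeDocument with TABLES; FORMS is billed at a higher per-page rate. Rates vary by region and volume, so check the current Textract pricing page.
If you only need text extraction (not table structure), DetectDocumentText costs a tenth as much as AnalyzeDocument with TABLES.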
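The per-page figures above make it easy to estimate a batch before running it. The sketch below is plain arithmetic with hypothetical helper names; the two rates are the ones quoted in this guide and should be treated as assumptions, since actual pricing depends on region, volume, and feature type. Only pages that are actually sent to Textract count, so pass the number of scanned pages rather than the total page count.

```rust
/// Per-page rates quoted in this guide (as of 2024), stored in micro-dollars
/// so the arithmetic stays exact. Assumptions: verify against current pricing.
const DETECT_TEXT_MICRO_USD: u64 = 1_500; // $0.0015 per page
const ANALYZE_MICRO_USD: u64 = 15_000; // $0.015 per page

/// Estimated cost in micro-dollars for `scanned_pages` pages at `rate` per page.
fn estimate_micro_usd(scanned_pages: u64, rate: u64) -> u64 {
    scanned_pages * rate
}

/// Format micro-dollars as dollars and cents (truncated, not rounded).
fn format_usd(micro: u64) -> String {
    format!("${}.{:02}", micro / 1_000_000, (micro % 1_000_000) / 10_000)
}

fn main() {
    let pages = 10_000;
    let detect = estimate_micro_usd(pages, DETECT_TEXT_MICRO_USD);
    let analyze = estimate_micro_usd(pages, ANALYZE_MICRO_USD);
    println!("DetectDocumentText: {}", format_usd(detect));
    println!("AnalyzeDocument:    {}", format_usd(analyze));
}
```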

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions
