Why does my extracted text have garbled characters?

The PDF font has no ToUnicode CMap and uses a non-standard encoding. PDFluent falls back to glyph-name heuristics in this case. The result is best-effort. If accuracy is required, consider running the page through an OCR layer as a fallback.
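
One way to decide when the OCR fallback is worth running is to measure how much of the extracted text looks undecodable. The sketch below is not part of PDFluent: `is_suspicious`, `garbled_ratio`, and `needs_ocr_fallback` are hypothetical helpers, and the 10% threshold is an arbitrary assumption that should be tuned on real documents. It counts U+FFFD replacement characters, Private Use Area code points, and stray control codes.

```rust
/// Returns true for characters that commonly indicate a failed
/// glyph-to-Unicode mapping. Hypothetical heuristic, not a PDFluent API.
fn is_suspicious(c: char) -> bool {
    c == '\u{FFFD}'                               // Unicode replacement character
        || ('\u{E000}'..='\u{F8FF}').contains(&c) // Private Use Area (BMP)
        || (c.is_control() && !c.is_whitespace()) // stray control codes
}

/// Fraction of non-whitespace characters that look undecodable.
/// Returns 0.0 for empty or whitespace-only input.
fn garbled_ratio(text: &str) -> f64 {
    let mut total = 0usize;
    let mut bad = 0usize;
    for c in text.chars().filter(|c| !c.is_whitespace()) {
        total += 1;
        if is_suspicious(c) {
            bad += 1;
        }
    }
    if total == 0 {
        0.0
    } else {
        bad as f64 / total as f64
    }
}

/// Assumed threshold: more than 10% suspicious characters triggers the fallback.
fn needs_ocr_fallback(text: &str) -> bool {
    garbled_ratio(text) > 0.10
}
```

A caller would pass the string returned by text() to `needs_ocr_fallback` and route only the flagged pages to OCR.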

Can I extract text from a password-protected PDF?

Yes, after decrypting it. Pass the password through OpenOptions: PdfDocument::open_with("file.pdf", OpenOptions::new().with_password("...")). Then call text() normally.

Does PDFluent extract text from XFA forms?

text() returns static page content only. Form field values are exposed separately via doc.form_fields(), the read-side accessor. In 1.0 this covers AcroForms; full XFA field enumeration is tracked for a later release.

How does PDFluent handle rotated pages?

Page rotation is normalised before extraction. Text on a 90-degree rotated page is returned in the correct reading order, not in the raw stream order.
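
To illustrate what normalising rotation means geometrically, the sketch below maps a point from unrotated page space into the coordinate space of the page as displayed. It assumes the PDF convention that the page rotation is a clockwise multiple of 90 degrees, with a bottom-left origin. `to_display_space` is a hypothetical helper, not a PDFluent API, and it makes no claim about how PDFluent implements this internally.

```rust
/// Maps (x, y) on an unrotated page of size `width` x `height` (PDF points,
/// bottom-left origin) to the page as displayed after a clockwise rotation
/// of `rotate` degrees. Returns None unless `rotate` is a multiple of 90.
fn to_display_space(
    x: f64,
    y: f64,
    width: f64,
    height: f64,
    rotate: i32,
) -> Option<(f64, f64)> {
    // rem_euclid folds negative angles into 0..360, so -90 behaves like 270.
    match rotate.rem_euclid(360) {
        0 => Some((x, y)),
        90 => Some((y, width - x)),
        180 => Some((width - x, height - y)),
        270 => Some((height - y, x)),
        _ => None,
    }
}
```

For a 612 x 792 point page rotated by 90 degrees, the original bottom-left corner ends up at the top-left of the displayed page, which is 792 points wide and 612 points high.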

Extract text from a PDF in Rust

Read all text content from a PDF document. PDFluent preserves reading order and handles multi-column layouts, right-to-left scripts, and CID fonts.

rust

use pdfluent::prelude::*;

fn main() -> Result<()> {
    let doc = PdfDocument::open("document.pdf")?;

    for page in doc.pages() {
        let text = page.text()?;
        println!("--- Page {} ---", page.number());
        println!("{}", text);
    }
    Ok(())
}

Install: cargo add [email protected]

Step by step

Open the document

Load the PDF. Text extraction works page by page, so memory usage stays low even for large documents.

rust

use pdfluent::prelude::*;

let doc = PdfDocument::open("contract.pdf")?;

Extract text from a single page

Access a page by its 1-based index and call text(). The method returns a plain String with words separated by spaces and paragraphs separated by newlines.

rust

let page = doc.page(1)?;
let text = page.text()?;
println!("{}", text);
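
Because the returned String is plain text, ordinary string methods are enough for simple post-processing. The sketch below assumes the format described above (one paragraph per line, words separated by spaces). `paragraphs` and `word_count` are hypothetical helpers that operate on any `&str`, so they run without the SDK.

```rust
/// Splits extracted page text into non-empty paragraphs,
/// assuming one paragraph per line.
fn paragraphs(text: &str) -> Vec<&str> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect()
}

/// Counts whitespace-separated words across the whole page.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}
```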

Extract text from all pages

Iterate over doc.pages() to process every page. Each call to text() is independent.

rust

let full_text: String = doc
    .pages()
    .map(|p| p.text().unwrap_or_default())
    .collect::<Vec<_>>()
    .join("\n\n");
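
The snippet above substitutes an empty string for any page that fails to extract. If a failure should abort the whole run instead, the per-page results can be collected into a single `Result`. The sketch below is generic over any iterator of `Result<String, E>` so that it runs without the SDK; `join_pages` is a hypothetical helper, not part of PDFluent.

```rust
/// Joins per-page text with blank lines, stopping at the first error.
fn join_pages<I, E>(pages: I) -> Result<String, E>
where
    I: IntoIterator<Item = Result<String, E>>,
{
    // Collecting into Result<Vec<_>, E> short-circuits on the first Err.
    let texts = pages
        .into_iter()
        .collect::<Result<Vec<String>, E>>()?;
    Ok(texts.join("\n\n"))
}
```

Assuming text() returns a `Result<String, _>` as the examples in this guide suggest, the call would look like `join_pages(doc.pages().map(|p| p.text()))`.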

Extract text with layout positions

Use doc.text_with_layout() to get a Vec<TextBlock> at the document level. Each block carries the text, the page number, and the bounding box in PDF points (bottom-left origin).

rust

for block in doc.text_with_layout()? {
    println!(
        "[page {}] [{:.1},{:.1}] {:?}",
        block.page, block.x, block.y, block.text,
    );
}
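
Positions make it possible to select text by region, for example a header band at the top of a page. The sketch below uses a stand-in `TextBlock` struct: the field names follow this guide, but the field types are assumptions, and `Region` and `blocks_in_region` are hypothetical. It treats x and y as a single reference point for each block. Because the origin is bottom-left, a larger y is higher on the page.

```rust
/// Stand-in for the SDK's TextBlock. Field names follow the guide;
/// the types are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
struct TextBlock {
    page: usize,
    x: f64,
    y: f64,
    text: String,
}

/// Axis-aligned region in PDF points, bottom-left origin.
struct Region {
    x_min: f64,
    y_min: f64,
    x_max: f64,
    y_max: f64,
}

/// Returns the blocks on `page` whose (x, y) point lies inside `region`.
fn blocks_in_region<'a>(
    blocks: &'a [TextBlock],
    page: usize,
    region: &Region,
) -> Vec<&'a TextBlock> {
    blocks
        .iter()
        .filter(|b| b.page == page)
        .filter(|b| b.x >= region.x_min && b.x <= region.x_max)
        .filter(|b| b.y >= region.y_min && b.y <= region.y_max)
        .collect()
}
```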

Notes and tips

PDFluent decodes ToUnicode CMaps and Type1/TrueType encodings automatically.
Scanned PDFs with no embedded text return empty strings. Use an OCR step before extraction if needed.
Right-to-left text (Arabic, Hebrew) is returned in logical order, not visual order.
Ligatures and composed characters are decomposed to their Unicode equivalents where a mapping exists.
Page indexing is 1-based throughout the SDK (RFC 0001 §1).
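
To make the ligature note concrete, the sketch below expands a handful of Latin presentation-form ligatures into their component letters. It only illustrates the idea: the mapping table is a small hand-written subset, `expand_latin_ligatures` is a hypothetical helper, and PDFluent's own mapping may differ.

```rust
/// Expands the Latin ligatures U+FB00..=U+FB04 and U+FB06 into plain letters.
/// Every other character is passed through unchanged.
fn expand_latin_ligatures(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            '\u{FB06}' => out.push_str("st"),
            other => out.push(other),
        }
    }
    out
}
```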

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions
