Extract text from a PDF in Rust

Read all text content from a PDF document. PDFluent handles right-to-left scripts and CID fonts, and can recover reading order in multi-column layouts when layout analysis is enabled.

rust
use pdfluent::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = PdfDocument::open("document.pdf")?;

    for (i, page) in doc.pages().enumerate() {
        let text = page.extract_text()?;
        println!("--- Page {} ---", i + 1);
        println!("{}", text);
    }
    Ok(())
}
Install: cargo add pdfluent

Step by step

1. Open the document

Load the PDF. Text extraction works page by page, so memory usage stays low even for large documents.

rust
use pdfluent::PdfDocument;

let doc = PdfDocument::open("contract.pdf")?;

2. Extract text from a single page

Access a page by its 0-based index and call extract_text(). The method returns a plain String with words separated by spaces and paragraphs separated by newlines.

rust
let page = doc.page(0)?;
let text = page.extract_text()?;
println!("{}", text);
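
Because the returned String uses spaces between words and newlines between paragraphs, it can be post-processed with the standard library alone. The helpers below are a minimal sketch under that assumption; `paragraphs` and `word_count` are hypothetical names and are not part of PDFluent.

```rust
/// Splits extracted page text into non-empty paragraphs.
/// Assumes paragraphs are separated by newlines, as described above.
fn paragraphs(text: &str) -> Vec<&str> {
    text.lines()
        .map(str::trim)
        .filter(|p| !p.is_empty())
        .collect()
}

/// Counts whitespace-separated words in a block of extracted text.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}
```

Passing the result of extract_text() to `paragraphs(&text)` would give one slice per paragraph without copying the underlying string.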
3

Extract text from all pages

Iterate over doc.pages() to process every page. Each call to extract_text() is independent.

rust
let full_text: String = doc
    .pages()
    // A page that fails to extract contributes an empty string.
    .map(|p| p.extract_text().unwrap_or_default())
    .collect::<Vec<_>>()
    .join("\n\n");
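
Replacing a failed page with an empty string hides extraction errors. If a failure should be reported instead, one option is to collect the per-page results into a single Result. This is a sketch using only the standard library; `join_pages` is a hypothetical helper, and it assumes extract_text() returns a Result<String, _> as the snippets here suggest.

```rust
/// Joins per-page extraction results with a blank line between pages.
/// Returns the first error encountered instead of substituting an empty string.
fn join_pages<I, E>(pages: I) -> Result<String, E>
where
    I: IntoIterator<Item = Result<String, E>>,
{
    // Collecting an iterator of Results into Result<Vec<_>, _> stops at the first Err.
    let texts: Vec<String> = pages.into_iter().collect::<Result<_, _>>()?;
    Ok(texts.join("\n\n"))
}
```

Under that assumption, the call site would look like `join_pages(doc.pages().map(|p| p.extract_text()))?`.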

4. Control reading order

The default extraction follows the PDF content stream order. For multi-column documents, use TextExtractionOptions to enable layout analysis.

rust
use pdfluent::{TextExtractionOptions, ReadingOrder};

let opts = TextExtractionOptions::default()
    .reading_order(ReadingOrder::LayoutAnalysis);

let text = doc.page(0)?.extract_text_with_options(&opts)?;
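
To see why content stream order and reading order can differ, consider a two-column page whose fragments were written to the stream out of sequence. The sketch below is a deliberately simple illustration of column-aware ordering, not PDFluent's layout analysis algorithm; the `Fragment` type, the `two_column_order` function, and the fixed gutter position `split_x` are all hypothetical.

```rust
/// A positioned text fragment. Hypothetical type, for illustration only.
#[derive(Debug, Clone)]
struct Fragment {
    x: f32, // left edge in PDF points
    y: f32, // baseline; PDF y grows upward, so a larger y is higher on the page
    text: String,
}

/// Orders fragments for a two-column page: the left column top to bottom,
/// then the right column top to bottom. `split_x` is the gutter position.
fn two_column_order(mut fragments: Vec<Fragment>, split_x: f32) -> Vec<String> {
    fragments.sort_by(|a, b| {
        let col_a = a.x >= split_x; // false = left column, true = right column
        let col_b = b.x >= split_x;
        col_a
            .cmp(&col_b)
            // Within a column, the fragment with the larger y comes first.
            .then(b.y.total_cmp(&a.y))
    });
    fragments.into_iter().map(|f| f.text).collect()
}
```

Real layout analysis also has to detect the column boundaries itself and cope with headings that span columns, which is why it is offered as a separate reading-order mode rather than a plain sort.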

5. Extract text as structured lines

Use extract_lines() to get a Vec<TextLine> where each entry contains the string and its approximate vertical position.

rust
for line in doc.page(0)?.extract_lines()? {
    println!("[y={:.1}] {}", line.y, line.text);
}
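
The vertical position makes it possible to regroup lines, for example into paragraphs separated by larger gaps. The following is a sketch only: the local `TextLine` struct is a stand-in that mirrors the `y` and `text` fields used above (their exact types in PDFluent are an assumption), and `group_paragraphs` and `max_gap` are hypothetical names.

```rust
/// Stand-in for the line type described above: text plus vertical position.
/// The field types are assumed, not taken from the PDFluent API.
struct TextLine {
    y: f32,
    text: String,
}

/// Merges consecutive lines into paragraphs. A new paragraph starts whenever
/// the vertical distance to the previous line exceeds `max_gap`.
fn group_paragraphs(lines: &[TextLine], max_gap: f32) -> Vec<String> {
    let mut paragraphs: Vec<String> = Vec::new();
    let mut prev_y: Option<f32> = None;

    for line in lines {
        let starts_new = match prev_y {
            Some(prev) => (prev - line.y).abs() > max_gap,
            None => true, // the first line always opens a paragraph
        };
        if starts_new {
            paragraphs.push(line.text.clone());
        } else if let Some(last) = paragraphs.last_mut() {
            last.push(' ');
            last.push_str(&line.text);
        }
        prev_y = Some(line.y);
    }
    paragraphs
}
```

A sensible `max_gap` depends on the document's font size and leading, so it is better derived from the typical line spacing on the page than hard-coded.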

Notes and tips

  • PDFluent decodes ToUnicode CMaps and Type1/TrueType encodings automatically.
  • Scanned PDFs with no embedded text return empty strings. Use an OCR step before extraction if needed.
  • Right-to-left text (Arabic, Hebrew) is returned in logical order, not visual order.
  • Ligatures and composed characters are decomposed to their Unicode equivalents where a mapping exists.
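
Since scanned pages come back as empty strings, a quick check over the extracted text can flag which pages need OCR. This is a minimal sketch using only the standard library; `pages_needing_ocr` is a hypothetical helper, and treating whitespace-only text as empty is an assumption.

```rust
/// Returns the 1-based numbers of pages whose extracted text is empty
/// or contains only whitespace, i.e. likely scanned pages.
fn pages_needing_ocr<S: AsRef<str>>(page_texts: &[S]) -> Vec<usize> {
    page_texts
        .iter()
        .enumerate()
        .filter(|(_, text)| text.as_ref().trim().is_empty())
        .map(|(index, _)| index + 1)
        .collect()
}
```

The returned page numbers can then be handed to whichever OCR step the pipeline uses before extraction is retried.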

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions