Extract text page by page from a PDF in Rust

Read the text content of each page as a plain string or as structured spans with font and position data.

rust
use pdfluent::Document;

fn main() -> pdfluent::Result<()> {
    let doc = Document::open("input.pdf")?;

    for (i, page) in doc.pages().enumerate() {
        let text = page.extract_text()?;
        println!("=== Page {} ===", i + 1);
        println!("{}", text);
    }

    Ok(())
}

Install: cargo add pdfluent

Step by step

1. Open the document

Text extraction only requires read access.

rust
let doc = Document::open("input.pdf")?;

2. Extract plain text from a single page

extract_text() reconstructs reading order using glyph positions and returns a plain String. Words are separated by spaces; paragraphs by newlines.

rust
let page = doc.page(0)?;
let text = page.extract_text()?;
println!("{}", text);
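
Because the returned string separates words with spaces and paragraphs with newlines, ordinary string methods are enough for simple post-processing. The helpers below are a minimal sketch using only the standard library; `paragraphs` and `word_count` are illustrative names, not part of PDFluent.

```rust
/// Splits extracted text into paragraphs, assuming paragraphs are
/// separated by newlines. Blank lines and surrounding whitespace are dropped.
fn paragraphs(text: &str) -> Vec<&str> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .collect()
}

/// Counts whitespace-separated words in the extracted text.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}
```

Both functions take the `String` from `extract_text()` by reference, for example `paragraphs(&text)`.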

3. Extract structured spans

extract_spans() returns each text run as a TextSpan with font name, font size, and bounding Rect. Useful for layout analysis.

rust
for span in page.extract_spans()? {
    println!(
        "\"{}\" font={} size={:.1}pt at {:?}",
        span.text,
        span.font_name,
        span.font_size,
        span.rect,
    );
}
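
As one example of layout analysis, font size alone is often enough to pick out heading candidates. The sketch below is a heuristic, not a PDFluent feature: `Span` is a local stand-in carrying only the `text` and `font_size` fields shown above, and the 1.2 ratio is an arbitrary assumption to tune per document.

```rust
use std::collections::HashMap;

/// Local stand-in for a text span; only the fields this sketch needs.
struct Span {
    text: String,
    font_size: f32,
}

/// Returns the text of spans noticeably larger than the body text.
/// The body size is taken to be the most frequent font size.
fn heading_candidates(spans: &[Span]) -> Vec<&str> {
    // Count spans per font size, bucketed to tenths of a point.
    let mut counts: HashMap<i32, usize> = HashMap::new();
    for s in spans {
        *counts.entry((s.font_size * 10.0).round() as i32).or_insert(0) += 1;
    }
    // Most frequent bucket is treated as body text; ties go to the smaller size.
    let body = counts
        .iter()
        .max_by(|a, b| a.1.cmp(b.1).then(b.0.cmp(a.0)))
        .map(|(size, _)| *size as f32 / 10.0);
    match body {
        // Assumed threshold: 20% larger than the body size.
        Some(body) => spans
            .iter()
            .filter(|s| s.font_size > body * 1.2)
            .map(|s| s.text.as_str())
            .collect(),
        None => Vec::new(),
    }
}
```

To use it with real output, copy `span.text` and `span.font_size` from each extracted span into a `Span` first.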

4. Extract text from all pages

Iterate over all pages and collect text per page into a Vec.

rust
// A page whose extraction fails becomes an empty string; the error is discarded.
let pages_text: Vec<String> = doc
    .pages()
    .map(|p| p.extract_text().unwrap_or_default())
    .collect();
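
If a failed page should stop the run instead of silently producing an empty string, the per-page results can be collected into a single `Result`. This is a sketch of the standard-library pattern only; `collect_pages` is an illustrative name, and `E` stands in for whatever error type the extraction call returns.

```rust
/// Collects per-page results into one `Vec`, stopping at the first error.
fn collect_pages<E>(
    results: impl Iterator<Item = Result<String, E>>,
) -> Result<Vec<String>, E> {
    // `collect` on an iterator of `Result`s short-circuits on the first `Err`.
    results.collect()
}
```

Assuming `pdfluent::Result<T>` is an alias over the standard `Result`, as the `main` signature at the top suggests, passing `doc.pages().map(|p| p.extract_text())` to this function would return either every page's text or the first error.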

5. Write extracted text to files

Save each page as a separate text file, or join them with newlines for a single output file.

rust
for (i, text) in pages_text.iter().enumerate() {
    std::fs::write(format!("page_{:03}.txt", i + 1), text)?;
}
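
For the single-file variant, one possible sketch writes every page to any `std::io::Write` target, with a header line per page. The header format mirrors the first example on this page and is a choice, not a requirement; `write_combined` is an illustrative name.

```rust
use std::io::{self, Write};

/// Writes every page to `out`, each preceded by a "=== Page N ===" header.
fn write_combined<W: Write>(out: &mut W, pages: &[String]) -> io::Result<()> {
    for (i, text) in pages.iter().enumerate() {
        writeln!(out, "=== Page {} ===", i + 1)?;
        writeln!(out, "{}", text)?;
    }
    Ok(())
}
```

Passing a `std::fs::File` (ideally wrapped in a `std::io::BufWriter`) as `out` produces the combined file; passing a `Vec<u8>` keeps the result in memory.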

Notes and tips

  • Text extraction follows the PDF content stream order, which may differ from visual reading order in multi-column layouts. Use extract_spans() and sort by rect position for precise column order.
  • Characters with custom encoding or Type3 fonts may not map cleanly to Unicode. PDFluent uses ToUnicode maps where available.
  • Encrypted PDFs must be opened with Document::open_with_password before text extraction.
  • For scanned PDFs without a text layer, extract_text() returns an empty string. You need OCR for image-based documents.
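
The column-order tip in the first note can be sketched as a sort over span positions. Everything here is an assumption for illustration: `SpanBox` is a local stand-in rather than PDFluent's `TextSpan`, the page has exactly two columns split at a known x coordinate, and y follows the usual PDF convention of increasing toward the top of the page.

```rust
/// Local stand-in for a span: its text plus the left and top edge of its box.
#[derive(Debug, Clone, PartialEq)]
struct SpanBox {
    text: String,
    x: f32, // left edge
    y: f32, // top edge; larger y is assumed to be higher on the page
}

/// Orders spans for a two-column page: the left column top to bottom,
/// then the right column. `split_x` is the x coordinate of the gutter.
fn sort_two_columns(spans: &mut [SpanBox], split_x: f32) {
    spans.sort_by(|a, b| {
        let col_a = a.x >= split_x; // false = left column, true = right column
        let col_b = b.x >= split_x;
        col_a
            .cmp(&col_b)
            // Within a column, higher y comes first (top of the page).
            .then(b.y.total_cmp(&a.y))
            // On the same line, read left to right.
            .then(a.x.total_cmp(&b.x))
    });
}
```

With real spans, `x` and `y` would come from `span.rect`. This guide does not show the field names on `Rect`, so check the API reference before mapping them.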

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions