Does PDFluent extract tables from scanned PDFs?

No. Scanned PDFs contain page images, not text. Run the file through an OCR tool first to produce a PDF with a text layer, then call extract_tables() on the result.

What if a table spans multiple pages?

extract_tables() operates on one page at a time, and PDFluent does not automatically stitch cross-page tables. For a table that continues onto the next page, extract the tables from both pages, then append the rows from the top of the second page to the rows from the bottom of the first in your own code.
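As a rough sketch of that merge step, the function below concatenates per-page row lists and drops a header row that is repeated at the top of a continuation page. `Row` and `stitch_pages` are hypothetical names introduced for illustration, not part of PDFluent; rows are modelled as plain strings, which you would build from each cell's text.

```rust
/// One table row as plain strings. ASSUMPTION: with PDFluent you would fill
/// this from each cell's `text`; a plain alias keeps the sketch self-contained.
type Row = Vec<String>;

/// Concatenates per-page row lists into one table (hypothetical helper).
/// If a later page starts with the same row as the table's first row
/// (a repeated header), that row is skipped.
fn stitch_pages(pages: &[Vec<Row>]) -> Vec<Row> {
    let mut merged: Vec<Row> = Vec::new();
    // Treat the first row of the first non-empty page as the header.
    let header: Option<Row> = pages.iter().find_map(|p| p.first().cloned());

    for (i, page_rows) in pages.iter().enumerate() {
        let repeats_header =
            i > 0 && !merged.is_empty() && page_rows.first() == header.as_ref();
        let skip = if repeats_header { 1 } else { 0 };
        merged.extend(page_rows.iter().skip(skip).cloned());
    }
    merged
}
```

Comparing against the header is only a heuristic. A table whose first data row happens to equal the header would lose that row, so check the result on your own documents.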

How accurate is table detection for borderless tables?

Borderless tables rely on whitespace analysis. Accuracy depends on consistent column alignment. Results are usually good for well-formatted financial tables, but may miss columns in loosely spaced text.
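To see why alignment matters, here is a generic illustration of whitespace-gap analysis. It is not PDFluent's actual detector, and `Span` and `column_gaps` are hypothetical names. Given the horizontal extents of the text fragments on several lines, it reports the x-ranges that no fragment covers.

```rust
/// Horizontal extent of one text fragment in PDF points: (left, right).
type Span = (f32, f32);

/// Returns the x-ranges that no fragment covers and that are at least
/// `min_gap` wide. Each such range is a candidate column separator.
/// Illustrative only; not the algorithm PDFluent uses internally.
fn column_gaps(spans: &[Span], min_gap: f32) -> Vec<Span> {
    let mut sorted: Vec<Span> = spans.to_vec();
    sorted.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut gaps = Vec::new();
    // Right edge of the region covered so far.
    let mut covered_to = match sorted.first() {
        Some(first) => first.1,
        None => return gaps,
    };
    for &(left, right) in &sorted[1..] {
        if left - covered_to >= min_gap {
            gaps.push((covered_to, left));
        }
        covered_to = covered_to.max(right);
    }
    gaps
}
```

A single fragment that runs across a gap, such as a long label in one row, closes that gap for the whole table. This is the kind of loosely spaced text where a column can be missed.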

Can I get the table bounding box?

Yes. Table.bounds returns a Rect with x, y, width, and height in PDF points. This is useful for visual debugging or for mapping table regions to rendered images.
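For mapping a table region onto a rendered page image, a conversion along these lines may help. `Rect` here is a local stand-in with the four fields named above; `PixelRect` and `to_pixels` are hypothetical. The sketch assumes `y` is the rectangle's bottom edge measured from the bottom of the page, as in PDF user space; if PDFluent reports a top-left origin, drop the flip.

```rust
/// Local stand-in for PDFluent's `Rect` (x, y, width, height in PDF points).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect { x: f32, y: f32, width: f32, height: f32 }

/// Pixel rectangle with a top-left origin, as most image libraries use.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PixelRect { left: u32, top: u32, width: u32, height: u32 }

/// Converts a rectangle in PDF points to pixels at the given render DPI.
/// ASSUMPTION: `rect.y` is the bottom edge, measured from the page bottom.
fn to_pixels(rect: Rect, page_height_pt: f32, dpi: f32) -> PixelRect {
    let scale = dpi / 72.0; // one PDF point is 1/72 inch
    // Flip the vertical axis: distance from the page top to the rect's top edge.
    let top_pt = page_height_pt - (rect.y + rect.height);
    PixelRect {
        left: (rect.x * scale).round() as u32,
        top: (top_pt * scale).round() as u32,
        width: (rect.width * scale).round() as u32,
        height: (rect.height * scale).round() as u32,
    }
}
```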


Extract table data from PDFs in Rust

Detect and extract structured table data from PDF pages. Get rows and cells as Rust values without writing custom parsing logic.

Not in the current release. This capability is not part of the published PDFluent SDK; the example below does not compile against the current release. See the changelog for what ships today.

rust

// Planned 1.1 API — not available in pdfluent 1.0.
// For 1.0, use `page.text()` and parse the result manually.
use pdfluent::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = PdfDocument::open("report.pdf")?;
    let page = doc.page(1)?;

    for table in page.extract_tables()? {
        for row in &table.rows {
            let cells: Vec<&str> = row.iter()
                .map(|c| c.text.as_str())
                .collect();
            println!("{}", cells.join(" | "));
        }
    }
    Ok(())
}

Install: cargo add [email protected]
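The comment in the example above suggests parsing the output of `page.text()` by hand on 1.0. One possible approach, assuming the extracted text keeps runs of spaces between columns, is to split each line on two or more consecutive spaces. `split_columns` and `parse_text_table` are hypothetical helpers that work on a plain `&str`, so they do not depend on any PDFluent type.

```rust
/// Splits one line of text into cells wherever two or more consecutive
/// spaces occur. A single space stays inside the cell.
fn split_columns(line: &str) -> Vec<String> {
    let mut cells = Vec::new();
    let mut current = String::new();
    let mut spaces = 0;
    for ch in line.trim().chars() {
        if ch == ' ' {
            spaces += 1;
            continue;
        }
        if spaces >= 2 && !current.is_empty() {
            // A wide gap ends the current cell.
            cells.push(std::mem::take(&mut current));
        } else if spaces == 1 {
            // A single space is ordinary word spacing.
            current.push(' ');
        }
        spaces = 0;
        current.push(ch);
    }
    if !current.is_empty() {
        cells.push(current);
    }
    cells
}

/// Applies `split_columns` to every non-empty line of a text block.
fn parse_text_table(text: &str) -> Vec<Vec<String>> {
    text.lines()
        .map(split_columns)
        .filter(|cells| !cells.is_empty())
        .collect()
}
```

This only works when columns are separated by at least two spaces and cell text never contains a double space. Tabs and empty cells are not handled.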

Step by step

Open the PDF and access the page

Table extraction works on a per-page basis. Open the document and select the page that contains the table.

rust

use pdfluent::PdfDocument;

let doc = PdfDocument::open("financial_report.pdf")?;
let page = doc.page(1)?; // 0-indexed, so this is page 2

Extract all tables from a page

extract_tables() returns a Vec<Table>. Each Table has a rows field of type Vec<Vec<TableCell>>. A cell covers more than one column when its colspan is greater than 1.

rust

let tables = page.extract_tables()?;
println!("Found {} table(s) on this page", tables.len());

Iterate rows and cells

Each TableCell contains the text content and the column span. Iterate rows and cells to process the data.

rust

for (ti, table) in tables.iter().enumerate() {
    println!("Table {}: {} rows", ti + 1, table.rows.len());
    for row in &table.rows {
        for cell in row {
            print!("[{}] ", cell.text.trim());
        }
        println!();
    }
}
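Because a cell with a colspan greater than 1 covers several columns, rows can contain different numbers of cells. If downstream code expects a fixed-width grid, one option is to pad each spanning cell with empty strings. The `TableCell` below is a local stand-in with only the two fields this guide mentions; the `usize` type for `colspan` is an assumption, and `expand_row` is a hypothetical helper.

```rust
/// Local stand-in for PDFluent's `TableCell`.
/// ASSUMPTION: `colspan` is an unsigned integer count of covered columns.
#[derive(Debug, Clone)]
struct TableCell {
    text: String,
    colspan: usize,
}

/// Expands a row so that every logical column has exactly one entry.
/// The cell's text goes in its first column; the other columns it spans
/// are filled with empty strings.
fn expand_row(row: &[TableCell]) -> Vec<String> {
    let mut out = Vec::new();
    for cell in row {
        out.push(cell.text.trim().to_string());
        // A colspan of 0 or 1 occupies a single column, so nothing is added.
        for _ in 1..cell.colspan.max(1) {
            out.push(String::new());
        }
    }
    out
}
```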

Export a table to CSV

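Write a simple CSV from the extracted rows. The loop below quotes every field by hand and doubles any embedded quotes; for anything beyond a quick export, use the csv crate, which handles quoting for you.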

rust

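use std::io::Write;

let mut out = std::fs::File::create("table.csv")?;
// Export the first table; `tables[0]` panics if the page has no tables.
for row in &tables[0].rows {
    // Wrap every field in quotes and double any embedded quotes.
    let line = row.iter()
        .map(|c| format!("\"{}\"", c.text.replace('"', "\"\"")))
        .collect::<Vec<_>>()
        .join(",");
    writeln!(out, "{}", line)?;
}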
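If you would rather not add a dependency, a standard-library sketch of the quoting rules might look like the following. `csv_field` and `write_csv` are hypothetical helpers. They quote a field only when it contains a comma, a quote, or a line break, and they write to any `Write` target so the output can be checked in memory before it goes to a file.

```rust
use std::io::{self, Write};

/// Quotes a field only when it contains a comma, a double quote, or a line
/// break. Embedded double quotes are doubled.
fn csv_field(value: &str) -> String {
    if value.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", value.replace('"', "\"\""))
    } else {
        value.to_string()
    }
}

/// Writes rows of plain strings as CSV lines to any writer.
fn write_csv<W: Write>(mut out: W, rows: &[Vec<String>]) -> io::Result<()> {
    for row in rows {
        let fields: Vec<String> = row.iter().map(|c| csv_field(c)).collect();
        writeln!(out, "{}", fields.join(","))?;
    }
    Ok(())
}
```

Passing `&mut Vec<u8>` as the writer makes the function easy to test; passing a `std::fs::File` writes the same bytes to disk.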

Tune table detection

Use TableExtractionOptions to adjust the line-merge tolerance and minimum cell size, which helps with tables that have thin or invisible borders.

rust

use pdfluent::TableExtractionOptions;

let opts = TableExtractionOptions::default()
    .line_tolerance(2.0)
    .min_cell_width(20.0);

let tables = page.extract_tables_with_options(&opts)?;
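This guide does not say how `line_tolerance` is applied internally. One plausible reading, sketched here purely as an illustration, is that ruling lines whose positions differ by no more than the tolerance are treated as a single line. `merge_lines` and `mean` are hypothetical helpers, not PDFluent functions.

```rust
/// Averages a non-empty slice of positions.
fn mean(values: &[f32]) -> f32 {
    values.iter().sum::<f32>() / values.len() as f32
}

/// Groups sorted line positions (in PDF points) into clusters in which each
/// position is within `tolerance` of the previous one, and returns one
/// averaged position per cluster.
/// ASSUMPTION: this is one way a line-merge tolerance could behave; it is
/// not taken from PDFluent's implementation.
fn merge_lines(mut positions: Vec<f32>, tolerance: f32) -> Vec<f32> {
    positions.sort_by(|a, b| a.total_cmp(b));
    let mut merged: Vec<f32> = Vec::new();
    let mut cluster: Vec<f32> = Vec::new();
    for p in positions {
        if let Some(&last) = cluster.last() {
            if p - last > tolerance {
                // The gap is too wide: close the current cluster.
                merged.push(mean(&cluster));
                cluster.clear();
            }
        }
        cluster.push(p);
    }
    if !cluster.is_empty() {
        merged.push(mean(&cluster));
    }
    merged
}
```

Under this reading, a tolerance that is too small leaves double-stroked borders as two separate lines, and one that is too large can merge the borders of narrow rows.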

Notes and tips

Table detection uses both ruling lines and whitespace-gap analysis. Documents with well-defined borders produce the most accurate results.
Merged cells (rowspan/colspan) are detected and reported in the TableCell.colspan and TableCell.rowspan fields.
For pages with multiple tables, each Table value includes its bounding box so you can identify which table on the page it corresponds to.
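As a small sketch of using bounding boxes to tell tables apart, the helper below finds which table contains a given point, for example a location where you expect a particular table to sit. `Rect` is a local stand-in with the fields described in this guide, and `table_index_at` is a hypothetical helper that takes the bounds you have collected from each table.

```rust
/// Local stand-in for PDFluent's `Rect` (x, y, width, height in PDF points).
#[derive(Debug, Clone, Copy)]
struct Rect { x: f32, y: f32, width: f32, height: f32 }

impl Rect {
    /// True when the point lies inside the rectangle or on its edge.
    fn contains(&self, px: f32, py: f32) -> bool {
        px >= self.x && px <= self.x + self.width
            && py >= self.y && py <= self.y + self.height
    }
}

/// Returns the index of the first table whose bounds contain the point.
/// The point must use the same coordinate system as the bounds.
fn table_index_at(bounds: &[Rect], px: f32, py: f32) -> Option<usize> {
    bounds.iter().position(|r| r.contains(px, py))
}
```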

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions
