Text extraction benchmark

This benchmark measures plain-text extraction from 1,000 pages across 20 documents, using PDFluent, PyMuPDF, and pdfminer.six.

Results

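| Library | Total time (ms) | Pages/sec | Peak RAM (MB) | Notes |
| --- | --- | --- | --- | --- |
| PDFluent 0.9 (Rust) | 890 | 1,124 | 62 | |
| PyMuPDF 1.24 (C/Python) | 1,340 | 746 | 145 | Wraps MuPDF C library |
| pdfminer.six 20231228 (Python) | 18,700 | 53 | 210 | Pure Python, no native extensions |
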
1.5x faster than PyMuPDF
21x faster than pdfminer
2.3x less peak memory vs PyMuPDF
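
The headline ratios can be re-derived from the table. The short Python sketch below copies the total times and peak RAM figures from the results table and recomputes pages per second and the three ratios; the variable and function names are illustrative only and are not part of any benchmark tooling.

```python
# Figures copied from the results table: total time (ms) and peak RAM (MB).
PAGES = 1_000
results = {
    "PDFluent": {"ms": 890, "ram_mb": 62},
    "PyMuPDF": {"ms": 1_340, "ram_mb": 145},
    "pdfminer.six": {"ms": 18_700, "ram_mb": 210},
}


def pages_per_sec(total_ms, pages=PAGES):
    """Throughput in pages per second for a run that took total_ms."""
    return pages / (total_ms / 1000)


# Pages/sec column, rounded to whole pages.
throughput = {name: round(pages_per_sec(r["ms"])) for name, r in results.items()}

# Headline ratios, rounded the same way as the callouts.
speedup_vs_pymupdf = round(results["PyMuPDF"]["ms"] / results["PDFluent"]["ms"], 1)
speedup_vs_pdfminer = round(results["pdfminer.six"]["ms"] / results["PDFluent"]["ms"])
memory_vs_pymupdf = round(results["PyMuPDF"]["ram_mb"] / results["PDFluent"]["ram_mb"], 1)

print(throughput)
print(speedup_vs_pymupdf, speedup_vs_pdfminer, memory_vs_pymupdf)
```
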

Benchmark code

PDFluent (Rust)
use pdfluent::Document;

fn extract_all(paths: &[&str]) -> pdfluent::Result<()> {
    for path in paths {
        let doc = Document::open(path)?;
        for i in 0..doc.page_count() {
            let _text = doc.page(i)?.extract_text()?;
        }
    }
    Ok(())
}

PyMuPDF (Python)
import fitz  # pymupdf

def extract_all(paths):
    for path in paths:
        doc = fitz.open(path)
        for page in doc:
            _ = page.get_text()
        doc.close()

Note: These numbers are internal estimates based on pre-release builds of PDFluent. We will publish verified benchmarks with reproducible methodology at general availability. The directional differences are consistent with the expected performance characteristics of native Rust parsing compared with pure-Python parsing (pdfminer.six) and Python bindings over a C library (PyMuPDF).

Methodology

Test corpus: 20 PDFs with 50 pages each (1,000 total pages). Documents are text-heavy with embedded fonts, sourced from publicly available government and legal document collections. No scanned images.

Task: Extract plain text from every page in document order. Output is discarded (not written to disk) to isolate extraction performance.

Environment: AWS c6i.2xlarge (8 vCPU, 16 GB RAM, Linux 6.1). Single-threaded. Python 3.12. 5 warm-up runs discarded, 30 timed runs averaged.
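
The run protocol described above (warm-up runs discarded, timed runs averaged) could be sketched as a small Python harness. This is a minimal illustration under stated assumptions, not the harness actually used: the `benchmark` function and its parameter names are hypothetical, and it records wall-clock time only, not peak RAM.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=30):
    """Call fn() warmup times untimed, then runs times timed.

    Returns (mean_ms, samples), where samples holds the per-run
    wall-clock times in milliseconds.
    """
    # Warm-up runs: results are discarded so caches and lazy imports settle.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)

    return statistics.mean(samples), samples
```

With the Python `extract_all` function from the benchmark code, a run would look like `mean_ms, samples = benchmark(lambda: extract_all(paths))`. Keeping the individual samples alongside the mean makes it possible to check the spread between runs.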

Versions: PDFluent 0.9.0, PyMuPDF 1.24.0, pdfminer.six 20231228.

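Note on PyMuPDF: PyMuPDF wraps the MuPDF C library, so most of its extraction work runs in native code, and it is widely regarded as one of the fastest PDF libraries available from Python. The gap to PDFluent reflects differences between the Rust and C implementations plus the overhead of the Python bindings, not differences in the extraction algorithm.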