Text extraction benchmark

This benchmark extracts plain text from 1,000 pages across 20 documents and compares PDFluent, PyMuPDF, and pdfminer.six.

Results

Library	Total time (ms)	Pages/sec	Peak RAM (MB)	Notes
PDFluent 0.9 (Rust)	890	1,124	62	-
PyMuPDF 1.24 (C/Python)	1,340	746	145	Wraps the MuPDF C library
pdfminer.six 20231228 (Python)	18,700	53	210	Pure Python, no native extensions

1.5x faster than PyMuPDF

21x faster than pdfminer.six

2.3x lower peak memory than PyMuPDF
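The throughput column and the headline ratios follow directly from the total times and peak RAM figures in the table. The short check below recomputes them; the helper names `pages_per_sec` and `ratio` are illustrative and not part of any library.

```python
# Figures copied from the results table: total time in ms and peak RAM in MB.
PAGES = 1000
RESULTS = {
    "PDFluent": {"ms": 890, "ram_mb": 62},
    "PyMuPDF": {"ms": 1340, "ram_mb": 145},
    "pdfminer.six": {"ms": 18700, "ram_mb": 210},
}


def pages_per_sec(total_ms, pages=PAGES):
    """Throughput in pages per second for a run that took total_ms."""
    return pages / (total_ms / 1000)


def ratio(other, baseline):
    """How many times larger `other` is than `baseline`."""
    return other / baseline


base = RESULTS["PDFluent"]
for name, row in RESULTS.items():
    print(name, round(pages_per_sec(row["ms"])))  # 1124, 746, 53

print(round(ratio(RESULTS["PyMuPDF"]["ms"], base["ms"]), 1))          # 1.5
print(round(ratio(RESULTS["pdfminer.six"]["ms"], base["ms"]), 1))     # 21.0
print(round(ratio(RESULTS["PyMuPDF"]["ram_mb"], base["ram_mb"]), 1))  # 2.3
```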

Benchmark code

PDFluent (Rust)

use pdfluent::PdfDocument;

fn extract_all(paths: &[&str]) -> pdfluent::Result<()> {
    for path in paths {
        let doc = PdfDocument::open(path)?;
        for i in 0..doc.page_count() {
            let _text = doc.page(i)?.text()?;
        }
    }
    Ok(())
}

PyMuPDF (Python)

import fitz  # pymupdf

def extract_all(paths):
    for path in paths:
        doc = fitz.open(path)
        for page in doc:
            _ = page.get_text()
        doc.close()
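No harness is listed here for pdfminer.six. As a sketch only, a comparable loop could use `extract_text` from `pdfminer.high_level`, which processes a whole document per call rather than page by page. Whether the benchmark used this entry point is an assumption; the function name `extract_all` simply mirrors the blocks above.

```python
def extract_all(paths):
    # Imported inside the function so this sketch can be loaded
    # on a machine without pdfminer.six installed.
    from pdfminer.high_level import extract_text

    for path in paths:
        # Assumption: the high-level API with default layout settings.
        # The result is discarded, as in the other harnesses.
        _ = extract_text(path)
```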

Note: These numbers are internal estimates based on pre-release builds of PDFluent. We will publish verified benchmarks with a reproducible methodology at general availability. The direction of the differences is consistent with well-established characteristics of Rust versus Python-based parsing.

Methodology

Test corpus: 20 PDFs with 50 pages each (1,000 total pages). Documents are text-heavy with embedded fonts, sourced from publicly available government and legal document collections. No scanned images.

Task: Extract plain text from every page in document order. Output is discarded (not written to disk) to isolate extraction performance.

Environment: AWS c6i.2xlarge (8 vCPU, 16 GB RAM, Linux 6.1), single-threaded, Python 3.12. Each library was given 5 warm-up runs, which were discarded, followed by 30 timed runs; the reported time is the average of the timed runs.

Versions: PDFluent 0.9.0, PyMuPDF 1.24.0, pdfminer.six 20231228.

Note on PyMuPDF: PyMuPDF wraps the MuPDF C library and is among the fastest Python PDF libraries available. We attribute the gap to differences between the Rust and C implementations plus Python binding overhead, not to algorithmic differences.
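The run protocol described under Environment (warm-up runs discarded, timed runs averaged) can be sketched in a few lines of Python. This is a minimal illustration, not the harness used for the numbers above: `time_runs` is a hypothetical helper, and using `time.perf_counter` as the clock is an assumption.

```python
import statistics
import time


def time_runs(fn, warmup=5, runs=30):
    """Call fn() `warmup` times untimed, then `runs` times timed.

    Returns (mean_ms, samples_ms), both in wall-clock milliseconds.
    """
    for _ in range(warmup):
        fn()  # warm caches and allocators; these runs are not recorded

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return statistics.mean(samples_ms), samples_ms
```

A call might look like `mean_ms, samples = time_runs(lambda: extract_all(paths))`, where `paths` lists the 20 corpus files. Keeping the individual samples makes it possible to report the spread alongside the mean.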
