Extracting plain text from 1,000 pages across 20 documents, measured with PDFluent, PyMuPDF, and pdfminer.six.
| Library | Total time (ms) | Pages/sec | Peak RAM (MB) | Notes |
|---|---|---|---|---|
| PDFluent 0.9 (Rust) | 890 | 1,124 | 62 | |
| PyMuPDF 1.24 (C/Python) | 1,340 | 746 | 145 | Wraps MuPDF C library |
| pdfminer.six 20231228 (Python) | 18,700 | 53 | 210 | Pure Python, no native ext |
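As a sanity check, the pages/sec column follows directly from each library's total time over the 1,000-page corpus. A minimal stdlib-only sketch of the derivation:

```python
# Derive pages/sec from the table's total times over the 1,000-page corpus.
PAGES = 1_000

totals_ms = {
    "PDFluent 0.9": 890,
    "PyMuPDF 1.24": 1_340,
    "pdfminer.six 20231228": 18_700,
}

pages_per_sec = {
    lib: round(PAGES / (ms / 1_000))  # pages divided by seconds, rounded
    for lib, ms in totals_ms.items()
}
# pages_per_sec == {"PDFluent 0.9": 1124, "PyMuPDF 1.24": 746, "pdfminer.six 20231228": 53}
```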
```rust
use pdfluent::Document;

fn extract_all(paths: &[&str]) -> pdfluent::Result<()> {
    for path in paths {
        let doc = Document::open(path)?;
        // Walk every page in document order; the extracted text is discarded.
        for i in 0..doc.page_count() {
            let _text = doc.page(i)?.extract_text()?;
        }
    }
    Ok(())
}
```

```python
import fitz  # PyMuPDF

def extract_all(paths):
    for path in paths:
        doc = fitz.open(path)
        # Walk every page in document order; the extracted text is discarded.
        for page in doc:
            _ = page.get_text()
        doc.close()
```

Note: These numbers are internal estimates based on pre-release builds of PDFluent. We will publish verified benchmarks with a reproducible methodology at general availability. The directional differences reflect well-established performance characteristics of Rust-based versus Python-based parsing.
Test corpus: 20 PDFs with 50 pages each (1,000 total pages). Documents are text-heavy with embedded fonts, sourced from publicly available government and legal document collections. No scanned images.
Task: Extract plain text from every page in document order. Output is discarded (not written to disk) to isolate extraction performance.
Environment: AWS c6i.2xlarge (8 vCPU, 16 GB RAM, Linux 6.1). Single-threaded. Python 3.12. 5 warm-up runs discarded, 30 timed runs averaged.
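The warm-up/averaging protocol above can be sketched with a small stdlib-only harness. This is an illustrative reconstruction, not the actual benchmark code; `benchmark` and the stand-in workload are hypothetical names:

```python
import statistics
import time

def benchmark(fn, *, warmup=5, runs=30):
    """Time fn() in ms: discard warm-up runs, average the timed runs."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()  # output is discarded inside fn to isolate extraction cost
        samples.append((time.perf_counter() - start) * 1_000)
    return statistics.mean(samples)

# Stand-in workload; in the real benchmark this would be extract_all(paths).
mean_ms = benchmark(lambda: sum(range(10_000)))
```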
Versions: PDFluent 0.9.0, PyMuPDF 1.24.0, pdfminer.six 20231228.
Note on PyMuPDF: PyMuPDF wraps the MuPDF C library and is among the fastest Python PDF libraries available. The gap reflects per-call Python binding overhead, plus any difference between the Rust and C cores, rather than differences in parsing algorithms.