Performance · March 10, 2026 · 8 min read · by Jasper de Winter

PDFluent vs. iText: A Performance Comparison

Benchmark · iText · Performance · Rust · JVM

Benchmarks are easy to manipulate and hard to interpret. This post attempts a fair comparison between PDFluent and iText 9 for server-side PDF processing. We explain exactly how we measured, what the numbers mean, and where iText is legitimately better.

One upfront note: iText has no rendering engine. It cannot rasterize a PDF to an image. Any benchmark involving rendering is PDFluent-only, so we exclude those entirely and focus on the operations both libraries support.

Test environment

```text
Instance:    AWS c6i.2xlarge
CPU:         Intel Xeon Ice Lake, 8 vCPU @ 3.5 GHz
RAM:         16 GB
OS:          Ubuntu 22.04 LTS
PDFluent:    v1.0 (current release)
iText:       9.1.0 (Java), OpenJDK 21.0.2 (GraalVM)
JVM flags:   -Xms512m -Xmx4g -XX:+UseG1GC
Warmup:      10 runs discarded
Measurement: median of 100 runs
PDF corpus:  100 files, 1–50 pages, varied content
```

We used GraalVM (not standard OpenJDK) to give iText the best possible JVM performance. Standard OpenJDK JIT typically runs 15–25% slower than GraalVM for this type of workload.
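The warmup-then-median protocol from the environment listing can be sketched in a few lines of Python. This is only an illustration of the measurement idea, not our actual harness; `median_runtime` and the stand-in workload are names made up for this example.

```python
import statistics
import time
from typing import Callable


def median_runtime(task: Callable[[], object], warmup: int = 10, runs: int = 100) -> float:
    """Return the median wall-clock time of `task` in seconds.

    The first `warmup` calls are executed but not recorded, which gives
    a JIT compiler and the OS page cache time to settle.
    """
    for _ in range(warmup):
        task()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in workload (assumption); a real run would parse one PDF per call.
    elapsed = median_runtime(lambda: sum(range(100_000)))
    print(f"median: {elapsed * 1000:.3f} ms")
```

The median is used rather than the mean because a single GC pause or noisy-neighbor spike would otherwise skew the result.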

Cold start

"Cold start" means: from process launch to first byte of output. This matters for serverless functions, short-lived containers, and any deployment where you can't keep a warm process pool running.

| | PDFluent | iText (GraalVM) | iText (std. JDK 21) |
|---|---|---|---|
| Process start → ready | 8 ms | 820 ms | 1,100 ms |
| First document parsed | +12 ms | +35 ms | +40 ms |
| Total cold start | ~20 ms | ~855 ms | ~1,140 ms |
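Cold start in the sense defined above (process launch to first byte of output) can be timed from outside the process. The sketch below is one way to do it, assuming the program under test writes its result to stdout; `time_to_first_byte` is a hypothetical helper, and the command line passed to it would be whatever launches each library's benchmark binary.

```python
import subprocess
import sys
import time


def time_to_first_byte(argv: list[str]) -> float:
    """Seconds from process launch until the first byte appears on stdout."""
    start = time.perf_counter()
    with subprocess.Popen(argv, stdout=subprocess.PIPE) as proc:
        first = proc.stdout.read(1)  # blocks until one byte arrives or stdout closes
        elapsed = time.perf_counter() - start
        proc.stdout.read()  # drain the rest so the child can exit cleanly
    if not first:
        raise RuntimeError("process exited without writing to stdout")
    return elapsed


if __name__ == "__main__":
    # Placeholder command (assumption): times the Python interpreter itself.
    print(f"{time_to_first_byte([sys.executable, '-c', 'print(1)']) * 1000:.1f} ms")
```

Because every call spawns a fresh process, nothing stays warm between samples, which is the point of a cold-start measurement.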

The JVM startup time dominates iText's cold start. GraalVM's ahead-of-time compilation (GraalVM Native Image) can reduce this significantly — we tested that separately:

| | PDFluent | iText (GraalVM Native Image) |
|---|---|---|
| Process start → ready | 8 ms | 85 ms |
| First document parsed | +12 ms | +45 ms |
| Total cold start | ~20 ms | ~130 ms |

GraalVM Native Image brings iText much closer. If you compile iText to a native binary, the cold start gap shrinks from ~43× (855 ms vs. 20 ms on GraalVM JIT) to ~6.5× (130 ms vs. 20 ms). The tradeoff: GraalVM Native Image has restrictions (no dynamic class loading, reflection requires configuration) that can be difficult to satisfy with a complex library like iText.
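The cold-start ratios follow directly from the "Total cold start" rows in the two tables. A short check, using only those published totals (`slowdown` is a name chosen for this example):

```python
# "Total cold start" values from the tables above, in milliseconds.
PDFLUENT_MS = 20
ITEXT_MS = {
    "GraalVM (JIT)": 855,
    "standard JDK 21": 1140,
    "GraalVM Native Image": 130,
}


def slowdown(itext_ms: float, pdfluent_ms: float = PDFLUENT_MS) -> float:
    """How many times longer iText takes than PDFluent to cold-start."""
    return itext_ms / pdfluent_ms


for runtime, ms in ITEXT_MS.items():
    print(f"{runtime}: {slowdown(ms):.1f}x")
```

That works out to roughly 43× on GraalVM JIT, 57× on the standard JDK, and 6.5× on Native Image.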

PDF parsing throughput

Parse 1,000 PDFs (varied sizes, 1–50 pages each). Measure total wall time and peak RSS.

| | PDFluent | iText 9 |
|---|---|---|
| Total time (sequential) | 18.2 s | 61.4 s |
| Total time (parallel, 8 threads) | 3.1 s | 9.8 s |
| Peak RSS | 480 MB | 1,940 MB |
| Throughput (sequential) | 54.9 docs/s | 16.3 docs/s |
| Throughput (parallel) | 322 docs/s | 102 docs/s |

PDFluent's memory advantage is significant: 480 MB vs. 1,940 MB peak RSS for the same workload, roughly a 4× difference. This is partly JVM overhead (heap metadata, GC bookkeeping, class metadata) and partly iText's object model, which keeps more of the document in memory during parsing.
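For readers who want to collect the same three quantities (wall time, throughput, peak RSS) on their own workload, here is a minimal sketch. It assumes a Unix system and a `parse` callable that you supply yourself; `run_batch` and `peak_rss_mb` are illustrative names, not part of either library.

```python
import resource
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def peak_rss_mb() -> float:
    """Peak resident set size of this process so far, in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes, macOS reports bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_batch(parse: Callable[[str], object], paths: Iterable[str], workers: int = 1) -> dict:
    """Parse every path and report wall time, throughput and peak RSS."""
    paths = list(paths)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces all results so the timer covers the whole batch.
        list(pool.map(parse, paths))
    wall = time.perf_counter() - start
    return {
        "docs": len(paths),
        "wall_s": wall,
        "docs_per_s": len(paths) / wall,
        "peak_rss_mb": peak_rss_mb(),
    }
```

`ru_maxrss` is a process-wide high-water mark, so the sequential and parallel configurations should each run in a fresh process. Threads only speed things up here if `parse` releases the GIL, for example by calling into native code.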

Memory profile

We also measured peak memory per document, not just batch totals:

| Document size | PDFluent peak | iText peak |
|---|---|---|
| Simple, 5 pages | 8 MB | 85 MB |
| Complex, 20 pages | 42 MB | 210 MB |
| Large, 100 pages | 180 MB | 780 MB |
| Scanned (image-heavy), 50 pages | 320 MB | 1,200 MB |

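The scanned document case shows the biggest absolute gap (880 MB), although its ratio is the smallest of the four rows (about 3.8×, compared with about 10.6× for the simple 5-page case). PDFluent decodes image streams lazily and frees them after processing; iText's parser holds more of the object graph in the Java heap.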
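The effect of decoding lazily versus keeping everything alive is easy to demonstrate in isolation. The following sketch is not how either library is implemented; it uses synthetic zlib-compressed buffers as stand-ins for image streams and Python's `tracemalloc` to compare peak heap usage of the two strategies.

```python
import tracemalloc
import zlib

# Ten compressed "image streams", 1 MB each once decoded (synthetic data).
STREAMS = [zlib.compress(bytes(1_000_000)) for _ in range(10)]


def total_bytes_eager(streams: list[bytes]) -> int:
    """Decode every stream first, then process: all buffers are alive at once."""
    decoded = [zlib.decompress(s) for s in streams]
    return sum(len(d) for d in decoded)


def total_bytes_lazy(streams: list[bytes]) -> int:
    """Decode one stream at a time; each buffer is freed before the next decode."""
    return sum(len(zlib.decompress(s)) for s in streams)


def peak_mb(fn, streams: list[bytes]) -> float:
    """Peak Python heap usage while running fn(streams), in MB."""
    tracemalloc.start()
    fn(streams)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1_000_000


if __name__ == "__main__":
    print(f"eager: {peak_mb(total_bytes_eager, STREAMS):.1f} MB")
    print(f"lazy:  {peak_mb(total_bytes_lazy, STREAMS):.1f} MB")
```

Both functions compute the same result; only the number of decoded buffers alive at the same time differs, and that is what drives the peak.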

Text extraction

| | PDFluent | iText 9 |
|---|---|---|
| 100 simple docs (text-only) | 12.1 s | 41.8 s |
| 100 complex docs (mixed) | 19.4 s | 68.2 s |
| 100 CJK docs (multi-byte fonts) | 24.7 s | 89.3 s |
| Output quality, simple | Good | Good |
| Output quality, complex layout | Moderate | Good |

iText's text extraction quality for complex layouts (multi-column, tables, mixed writing directions) is better than ours. iText uses a spatial clustering algorithm that produces more accurate reading-order reconstruction. Our extractor works well for simple layouts but can produce out-of-order text for documents with overlapping text blocks or unusual column structures.
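To see why reading order is hard on multi-column pages, consider a toy example. The sketch below contrasts a naive top-to-bottom sort with a very simple one-dimensional gap clustering on x positions. It is neither iText's algorithm nor PDFluent's extractor; `Chunk`, `min_gap`, and the coordinates are assumptions made up for the illustration.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    x: float  # left edge in PDF points
    y: float  # baseline in PDF points (origin bottom-left, larger y is higher)
    text: str


def naive_order(chunks: list[Chunk]) -> list[str]:
    """Top-to-bottom, then left-to-right: interleaves side-by-side columns."""
    return [c.text for c in sorted(chunks, key=lambda c: (-c.y, c.x))]


def column_order(chunks: list[Chunk], min_gap: float = 100.0) -> list[str]:
    """Cluster chunks into columns by x position, then read each column downward."""
    columns: list[list[Chunk]] = []
    for chunk in sorted(chunks, key=lambda c: c.x):
        # Stay in the current column unless the horizontal jump reaches min_gap.
        if columns and chunk.x - columns[-1][-1].x < min_gap:
            columns[-1].append(chunk)
        else:
            columns.append([chunk])
    ordered: list[str] = []
    for column in columns:
        ordered.extend(c.text for c in sorted(column, key=lambda c: -c.y))
    return ordered


# Two columns with two lines each.
page = [
    Chunk(50, 700, "L1"), Chunk(320, 700, "R1"),
    Chunk(50, 680, "L2"), Chunk(320, 680, "R2"),
]
print(naive_order(page))   # ['L1', 'R1', 'L2', 'R2']
print(column_order(page))  # ['L1', 'L2', 'R1', 'R2']
```

Real documents add tables, overlapping blocks, and mixed writing directions, which is where a fixed gap threshold like this breaks down and a more careful spatial analysis pays off.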

PDF/A validation

iText is the reference implementation for PDF/A processing. Their validation engine is co-developed with the veraPDF team and is considered the most accurate PDF/A validator available. We validate against the same veraPDF conformance checker to compare our results.

| Test set | PDFluent (speed) | iText (speed) | PDFluent accuracy | iText accuracy |
|---|---|---|---|---|
| 100 conformant docs | 3.5 s | 28.4 s | 100% | 100% |
| Isartor test suite (400 docs) | 14.2 s | 112.8 s | 98.1% | 99.9% |
| Custom corpus (1,000 docs) | 36.1 s | 287.2 s | 99.5% | 99.8% |

We're faster; they're more accurate. The 1.8-percentage-point accuracy gap on the Isartor test suite corresponds to 7 documents where we produce a false negative (we say conformant; the document isn't) or false positive (we flag a violation that isn't there). iText has one false result on this corpus.

For most production use cases, 98.1% accuracy is acceptable. For a workflow where you're certifying documents for legal archiving, the remaining 1.9% error rate matters. In that case, route the edge cases through iText's validation or use veraPDF directly.
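Accuracy here means agreement with a reference validator on a per-document basis. A minimal sketch of that scoring, assuming each validator's output has been reduced to a "conformant: yes/no" verdict per file (`score` and the document names are hypothetical):

```python
def score(verdicts: dict[str, bool], reference: dict[str, bool]) -> dict:
    """Compare conformance verdicts against a reference validator.

    True means "conformant". A false negative is a missed violation
    (we say conformant, the reference says not); a false positive is a
    violation reported on a document the reference accepts.
    """
    false_neg = [d for d, ok in verdicts.items() if ok and not reference[d]]
    false_pos = [d for d, ok in verdicts.items() if not ok and reference[d]]
    wrong = len(false_neg) + len(false_pos)
    return {
        "accuracy": 1 - wrong / len(verdicts),
        "false_negatives": false_neg,
        "false_positives": false_pos,
    }


if __name__ == "__main__":
    # Toy data, not our benchmark results.
    reference = {"a.pdf": True, "b.pdf": False, "c.pdf": True, "d.pdf": False}
    ours = {"a.pdf": True, "b.pdf": True, "c.pdf": False, "d.pdf": False}
    print(score(ours, reference))
```

Keeping the two error lists separate matters for archiving workflows: a false negative lets a non-conformant file through, while a false positive only costs a manual review.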

ZUGFeRD / Factur-X

iText has good ZUGFeRD/Factur-X support via their pdfHTML and ZUGFeRD-specific API. PDFluent's pdf-invoice crate covers the same ground.

| | PDFluent | iText 9 |
|---|---|---|
| Generate ZUGFeRD EN 16931 | 14 ms | 48 ms |
| Validate ZUGFeRD embedding | 8 ms | 22 ms |
| Extract XML from Factur-X PDF | 6 ms | 18 ms |
| Correctness (schema + schematron) | 100% | 100% |

Reproducing these benchmarks

The benchmark code will live on GitHub, but the public corpus is not open-source yet. Once it is published, this is how to run the benchmarks yourself:

```bash
# Public benchmark corpus is not yet open-source.
# Mail [email protected] for the methodology and corpus access.
# Once published, the snippet below is the entry point:
cd benchmarks

# Install dependencies
cargo build --release
mvn -f itext-bench/pom.xml package

# Generate test corpus (requires ~2GB disk)
./scripts/generate-corpus.sh

# Run benchmarks
./scripts/run-all.sh --output results.json

# View results
./scripts/report.py results.json
```

The corpus generator creates PDFs using a mix of open-source tools (Ghostscript, LibreOffice, LaTeX) so the test set is reproducible. If you get substantially different results on your hardware, open an issue — we want to know.

Summary

| Category | Winner | Notes |
|---|---|---|
| Cold start | PDFluent | ~43× faster than iText on GraalVM JIT (~57× vs. standard JDK 21); ~6.5× vs. GraalVM Native Image |
| Parse throughput | PDFluent | ~3.4× faster sequential, ~3.2× parallel |
| Memory usage | PDFluent | ~4× lower peak RSS in batch; ~4–11× lower per document |
| Text extraction quality | iText | Better multi-column / complex layout handling |
| PDF/A accuracy | iText | Co-developed with veraPDF; marginally more accurate |
| PDF/A speed | PDFluent | ~8× faster validation |
| PDF rendering | PDFluent | iText has no rendering engine |
| XFA processing | PDFluent | iText only flattens via pdfXFA add-on |
| ZUGFeRD/Factur-X | PDFluent | Both correct; PDFluent ~3× faster |
| Java/JVM integration | iText | 25 years of ecosystem; Maven-native |

When to use iText: You're on the JVM, you need best-in-class PDF/A accuracy, or you need complex text extraction from multi-column documents. The AGPL license is also genuinely useful if you're building open-source software.

When to use PDFluent: You need fast cold starts (serverless), low memory footprint (high concurrency), PDF rendering, XFA forms, WebAssembly deployment, or non-JVM language bindings.