Make scanned PDFs searchable with Google Cloud Vision

Render PDF pages to images with PDFluent, send them to Google Cloud Vision for OCR, and write the results back as an invisible text layer.

rust
use pdfluent::{Sdk, ocr::{OcrLayerOptions, OcrWord}};
use reqwest::Client;
use serde_json::{json, Value};
use base64::{Engine as _, engine::general_purpose::STANDARD as BASE64};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let api_key = std::env::var("GCP_VISION_API_KEY")?;
    let sdk = Sdk::new()?;
    let doc = sdk.open("scanned_contract.pdf")?;

    let http = Client::new();
    let mut builder = doc.add_ocr_layer();

    for page in doc.pages().filter(|p| p.is_image_only()) {
        // Render page to PNG bytes at 300 DPI
        let png_bytes = doc.render_page_to_bytes(page.index(), 300)?;
        let b64 = BASE64.encode(&png_bytes);

        let body = json!({
            "requests": [{
                "image": { "content": b64 },
                "features": [{ "type": "DOCUMENT_TEXT_DETECTION" }]
            }]
        });

        let resp: Value = http
            .post(format!(
                "https://vision.googleapis.com/v1/images:annotate?key={api_key}"
            ))
            .json(&body)
            .send()
            .await?
            .json()
            .await?;

        let words = extract_words(&resp, page.index())?;
        builder.add_page_words(page.index(), words);
    }

    let opts = OcrLayerOptions::builder()
        .text_rendering_mode(pdfluent::ocr::TextRenderingMode::Invisible)
        .build();

    let searchable = builder.finish(opts)?;
    searchable.save("contract_searchable.pdf")?;

    println!("Done.");
    Ok(())
}
Install: cargo add pdfluent

Step by step

1. Add dependencies

You need PDFluent, reqwest for the Vision API call, serde_json, base64, and tokio.

toml
# Cargo.toml
[dependencies]
pdfluent = "0.9"
reqwest = { version = "0.12", features = ["json"] }
serde_json = "1"
base64 = "0.22"
tokio = { version = "1", features = ["full"] }
anyhow = "1"
2. Set up Google Cloud Vision authentication

The quickest approach for testing is an API key. For production, use a service account with the Cloud Vision API role and the Application Default Credentials flow.

bash
# Option 1: API key (development/testing)
export GCP_VISION_API_KEY=AIzaSy...

# Option 2: Service account (production)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
# Then use the OAuth2 token endpoint or the google-cloud Rust crates
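
For the service-account route, one low-dependency pattern (a sketch, not the only option) is to shell out to the gcloud CLI for a short-lived access token and send it as an Authorization: Bearer header instead of the ?key= query parameter. The helper names below are hypothetical; the official google-cloud Rust crates are the more robust choice for production.

```rust
use std::process::Command;

// Format the Authorization header value from a raw token
// (gcloud prints the token with a trailing newline).
fn bearer_header(token: &str) -> String {
    format!("Bearer {}", token.trim())
}

// Fetch an OAuth2 access token via the gcloud CLI. Assumes gcloud is
// installed and Application Default Credentials are configured.
fn gcloud_access_token() -> std::io::Result<String> {
    let out = Command::new("gcloud")
        .args(["auth", "application-default", "print-access-token"])
        .output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}
```

With reqwest, attach it via .header("Authorization", bearer_header(&token)) on the request builder and drop the ?key= parameter.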
3. Open the PDF and identify scanned pages

PDFluent detects pages with no text layer. Pages that already have selectable text are skipped.

rust
let sdk = Sdk::new()?;
let doc = sdk.open("scanned_contract.pdf")?;

let scanned_count = doc.pages().filter(|p| p.is_image_only()).count();
println!("{} of {} pages are scanned", scanned_count, doc.page_count());
4. Render each page and call DOCUMENT_TEXT_DETECTION

Use DOCUMENT_TEXT_DETECTION rather than TEXT_DETECTION. The DOCUMENT variant returns symbols grouped into words, lines, and paragraphs, which gives better word-level bounding boxes for the PDFluent overlay.

rust
let png_bytes = doc.render_page_to_bytes(page.index(), 300)?;
let b64 = BASE64.encode(&png_bytes);

let body = json!({
    "requests": [{
        "image": { "content": b64 },
        "features": [{ "type": "DOCUMENT_TEXT_DETECTION" }]
    }]
});

let resp: Value = http
    .post(format!("https://vision.googleapis.com/v1/images:annotate?key={api_key}"))
    .json(&body)
    .send()
    .await?
    .json()
    .await?;
5. Parse the Vision API response into OcrWord entries

The DOCUMENT_TEXT_DETECTION response returns a fullTextAnnotation with pages > blocks > paragraphs > words. Each word has a boundingBox with normalizedVertices. PDFluent needs the bounding box as left/top/width/height fractions.

rust
fn extract_words(resp: &Value, _page_index: u32) -> anyhow::Result<Vec<OcrWord>> {
    let mut words = Vec::new();

    let annotation = &resp["responses"][0]["fullTextAnnotation"];
    // Avoid `unwrap_or(&vec![])` here: it borrows a temporary Vec that is
    // dropped at the end of the statement. Iterate the Option directly.
    let pages = annotation["pages"].as_array().into_iter().flatten();

    for page in pages {
        for block in page["blocks"].as_array().into_iter().flatten() {
            for para in block["paragraphs"].as_array().into_iter().flatten() {
                for word in para["words"].as_array().into_iter().flatten() {
                    // Reconstruct word text from symbols
                    let text: String = word["symbols"]
                        .as_array()
                        .into_iter()
                        .flatten()
                        .filter_map(|s| s["text"].as_str())
                        .collect();

                    if text.is_empty() { continue; }

                    // normalizedVertices are fractions of image width/height
                    let verts = &word["boundingBox"]["normalizedVertices"];
                    if let (Some(v0), Some(v2)) = (verts.get(0), verts.get(2)) {
                        let left = v0["x"].as_f64().unwrap_or(0.0);
                        let top = v0["y"].as_f64().unwrap_or(0.0);
                        let right = v2["x"].as_f64().unwrap_or(0.0);
                        let bottom = v2["y"].as_f64().unwrap_or(0.0);

                        words.push(OcrWord {
                            text,
                            left,
                            top,
                            width: right - left,
                            height: bottom - top,
                            confidence: word["confidence"].as_f64(),
                        });
                    }
                }
            }
        }
    }

    Ok(words)
}
6. Write the text layer and save the searchable PDF

Pass the collected words to the layer builder, call finish(), and save. The text is invisible at render time but fully searchable and copyable.

rust
builder.add_page_words(page.index(), words);

// After processing all pages:
let opts = OcrLayerOptions::builder()
    .text_rendering_mode(pdfluent::ocr::TextRenderingMode::Invisible)
    .conform_to_pdfa2b(true) // optional: PDF/A-2b for archival
    .build();

let searchable = builder.finish(opts)?;
searchable.save("contract_searchable.pdf")?;

Notes and tips

  • DOCUMENT_TEXT_DETECTION is better than TEXT_DETECTION for dense text. It returns a layout-aware annotation with word groupings.
  • Google Cloud Vision pricing (as of 2024): first 1,000 units/month free, then $1.50 per 1,000 images for text detection.
  • Depending on the endpoint, a word's boundingBox may carry normalizedVertices (fractions of the image size) or pixel-space vertices. If you only get pixel vertices, divide x by the image width and y by the image height before building OcrWord entries.
  • If the image is rotated, Vision returns rotated bounding boxes. PDFluent expects axis-aligned boxes. For rotated documents, normalize the rotation with doc.rotate_page() before rendering.
  • For long PDFs, batch requests: the Vision API annotate endpoint accepts up to 16 images per request in a single HTTP call.
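
If a response hands you pixel-space vertices rather than normalizedVertices, the conversion to the fractional left/top/width/height form used above is one line of arithmetic. A small helper (hypothetical name), where (w, h) are the rendered image dimensions in pixels:

```rust
// Convert a pixel-space box (top-left x0,y0 to bottom-right x1,y1)
// into fractional left/top/width/height, given image size (w, h).
fn normalize_box(x0: f64, y0: f64, x1: f64, y1: f64, w: f64, h: f64) -> (f64, f64, f64, f64) {
    (x0 / w, y0 / h, (x1 - x0) / w, (y1 - y0) / h)
}
```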

Why PDFluent for this

Pure Rust

No JVM, no runtime, no DLL dependencies. Ships as a single native binary or WASM module.

Memory safe

Rust's ownership model prevents buffer overflows and use-after-free. No segfaults in PDF parsing.

Runs anywhere

Same code runs server-side, in Docker, on AWS Lambda, on Cloudflare Workers, or in the browser via WASM.

Frequently asked questions