About Document Processing Benchmark

Document Processing Benchmark is a tool for comparing how different AI models and OCR engines extract text from documents. Upload a document, describe what you expect to find, and see how each contestant performs — scored, ranked, and compared side-by-side.

How It Works

1 Upload

You upload a document (image or PDF) and describe what text you expect to be extracted.

2 Process

Each contestant processes your document independently and in parallel. You can watch progress in real time via streaming updates.
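The fan-out in this step can be sketched with `Promise.allSettled`, so one failing contestant never blocks the others. The `Contestant` shape and names below are illustrative, not the app's actual API:

```typescript
type Contestant = {
  name: string;
  run: (doc: Uint8Array) => Promise<string>; // resolves to extracted text
};

// Run every contestant on the same document, independently and in parallel.
// allSettled (unlike Promise.all) never rejects, so each result is reported
// individually as fulfilled or failed.
async function processAll(doc: Uint8Array, contestants: Contestant[]) {
  const settled = await Promise.allSettled(contestants.map((c) => c.run(doc)));
  return settled.map((r, i) => ({
    contestant: contestants[i].name,
    ok: r.status === "fulfilled",
    output: r.status === "fulfilled" ? r.value : null,
  }));
}
```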

3 Score

An LLM evaluator scores each contestant's output (0–100) against your expectations, providing reasoning for each score.

4 Compare

Results appear on a leaderboard with aggregate statistics. You can drill into individual runs, view step-by-step processing details, and compare outputs side-by-side.

Scoring Methodology

Overall Score (0–100)

Each contestant output is scored by an LLM evaluator (currently Qwen 2.5 72B) against the user's expectations. The evaluator considers completeness, accuracy, formatting, and relevant detail extraction.

Scores are displayed as X.X/10 for readability.

Green: 8.0+ · Yellow: 4.0–7.9 · Red: below 4.0

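A minimal sketch of the display mapping, assuming scores are stored on the 0–100 scale and divided by 10 for display (function names are illustrative):

```typescript
// Convert a raw 0–100 score to the X.X/10 display form.
function displayScore(raw: number): string {
  return `${(raw / 10).toFixed(1)}/10`;
}

// Bucket the displayed (0–10) value into the badge colors above.
function scoreColor(raw: number): "green" | "yellow" | "red" {
  const d = raw / 10;
  if (d >= 8.0) return "green";
  if (d >= 4.0) return "yellow";
  return "red";
}
```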
Speed Index (0–100)

Measures how fast a contestant is relative to others in the field.

100 = fastest average duration, 0 = slowest.

100 × (1 - (avg_duration - min_avg) / (max_avg - min_avg))

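The formula above is a min–max normalization of each contestant's average duration; a direct sketch:

```typescript
// Map average durations to a 0–100 speed index:
// the fastest contestant gets 100, the slowest gets 0.
function speedIndex(avgDurations: number[]): number[] {
  const min = Math.min(...avgDurations);
  const max = Math.max(...avgDurations);
  if (max === min) return avgDurations.map(() => 100); // all tied
  return avgDurations.map((d) => 100 * (1 - (d - min) / (max - min)));
}
```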
Resource Index (0–100)

Estimates computational efficiency as a weighted composite:

  • Average processing duration: 40%
  • Model parameter count: 30%
  • Estimated API cost per call: 30%

Higher = more efficient. If a factor is unavailable for a contestant (for example, parameter count for a closed-weight model), it is excluded and its weight is redistributed across the remaining factors.
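A sketch of the composite, assuming each factor has already been normalized to 0–100 (higher = better) and that missing weights are redistributed proportionally; the field names are illustrative:

```typescript
type ResourceFactors = {
  durationScore?: number; // 0–100, higher = faster
  paramScore?: number;    // 0–100, higher = fewer parameters
  costScore?: number;     // 0–100, higher = cheaper
};

// Weighted composite: 40% duration, 30% parameters, 30% cost.
// Missing factors are dropped and their weight spread over the rest.
function resourceIndex(f: ResourceFactors): number | null {
  const parts: Array<[number, number]> = [];
  if (f.durationScore !== undefined) parts.push([f.durationScore, 0.4]);
  if (f.paramScore !== undefined) parts.push([f.paramScore, 0.3]);
  if (f.costScore !== undefined) parts.push([f.costScore, 0.3]);
  if (parts.length === 0) return null;
  const totalWeight = parts.reduce((sum, [, w]) => sum + w, 0);
  return parts.reduce((sum, [v, w]) => sum + v * (w / totalWeight), 0);
}
```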

Contestants

Tesseract OCR

Open-source OCR engine via tesseract.js

Cost per call: Free

Qwen Vision

Qwen 2.5 VL vision-language model for document understanding

Parameters: 72B
Cost per call: $0.0030

Gemma 4 26B

Google Gemma 4 26B vision-language model (MoE, 4B active)

Parameters: 26B
Cost per call: $0.0010

Gemma 4 31B

Google Gemma 4 31B vision-language model

Parameters: 31B
Cost per call: $0.0015

Gemma 3 12B

Google Gemma 3 12B vision-language model

Parameters: 12B
Cost per call: $0.0005

Gemma 3 4B

Google Gemma 3 4B vision-language model

Parameters: 4B
Cost per call: $0.0002

GPT-4o Mini

OpenAI GPT-4o Mini vision-language model

Parameters: 8B
Cost per call: $0.0002

Claude 3.5 Haiku

Anthropic Claude 3.5 Haiku vision-language model

Parameters: 8B
Cost per call: $0.0003

Llama 3.2 Vision 11B

Meta Llama 3.2 Vision 11B instruction-tuned model

Parameters: 11B
Cost per call: $0.0002

Pixtral 12B

Mistral Pixtral 12B vision-language model

Parameters: 12B
Cost per call: $0.0002

Supported Document Types

Images

PNG, JPEG, GIF, WebP

Documents

PDF (multi-page supported)

Max file size: 20 MB

Validation: Files are validated by their magic bytes (file signatures), not by the MIME type the client reports.
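Signature-based validation can be sketched as a prefix check against known magic bytes; the signature table below covers the supported types, but the actual implementation may differ:

```typescript
// Known file signatures (magic bytes) at offset 0.
const SIGNATURES: Array<{ type: string; bytes: number[] }> = [
  { type: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47] },       // ‰PNG
  { type: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
  { type: "image/gif", bytes: [0x47, 0x49, 0x46, 0x38] },       // GIF8
  { type: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // %PDF
];

// Detect a supported type from the file's leading bytes, or null if unknown.
function detectType(buf: Uint8Array): string | null {
  for (const { type, bytes } of SIGNATURES) {
    if (bytes.every((b, i) => buf[i] === b)) return type;
  }
  // WebP: RIFF container — "RIFF" at offset 0 and "WEBP" at offset 8.
  if (
    buf[0] === 0x52 && buf[1] === 0x49 && buf[2] === 0x46 && buf[3] === 0x46 &&
    buf[8] === 0x57 && buf[9] === 0x45 && buf[10] === 0x42 && buf[11] === 0x50
  ) {
    return "image/webp";
  }
  return null;
}
```

Checking bytes rather than the client-supplied MIME header prevents a mislabeled or malicious upload from bypassing the type filter.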

Limitations & Caveats

  • Subjective scoring: Scores depend on an LLM evaluator which may have biases or inconsistencies.
  • Expectation quality matters: Vague expectations produce less meaningful scores.
  • Network variability: Duration measurements include network latency for cloud-based contestants.
  • No ground truth: Scores measure alignment with expectations, not absolute accuracy.
  • Contestant set: Currently limited to 10 contestants — more may be added over time.
