About Document Processing Benchmark
Document Processing Benchmark is a tool for comparing how different AI models and OCR engines extract text from documents. Upload a document, describe what you expect to find, and see how each contestant performs — scored, ranked, and compared side-by-side.
How It Works
1. You upload a document (image or PDF) and describe what text you expect to be extracted.
2. Each contestant processes your document independently and in parallel. You can watch progress in real time via streaming updates.
3. An LLM evaluator scores each contestant's output (0–100) against your expectations, providing reasoning for each score.
4. Results appear on a leaderboard with aggregate statistics. You can drill into individual runs, view step-by-step processing details, and compare outputs side by side.
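The fan-out step above can be sketched in a few lines. This is a hypothetical illustration, not the tool's actual code: `run_contestant` stands in for the real OCR/VLM call, and `asyncio.as_completed` shows how results can be streamed in finish order rather than waiting for the slowest contestant.

```python
import asyncio

# Hypothetical sketch: each contestant processes the document
# independently and in parallel.
async def run_contestant(name: str, document: bytes) -> dict:
    # Placeholder for the real OCR/VLM call; here we just simulate work.
    await asyncio.sleep(0.01)
    return {"contestant": name, "text": f"<extracted by {name}>"}

async def run_benchmark(document: bytes, contestants: list[str]) -> list[dict]:
    tasks = [asyncio.create_task(run_contestant(n, document)) for n in contestants]
    results = []
    # as_completed yields tasks in finish order, which is what lets a UI
    # stream per-contestant progress updates as each run completes.
    for task in asyncio.as_completed(tasks):
        results.append(await task)
    return results

results = asyncio.run(run_benchmark(b"%PDF-1.7 ...", ["qwen2.5-vl", "pixtral-12b"]))
```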
Scoring Methodology
Each contestant output is scored by an LLM evaluator (currently Qwen 2.5 72B) against the user's expectations. The evaluator considers completeness, accuracy, formatting, and relevant detail extraction.
Raw scores (0–100) are divided by 10 and displayed as X.X/10 for readability.
Speed Score
Measures how fast a contestant is relative to others in the field: 100 = fastest average duration, 0 = slowest.
100 × (1 − (avg_duration − min_avg) / (max_avg − min_avg))
Efficiency Score
Estimates computational efficiency as a weighted composite:
- Average processing duration: 40%
- Model parameter count: 30%
- Estimated API cost per call: 30%
Higher = more efficient. Missing factors are excluded and their weights redistributed proportionally.
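Both metrics above can be sketched directly from their definitions. This is an illustrative implementation, assuming each efficiency factor has already been normalized to a 0–100 scale (higher = more efficient) before weighting:

```python
def speed_score(avg_duration: float, min_avg: float, max_avg: float) -> float:
    # Normalize so the fastest average duration maps to 100, the slowest to 0.
    if max_avg == min_avg:
        return 100.0  # degenerate case: everyone is equally fast
    return 100 * (1 - (avg_duration - min_avg) / (max_avg - min_avg))

def efficiency_score(duration=None, params=None, cost=None):
    # Weighted composite: duration 40%, parameter count 30%, API cost 30%.
    # Missing factors are dropped and their weight is redistributed
    # proportionally among the factors that are present.
    weighted = [(duration, 0.4), (params, 0.3), (cost, 0.3)]
    present = [(value, weight) for value, weight in weighted if value is not None]
    if not present:
        return None
    total_weight = sum(weight for _, weight in present)
    return sum(value * weight for value, weight in present) / total_weight
```

For example, with the parameter count unknown, duration (40%) and cost (30%) are renormalized to 4/7 and 3/7 of the composite.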
Contestants
Qwen 2.5 VL vision-language model for document understanding
Google Gemma 4 26B vision-language model (MoE, 4B active)
Google Gemma 4 31B vision-language model
Google Gemma 3 12B vision-language model
Google Gemma 3 4B vision-language model
OpenAI GPT-4o Mini vision-language model
Anthropic Claude 3.5 Haiku vision-language model
Meta Llama 3.2 Vision 11B instruction-tuned model
Mistral Pixtral 12B vision-language model
Supported Document Types
Images
PNG, JPEG, GIF, WebP
Documents
PDF (multi-page supported)
Max file size: 20 MB
Validation: Files are validated by magic bytes, not MIME type headers
Limitations & Caveats
- Subjective scoring: Scores depend on an LLM evaluator which may have biases or inconsistencies.
- Expectation quality matters: Vague expectations produce less meaningful scores.
- Network variability: Duration measurements include network latency for cloud-based contestants.
- No ground truth: Scores measure alignment with expectations, not absolute accuracy.
- Contestant set: Currently limited to the contestants listed above; more may be added over time.