Hi, I'm a founder of Datalab. I'm not trying to take away from the launch (congrats), just wanted to respond to the specific feedback.
I'm glad you found a solution that worked for you, but this is pretty surprising to hear - our new model, chandra, saturates handwriting-heavy benchmarks like this one - https://www.datalab.to/blog/saturating-the-olmocr-benchmark - and our production models are more performant than the OSS ones.
Did you test some time ago? We've made a bunch of updates in the last couple of months. Happy to issue some credits if you ever want to try again - vik@datalab.to.
I assume you're using a PDF, and not the image you shared? You need to set force ocr or format lines to get inline math with a PDF (for images, we just OCR everything anyway, so you don't need any settings).
We're working on improving the playground generally now - expect a big update tomorrow, which among other things will default to format lines.
Thanks for the kind words! The team was just me until pretty recently, but we're growing quickly and will be addressing a lot of these issues in the next few weeks.
Perfect - it works!
Yes, I’m glad for all the time you’ve spent on this project: one of my ulterior goals is to make technical documentation for old systems and their programming environments accessible to LLMs, so that programming in retro computing can benefit from the productivity advances that modern languages enjoy. I’m sure you’ll find plenty of other user stories like that :)
For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.
Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.
I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?
We also ran an OCR benchmark with LLM as judge using structured outputs. You can check out the full methodology on the repo [1]. But the general idea is:
- Every document has ground truth text, a JSON schema, and the ground truth JSON.
- Run OCR on each document and pass the result to GPT-4o along with the JSON schema.
- Compare the predicted JSON against the ground truth JSON for accuracy.
In our benchmark, passing the ground truth text to gpt-4o gave 99.7%+ accuracy, meaning that whenever gpt-4o was given the correct text, it could extract the structured JSON values essentially 100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, the inaccuracies are isolated to OCR errors.
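For anyone who wants to replicate the idea, here's a minimal sketch of that scoring loop in Python. It assumes the OpenAI SDK's structured-output parsing and a flat per-document schema; the schema, model call, and exact-match comparison are illustrative, not our exact harness.

```python
# Minimal sketch of the LLM-as-judge scoring loop described above.
# Assumptions: openai>=1.40 structured-output parsing, a flat pydantic schema
# per document, and exact-match field comparison. Illustrative only.
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class Receipt(BaseModel):  # hypothetical per-document schema
    merchant: str
    total: float
    date: str

def extract(ocr_text: str) -> Receipt:
    """Ask gpt-4o to fill the schema from whatever text the OCR produced."""
    resp = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract the fields from the document text."},
            {"role": "user", "content": ocr_text},
        ],
        response_format=Receipt,
    )
    return resp.choices[0].message.parsed

def field_accuracy(pred: Receipt, truth: Receipt) -> float:
    """Fraction of fields where the prediction matches ground truth exactly."""
    names = type(truth).model_fields
    return sum(getattr(pred, n) == getattr(truth, n) for n in names) / len(names)
```

Running the same loop with the ground truth text in place of the OCR text gives the near-100% ceiling mentioned above; the gap between that and a provider's score is attributed to OCR errors.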
Yup, surprising results! We were able to dig in a bit more. The main culprit is the overzealous "image extraction": if Mistral classifies something as an image, it replaces the entire section with a placeholder like (image)[image_002].
This happened with a lot of full documents as well - e.g., most receipts got classified as images, so no text was extracted at all.
Wouldn't that just bias the benchmark toward the shape of the text extracted by the OCR versus the shape of the raw text alone? It doesn't seem like it would be a great way to estimate semantic accuracy.
Benchmarking is hard for markdown because of the slight formatting variations between different providers. With HTML, you can use something like TEDS (although there are issues with this, too), but with markdown, you don't have a great notion of structure, so you're left with edit distance.
I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, which doesn't make it fair.
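To make that concrete, here's a rough sketch of blockwise scoring with a plain normalized Levenshtein distance; marker's actual heuristic benchmark also includes an ordering score, which is omitted here.

```python
# Sketch of blockwise edit-distance scoring: compare each predicted block
# against its aligned ground-truth block and average the normalized scores.
# Plain Levenshtein only; an ordering score would come on top of this.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def block_score(pred: str, truth: str) -> float:
    """1.0 = identical block, 0.0 = completely different."""
    if not pred and not truth:
        return 1.0
    return 1 - levenshtein(pred, truth) / max(len(pred), len(truth))

def page_score(pred_blocks: list[str], truth_blocks: list[str]) -> float:
    """Average block score; assumes the blocks are already aligned 1:1."""
    scores = [block_score(p, t) for p, t in zip(pred_blocks, truth_blocks)]
    return sum(scores) / len(scores) if scores else 0.0
```

Full-page scoring is just the case where the whole page is treated as one block.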
There are a few different benchmark types in the marker repo:
- Heuristic (edit distance by block with an ordering score)
- LLM judging against a rubric
- LLM win rate (compare two samples from different providers)
None of these are perfect, but LLM against a rubric has matched visual inspection the best so far.
I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking could also be interesting, but it gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.
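For reference, the win-rate variant is conceptually the simplest of the three; a hedged sketch below - the prompt wording, judge model, and position randomization are illustrative choices, not the exact setup in the marker repo.

```python
# Sketch of LLM win-rate judging: show a judge model the ground truth plus two
# anonymized provider outputs, ask which is better, and aggregate wins.
# Prompt and model are illustrative, not the marker repo's exact setup.
import random
from openai import OpenAI

client = OpenAI()

def judge(ground_truth: str, output_a: str, output_b: str) -> str:
    """Return 'A' or 'B' for whichever output matches the ground truth better."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Ground truth text of a page:\n" + ground_truth +
                "\n\nOutput A:\n" + output_a +
                "\n\nOutput B:\n" + output_b +
                "\n\nWhich output is the more faithful conversion? Answer with exactly 'A' or 'B'."
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

def win_rate(samples) -> float:
    """samples: iterable of (ground_truth, our_output, their_output) tuples."""
    wins, total = 0, 0
    for truth, ours, theirs in samples:
        # Randomize position to control for any ordering bias in the judge.
        if random.random() < 0.5:
            wins += judge(truth, ours, theirs) == "A"
        else:
            wins += judge(truth, theirs, ours) == "B"
        total += 1
    return wins / total
```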
It's just a FastAPI app with endpoints that I developed and deployed before OpenAI released structured outputs. It uses a custom grammar to enforce a pydantic-like schema for chain-of-thought rollouts and structured data extraction from unstructured text. I also use it for a video transcription knowledge-base generation API.
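As a rough illustration of that kind of endpoint (the names and schema below are hypothetical, and the grammar-constrained model call is stubbed out):

```python
# Hypothetical sketch of a FastAPI endpoint that returns schema-constrained
# extractions. In the real service the LLM output is forced to match the
# schema via a custom grammar; here that call is a stub for brevity.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ExtractRequest(BaseModel):
    text: str

class Extraction(BaseModel):        # hypothetical target schema
    reasoning: str                  # chain-of-thought rollout
    entities: list[str]             # structured data pulled from the text

def run_model(text: str) -> dict:
    """Placeholder for the grammar-constrained LLM call."""
    return {"reasoning": "stub", "entities": []}

@app.post("/extract", response_model=Extraction)
def extract(req: ExtractRequest) -> Extraction:
    # The grammar guarantees the raw output parses into the schema.
    return Extraction.model_validate(run_model(req.text))
```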
Thank you for your work on Marker. It is the best OCR for PDFs I’ve found. The markdown conversion can get wonky with tables, but it still does better than anything else I’ve tried
Isn't that a potential issue? You are assuming the LLM judge is reliable. What evidence do you have to assure yourself and/or others that this is a reasonable assumption?
Really interesting benchmark, thanks for sharing! It's good to see some real-world comparisons. The hallucinations issue is definitely a key concern with LLM-based OCR, and it's important to quantify that risk. Looking forward to seeing the full benchmark results.
A hallucination is often an indication that the model doesn't know something: the internal signal gets dominated by noise from the seeded training weights. Efforts to eliminate hallucinations with a single model have found success by asking the same question in different ways and only keeping answers that agree. Logically, you could get more durable results by running the same prompt across multiple models.
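The "ask several ways and keep only the agreement" idea is simple to sketch; the ask() callable below is a placeholder for whatever model call you use.

```python
# Sketch of consistency voting across rephrasings (or across models): keep an
# answer only when a clear majority of independent samples agree.
from collections import Counter

def consistent_answer(ask, prompts, min_agreement=0.7):
    """ask: callable(prompt) -> str. Returns the majority answer, or None."""
    answers = [ask(p).strip().lower() for p in prompts]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / len(answers) >= min_agreement else None

# Usage: pass several rephrasings of the same question, or bind `ask` to
# different models, and treat a None result as "the model doesn't know".
```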
We had this article the other day[1] about how multiple LLMs can hallucinate about the same thing, so this is not guaranteed to remove hallucinations that are caused by poor or insufficient training data.
I don't see why any of that makes logical sense. These models require such enormous amounts of training data that they pretty much MUST share training data to a very large degree. The training data itself is what they spit out. So "hallucinations" are just the training data you get out, which is the entire point of the models in the first place. There is no difference between a hallucination and a correct answer from the perspective of the math.
Isn't it just statistical word-pattern prediction based on training data? These models likely don't "know" anything anyway and cannot verify "truth" or facts. Reasoning attempts seem to me basically like looping until the model finds a self-satisfying equilibrium state with different output.
In that way, LLMs are more human than, say, a database or a book containing agreed-upon factual information which can be directly queried on demand.
Imagine if there were just ONE human with human limitations on the entire planet who had been taught everything for a long time - how reliable do you think they would be at information retrieval? Even highly trained individuals (e.g. professors) can get things wrong on their specific topics at times. But this is not what we expect and demand from computers.
I'm a fan of the team of Allen AI and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.
Throughput - they benchmarked marker API cost vs local inference cost for olmocr. In our testing, marker locally gets 20-120 pages per second on an H100 (without custom kernels, etc). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.
Accuracy - their quality benchmarks are based on win rate with only 75 samples - which are different between each tool pair. The samples were filtered down from a set of ~2000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and LLM as a judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).
Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.
Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields, but can all render to html. It can write out to json, html, or markdown.
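A toy version of that block tree, just to show the shape of the idea (these are not marker's actual classes):

```python
# Toy illustration of a block tree where every block renders to html and the
# json output is derived from the same tree. Not marker's real classes.
from dataclasses import dataclass, field

@dataclass
class Block:
    block_type: str                               # e.g. "page", "table", "text"
    fields: dict = field(default_factory=dict)    # arbitrary per-block data
    children: list["Block"] = field(default_factory=list)

    def render_html(self) -> str:
        inner = "".join(c.render_html() for c in self.children)
        if self.block_type == "text":
            return f"<p>{self.fields.get('text', '')}</p>"
        return f"<div class='{self.block_type}'>{inner}</div>"

    def to_json(self) -> dict:
        return {
            "type": self.block_type,
            "fields": self.fields,
            "children": [c.to_json() for c in self.children],
        }

page = Block("page", children=[Block("text", {"text": "Hello"})])
print(page.render_html())   # <div class='page'><p>Hello</p></div>
```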
I integrated gemini recently to improve accuracy on certain blocks like tables (get the initial text, then pass it to gemini to refine). Marker alone works about as well as gemini alone, but together they benchmark much better.
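The refine pass is conceptually just a second look at the page; a hedged sketch with the google-generativeai SDK - the model name and prompt are illustrative, not marker's internal pipeline.

```python
# Sketch of a hybrid refine pass: send the page image plus the initially
# extracted table to gemini and ask for a corrected version. Illustrative only.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")   # assumed model name

def refine_table(image_path: str, initial_html: str) -> str:
    """Return gemini's corrected version of an initially extracted table."""
    prompt = (
        "Here is a table extracted from the attached page image:\n"
        f"{initial_html}\n"
        "Fix any cell, header, or merge errors and return corrected HTML only."
    )
    response = model.generate_content([prompt, Image.open(image_path)])
    return response.text
```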
I used sxml [0] extensively (and unironically) in this project.
The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.
Why process each block separately? If there are ink smudges, photocopier glitches, etc., wouldn't it guess some things better from richer context, like acronyms in rows used across the other tables?
Docling is a great project, happy to see more people building in the space.
Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag (quick usage sketch after this list). A few specific things we do differently:
- We have a hybrid mode with gemini that merges tables across pages, improves quality on forms, etc.
- We run an ordering model, so ordering is better for docs where the PDF order is bad
- OCR is a lot better; we train our own model, surya - https://github.com/VikParuchuri/surya
- References and links
- Better equation conversion (soon including inline)
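For anyone who wants to try the comparison, the simplest way to drive marker from Python is to shell out to its CLI. The --use_llm flag is the one mentioned above; the entry-point name and output option below are assumptions based on typical usage, so check marker's README if they differ.

```python
# Hedged sketch: run marker on a single PDF with the hybrid --use_llm mode.
# The --use_llm flag is the one mentioned above; the CLI entry point and
# --output_dir option are assumptions - adjust to marker's current README.
import subprocess

subprocess.run(
    [
        "marker_single",          # assumed CLI entry point
        "paper.pdf",
        "--use_llm",              # enable the gemini hybrid mode
        "--output_dir", "out",    # assumed output option
    ],
    check=True,
)
```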
Hey, I'm the author of marker - thanks for sharing. Most of the processing time is model inference right now. I've been retraining some models lately onto new architectures to improve speed (layout, tables, LaTeX OCR).
We recently integrated gemini flash (via the --use_llm flag), which maybe moves us towards the "hybrid system" you mentioned. Hoping to add support for other APIs soon, but focusing on improving quality/speed now.
Happy to chat if anyone wants to talk about the difficulties of parsing PDFs, or has feedback - email in profile.
Very cool, any plans for a dockerized API of marker similar to what Unstructured released? I know you have a very attractively priced serverless offering (https://www.datalab.to) but having something to develop against locally would be great (for those of us not in the Python world).
We do meaningful research (we’ve trained several SoTA models), ship product, and contribute to open source. We’re hiring for 2 roles to help us scale:
Senior fullstack software engineer
- work across our open source repos, inference api, and frontend product
- interact with our user community
- you’ll need to be pragmatic, and embrace “boring” technology. Our stack is fastapi, pytorch, htmx, postgres, and redis. We deploy to render, and do inference with serverless gpus.
- requires having built something impressive, ideally an open source project
Head of business operations
- first non-technical hire
- work across multiple areas, including finance, hiring, and sales
- you’ll need to be extremely organized and able to get a lot done
- requires experience leading operations at an early-stage company
Email careers@datalab.to if you’re interested - include a link to something you’ve built if possible. You can also read more here - https://datalab-to.notion.site/careers-jan.
Hi, I'm the author of surya (https://github.com/VikParuchuri/surya) - working on improving speed and accuracy now. Happy to collaborate if you have specific page types it's not working on. For modern/clean documents it benchmarks very similarly to Google Cloud, but working on supporting older documents better now.
Hello Vik, and thanks for your work on Surya. I really liked it once I found it, but my main issue now is the latency and hardware requirements, as accuracy could be fixed over time for different page types.
For example, I'm deploying tahweel to one of my webapps to allow a limited number of users to run OCR on PDF files. I'm using a small CPU machine for this; deploying Surya will not be the same, and I think you are facing similar issues at https://www.datalab.to.
Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.