Hacker News | vikp's comments

Hey, I'm the founder of Datalab (we released Chandra OCR). I see someone requested it below - happy to help you all get set up. I'm vik@datalab.to


Hi, I'm a founder of Datalab. I'm not trying to take away from the launch (congrats), just wanted to respond to the specific feedback.

I'm glad you found a solution that worked for you, but this is pretty surprising to hear - our new model, Chandra, saturates handwriting-heavy benchmarks like this one - https://www.datalab.to/blog/saturating-the-olmocr-benchmark - and our production models are more performant than our open-source ones.

Did you test some time ago? We've made a bunch of updates in the last couple of months. Happy to issue some credits if you ever want to try again - vik@datalab.to.


Thanks, Vik. Happy to try the model again. Is BAA available?


Yes, we can sign a BAA!


Hi, author of marker here - I tried your image, and I don't see the issues you're describing with the newest version of marker (1.7.5).

I ran both with no setting specified, and with force_ocr, and I didn't see the issues either time.


Hi there - thanks for getting back to me. I do genuinely want this workflow to work - Marker has been very useful for other purposes for me!

I’m currently using the Datalab online playground with default settings - does that enable inline math recognition?


I assume you're using a PDF, and not the image you shared? You need to set force_ocr or format_lines to get inline math with a PDF (for images, we just OCR everything anyway, so you don't need any settings).

We're working on improving the playground generally now - expect a big update tomorrow, which among other things will default to format lines.

Thanks for the kind words! The team was just me until pretty recently, but we're growing quickly and will be addressing a lot of issues in the next few weeks.


Perfect - it works! Yes, I’m glad for all the time you’ve spent on this project: one of my ulterior goals is to make technical documentation for old systems and their programming environments accessible to LLMs, so that programming in retro computing can benefit from the advances in productivity that modern languages have. I’m sure you’ll find plenty of other user stories like that :)


I ran a partial benchmark against marker - https://github.com/VikParuchuri/marker .

Across 375 samples with LLM as a judge, Mistral scores 4.32 and marker 4.41. Marker can run inference at between 20 and 120 pages per second on an H100.

You can see the samples here - https://huggingface.co/datasets/datalab-to/marker_comparison... .

The code for the benchmark is here - https://github.com/VikParuchuri/marker/tree/master/benchmark... . Will run a full benchmark soon.

Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.


> with LLM as a judge

For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.

Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.

I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?

[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...


We also ran an OCR benchmark with LLM as judge using structured outputs. You can check out the full methodology on the repo [1]. But the general idea is:

- Every document has ground truth text, a JSON schema, and the ground truth JSON.

- Run OCR on each document and pass the result to GPT-4o along with the JSON schema.

- Compare the predicted JSON against the ground truth JSON for accuracy.

In our benchmark, going from the ground truth text => gpt-4o gave 99.7%+ accuracy. Meaning that whenever gpt-4o was given the correct text, it could extract the structured JSON values ~100% of the time. So if we pass in the OCR text from Mistral and it scores 70%, that means the inaccuracies are isolated to OCR errors.

https://github.com/getomni-ai/benchmark
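
For anyone who wants to see the shape of that comparison step, here's a minimal sketch (not the actual benchmark code - the flattening helper is hypothetical and field handling is simplified):

  def flatten(obj, prefix=""):
      # Flatten nested JSON into {"a.b.0.c": value} pairs for field-level comparison.
      items = {}
      if isinstance(obj, dict):
          for k, v in obj.items():
              items.update(flatten(v, f"{prefix}{k}."))
      elif isinstance(obj, list):
          for i, v in enumerate(obj):
              items.update(flatten(v, f"{prefix}{i}."))
      else:
          items[prefix.rstrip(".")] = obj
      return items

  def json_accuracy(predicted: dict, ground_truth: dict) -> float:
      # Fraction of ground-truth fields that the predicted JSON matches exactly.
      truth, pred = flatten(ground_truth), flatten(predicted)
      if not truth:
          return 1.0
      return sum(pred.get(k) == v for k, v in truth.items()) / len(truth)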


Were you guys able to finish running the benchmark with Mistral and get a 70% score? I missed that.

Edit - I see it on the Benchmark page now. Woof, low 70% scores in some areas!

https://getomni.ai/ocr-benchmark


Yup, surprising results! We were able to dig in a bit more. The main culprit is the overzealous "image extraction", where if Mistral classifies something as an image, it will replace the entire section with (image)[image_002).

And it happened with a lot of full documents as well. Ex: most receipts got classified as images, and so it didn't extract any text.


This sounds like a real problem and hurdle for North American (US/CAN in particular) invoice and receipt processing?


Where do you find this regarding "Where if Mistral classifies something as an image, it will replace the entire section with (image)[image_002)."?


themanmaran works at Omni, so presumably they have access to the actual resulting data from this study.


Wouldn't that just bias itself to the shape of the text extracted from the OCR against the shape of the raw text alone? It doesn't seem like it would be a great benchmark for estimating semantic accuracy?


Benchmarking is hard for markdown because of the slight formatting variations between different providers. With HTML, you can use something like TEDS (although there are issues with this, too), but with markdown, you don't have a great notion of structure, so you're left with edit distance.

I think blockwise edit distance is better than full page (find the ground truth blocks, then infer each block separately and compare), but many providers only do well on full pages, which doesn't make it fair.
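
As a rough illustration of the blockwise idea, here's a minimal sketch (not marker's actual benchmark code - it uses a stdlib similarity ratio as a stand-in for edit distance and skips the ordering score):

  from difflib import SequenceMatcher

  def blockwise_score(pred_blocks, truth_blocks):
      # Compare each predicted block to its ground truth block and average the
      # similarity ratios (1.0 = identical text, whitespace differences aside).
      scores = []
      for pred, truth in zip(pred_blocks, truth_blocks):
          a, b = " ".join(pred.split()), " ".join(truth.split())
          scores.append(SequenceMatcher(None, a, b).ratio())
      return sum(scores) / len(scores) if scores else 0.0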

There are a few different benchmark types in the marker repo:

  - Heuristic (edit distance by block with an ordering score)
  - LLM judging against a rubric
  - LLM win rate (compare two samples from different providers)
None of these are perfect, but LLM against a rubric has matched visual inspection the best so far.

I'll continue to iterate on the benchmarks. It may be possible to do a TEDS-like metric for markdown. Training a model on the output and then benchmarking could also be interesting, but it gets away from measuring pure extraction quality (the model benchmarking better is only somewhat correlated with better parse quality). I haven't seen any great benchmarking of markdown quality, even at research labs - it's an open problem.


You can use structured outputs, or something like my https://arthurcolle--dynamic-schema.modal.run/ to extract real data from unstructured text (like that produced by an LLM) to make benchmarks slightly easier if you have a schema.
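
For example, a minimal sketch of the schema-validation side with pydantic v2 (the schema here is made up, and this isn't tied to my API):

  from pydantic import BaseModel, ValidationError

  class Receipt(BaseModel):
      vendor: str
      total: float
      currency: str = "USD"

  def parse_llm_output(raw_json: str) -> Receipt | None:
      # Validate the model's raw JSON against the schema; reject anything malformed.
      try:
          return Receipt.model_validate_json(raw_json)
      except ValidationError:
          return None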


What is the project? It just returns a vanilla html page saying:

Dynamic Schema API API is running. See documentation for available endpoints.


It's just a FastAPI app with endpoints that I developed and deployed before OpenAI released structured outputs. It uses a custom grammar to enforce a pydantic-like schema for Chain of Thought rollouts / structured data extraction from unstructured text. I also use it for a video transcription knowledge base generation API.

https://arthurcolle--dynamic-schema.modal.run/docs


Thank you for your work on Marker. It is the best OCR for PDFs I’ve found. The markdown conversion can get wonky with tables, but it still does better than anything else I’ve tried


Thanks for sharing! I'm training some models now that will hopefully improve this and more :)


LLM as a judge?

Isn't that a potential issue? You are assuming the LLM judge is reliable. What evidence do you have to assure yourself and/or others that this is a reasonable assumption?


Perhaps they already evaluated their LLM judge model (with another LLM)


This is awesome. Have you seen / heard of any benchmarks where the data is actually a structured JSON vs. markdown?


Thanks for the tip. Marker solved a table conversion without LLM that docling wasn't able to solve.


Really interesting benchmark, thanks for sharing! It's good to see some real-world comparisons. The hallucinations issue is definitely a key concern with LLM-based OCR, and it's important to quantify that risk. Looking forward to seeing the full benchmark results.


>Mistral OCR is an impressive model, but OCR is a hard problem, and there is a significant risk of hallucinations/missing text with LLMs.

To fight hallucinations, can't we use more LLMs and pick blocks where the majority of LLMs agree?


Why wouldn't hallucinations be agreed upon if they have roughly the same training data?


A hallucination is often an indication that the model doesn't know something. Then, the internal signal gets dominated by noise from the seeded training weights. Efforts to eliminate hallucinations with a single model have found success by asking the same question in different ways and only taking answers that agree. Logically, you could get more durable results from multiple models on the same prompt.
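
A minimal sketch of that voting idea, applied per extracted block (the threshold and grouping heuristic are arbitrary choices here, not something any particular OCR tool does):

  from collections import defaultdict
  from difflib import SequenceMatcher

  def consensus(candidates, threshold=0.9):
      # Group near-identical candidate transcriptions, then return one from the
      # largest group if it holds a strict majority; otherwise return None so
      # the block can be flagged for review.
      if not candidates:
          return None
      groups = defaultdict(list)
      for text in candidates:
          for key in groups:
              if SequenceMatcher(None, key, text).ratio() >= threshold:
                  groups[key].append(text)
                  break
          else:
              groups[text].append(text)
      best = max(groups.values(), key=len)
      return best[0] if 2 * len(best) > len(candidates) else None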


We had this article the other day[1] about how multiple LLMs can hallucinate about the same thing, so this is not guaranteed to remove hallucinations that are caused by poor or insufficient training data.

[1] https://news.ycombinator.com/item?id=43222027


I don't see why any of that makes logical sense. These models require such enormous training data that they pretty much MUST use the same training data to a very large degree. The training data itself is what they spit out. So "hallucinations" are just the training data you get out, which is the entire point of the models in the first place. There is no difference between a hallucination and a correct answer from the perspective of the math.


Isn't it just statistical word pattern prediction based on training data? These models likely don't "know" anything anyway and cannot verify "truth" and facts. Reasoning attempts seem to me basically just like looping until the model finds a self-satisfying equilibrium state with different output.

In that way, LLMs are more human than, say, a database or a book containing agreed-upon factual information which can be directly queried on demand.

Imagine if there was just ONE human with human limitations on the entire planet who was taught everything for a long time - how reliable do you think they are with information retrieval? Even highly trained individuals (e.g. professors) can get stuff wrong on their specific topics at times. But this is not what we expect and demand from computers.


I like the licensing options! Hopefully they make enough money to fund development.


I'm a fan of the team of Allen AI and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.

Throughput - they benchmarked marker API cost vs local inference cost for olmocr. In our testing, marker locally gets 20 - 120 pages per second on an H100 (without custom kernels, etc). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.

Accuracy - their quality benchmarks are based on win rate with only 75 samples - which are different between each tool pair. The samples were filtered down from a set of ~2000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and LLM as a judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).

Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and llm ratings here - https://huggingface.co/datasets/datalab-to/marker_benchmark_... .

You can see all benchmark code at https://github.com/VikParuchuri/marker/tree/master/benchmark... .

Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.


Are you also a fan of the Dallas Cowboys?


Marker (https://www.github.com/VikParuchuri/marker) works kind of like this. It uses a layout model to identify blocks and processes each one separately. The internal format is a tree of blocks, which have arbitrary fields, but can all render to html. It can write out to json, html, or markdown.

I integrated gemini recently to improve accuracy in certain blocks like tables (get initial text, then pass to gemini to refine). Marker alone works about as well as gemini alone, but together they benchmark much better.
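
To make the "tree of blocks" idea concrete, here's a stripped-down sketch (not marker's actual classes, just the general shape):

  from dataclasses import dataclass, field

  @dataclass
  class Block:
      block_type: str                       # e.g. "page", "text", "table"
      text: str = ""
      children: list = field(default_factory=list)

      def to_markdown(self) -> str:
          # Render this block's own text, then its children, depth-first.
          rendered = [self.text] + [child.to_markdown() for child in self.children]
          return "\n\n".join(part for part in rendered if part)

  page = Block("page", children=[
      Block("text", "Intro paragraph."),
      Block("table", "| a | b |\n|---|---|\n| 1 | 2 |"),
  ])
  print(page.to_markdown())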


I used sxml [0] unironically in this project extensively.

The rendering step for reports that humans got to see was a call to pandoc after the sxml was rendered to markdown - look ma we support powerpoint! - but it also allowed us to easily convert to whatever insane markup a given large (or small) language model worked best with on the fly.

[0] https://en.wikipedia.org/wiki/SXML


Why process separately? If there are ink smudges, photocopier glitches, etc., wouldn't it guess some stuff better from richer context, like acronyms in rows used across the other tables?


It's funny you astroturf your own project in a thread where another is presenting tangential info about their own


What does marker add on top of docling?


Docling is a great project, happy to see more people building in the space.

Marker output will be higher quality than docling output across most doc types, especially with the --use_llm flag. A few specific things we do differently:

  - We have hybrid mode with gemini that merges tables across pages, improves quality on forms, etc.
  - We run an ordering model, so ordering is better for docs where the PDF order is bad
  - OCR is a lot better, we train our own model, surya - https://github.com/VikParuchuri/surya
  - References and links
  - Better equation conversion (soon including inline)


Hey, I'm the author of marker - thanks for sharing. Most of the processing time is model inference right now. I've been retraining some models lately onto new architectures to improve speed (layout, tables, LaTeX OCR).

We recently integrated gemini flash (via the --use_llm flag), which maybe moves us towards the "hybrid system" you mentioned. Hoping to add support for other APIs soon, but focusing on improving quality/speed now.

Happy to chat if anyone wants to talk about the difficulties of parsing PDFs, or has feedback - email in profile.


Very cool, any plans for a dockerized API of marker similar to what Unstructured released? I know you have a very attractively priced serverless offering (https://www.datalab.to) but having something to develop against locally would be great (for those of us not in the Python world).


It's on the list to build - been focusing on quality pretty heavily lately.


Datalab | NYC | Full-time | Software Engineer and Head of Business Ops | $250k-$350k + 1.5-3% equity | https://www.datalab.to

A significant % of useful data is locked away in tough-to-parse formats like PDFs. We build tools to extract it, like https://github.com/VikParuchuri/surya (15k Github stars), and https://github.com/VikParuchuri/marker (19k stars). We also run an inference API and product.

We do meaningful research (we’ve trained several SoTA models), ship product, and contribute to open source. We’re hiring for 2 roles to help us scale:

Senior fullstack software engineer

  - work across our open source repos, inference api, and frontend product
  - interact with our user community
  - you’ll need to be pragmatic, and embrace “boring” technology.  Our stack is fastapi, pytorch, htmx, postgres, and redis.  We deploy to render, and do inference with serverless gpus.
  - requires having built something impressive, ideally an open source project
Head of business operations

  - first non-technical hire
  - work across multiple areas, including finance, hiring, and sales
  - you’ll need to be extremely organized and able to get a lot done
  - requires experience leading operations at an early stage
Email careers@datalab.to if you’re interested - include a link to something you’ve built if possible. You can also read more here - https://datalab-to.notion.site/careers-jan .


Hi, I'm the author of surya (https://github.com/VikParuchuri/surya) - working on improving speed and accuracy now. Happy to collaborate if you have specific page types it's not working on. For modern/clean documents it benchmarks very similarly to Google Cloud, but working on supporting older documents better now.


Hello Vik, and thanks for your work on Surya. I really liked it once I found it, but my main issue now is the latency and hardware requirements, as accuracy could be fixed over time for different page types.

For example, I'm deploying tahweel to one of my webapps to allow a limited number of users to run OCR on PDF files. I'm using a small CPU machine for this; deploying Surya will not be the same, and I think you are facing similar issues at https://www.datalab.to.


It seems to struggle with German text a lot (umlauts etc)


Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.

