Show HN: Benchmarking VLMs vs. Traditional OCR Hackernews Viewer

Show HN: Benchmarking VLMs vs. Traditional OCR

146 points by themanmaran 20 February 2025 | 40 comments

Comments

simonw 24 February 2025

The benchmark I most want to see around OCR is one that covers risks from accidental (or deliberate) prompt injection - I want to know how likely it is that a model might OCR a page and then accidentally act on instructions in that content rather than straight transcribing it as text.

I'm interested in the same thing for audio transcription too, for models like Gemini or GPT-4o audio accepting audio input.

blindriver 24 February 2025

As far as I'm concerned, all of these specialty services are dead compared to a generalized LLM like OpenAI or Gemini.

I wrote in a previous post about how NLP services were dead because of LLMs and obviously people in NLP took great offense to that. But I was able to use the NLP abilities of an LLM without needing to know anything about the intricacies of NLP or any APIs and it worked great. This post on OCR pretty much shows exactly what I meant. Gemini does OCR almost as good as OmniAI (granted I've never heard of it), but at 1/10th the cost. OpenAI will only get better very quickly. Kudos to OmniAI for releasing honest data, though.

Sure you might get an additional 5% accuracy from OmniAI vs Gemini but a generalized LLM can do so much more than just OCR. I've been playing with OpenAI this entire weekend and literally the sky's the limit. Not only can you OCR images, you can ask the LLM to summarize it, transform it into HTML, classify it, give a rating based on whatever parameters you want, get a lexile score, all in a single API call. Plus it will even spit out the code to do all of the above for you to use as well. And if it doesn't do what you need it to do right now, it will pretty soon.

I think the future of AI is going to be pretty bleak for everyone except the extremely big players that can afford to invest hundreds of billions of dollars. I also think there's going to be a real battle of copyright in less than 5 years which will also favor the big rich players as well.

jasonjmcghee 23 February 2025

A big takeaway for me is that Gemini Flash 2.0 is a great solution to OCR, considering accessibility, cost, accuracy, and speed.

It also has a 1M token context window, though from personal experience it seems to work better the smaller the context window is.

Seems like Google models have been slowly improving. It wasn't so long ago I completely dismissed them.

gbertb 24 February 2025

How does this compare to Marker https://github.com/VikParuchuri/marker?

bn-l 23 February 2025

What is the privacy of the documents for the cloud service? There’s nothing in the privacy policy about data sent over the api.

EarlyOom 21 February 2025

OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...) not to mention handwriting and poorly scanned docs. Then there's contextually dependent information like X-axis labels that are implicit from a legend somewhere, so its not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract text then use similar examples from the page to map them into their output values when the bounding box doesn't provide this for free.

betula_ai 21 February 2025

Thank you for sharing this. Some of the other public models that we can host ourselves may perform in practice better than the models listed - e.g. Qwen 2.5 VL https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file

alkh 24 February 2025

What is the best solution for recognizing handwritten text that combines multiple languages, especially in cases where certain letters look the same but represent different sounds? For example, the letter 'p' in English versus 'р' in Cyrillic languages, which sounds more like the English 'r'.

alok-g 24 February 2025

Looking at the sample documents, this seems more focused on tables and structured data extraction and not long-form texts. The ground truth JSON has so much less information than the original document image. I would love to see a similar benchmark for full contents including long-form text and tables.

banditelol 24 February 2025

Anyone have tried comparing with Qwen VL based model? I heard good things about its performance on ocr compared to other self hostable model, but haven't really tried benchmarking its performance

hasibzunair 27 February 2025

What kind of VLMs are being used in OmniAI?

I fine-tuned a Llama 3.2 Vision on a small dataset I created for extracting text without heavy cropping. Results are simply amazing in comparison with OCR-based approaches. It can be tried here: https://news.ycombinator.com/item?id=43192417

westurner 24 February 2025

Harmonic Loss converges more efficiently on MNIST OCR: https://github.com/KindXiaoming/grow-crystals .. "Harmonic Loss Trains Interpretable AI Models" (2025) https://news.ycombinator.com/item?id=42941954

danielcampos93 24 February 2025

GPT-4o as a judge to evaluate the quality of something which gpt4o is not inherently that good at. Red flag.

fzysingularity 20 February 2025

What VLMs do you use when you're listing OmniAI - is this mostly wrapping the model providers like your zerox repo?

jll29 24 February 2025

Does anyone have good experience with a particular pipeline for OCR-ing C source code?

default_ 27 February 2025

Wondering why Florence-2 is not on the list of models?