The benchmark I most want to see around OCR is one that covers risks from accidental (or deliberate) prompt injection - I want to know how likely it is that a model might OCR a page and then accidentally act on instructions in that content rather than straight transcribing it as text.
I'm interested in the same thing for audio transcription too, for models like Gemini or GPT-4o audio accepting audio input.
As far as I'm concerned, all of these specialty services are dead compared to a generalized LLM like OpenAI or Gemini.
I wrote in a previous post about how NLP services were dead because of LLMs and obviously people in NLP took great offense to that. But I was able to use the NLP abilities of an LLM without needing to know anything about the intricacies of NLP or any APIs and it worked great. This post on OCR pretty much shows exactly what I meant. Gemini does OCR almost as good as OmniAI (granted I've never heard of it), but at 1/10th the cost. OpenAI will only get better very quickly. Kudos to OmniAI for releasing honest data, though.
Sure you might get an additional 5% accuracy from OmniAI vs Gemini but a generalized LLM can do so much more than just OCR. I've been playing with OpenAI this entire weekend and literally the sky's the limit. Not only can you OCR images, you can ask the LLM to summarize it, transform it into HTML, classify it, give a rating based on whatever parameters you want, get a lexile score, all in a single API call. Plus it will even spit out the code to do all of the above for you to use as well. And if it doesn't do what you need it to do right now, it will pretty soon.
I think the future of AI is going to be pretty bleak for everyone except the extremely big players that can afford to invest hundreds of billions of dollars. I also think there's going to be a real battle of copyright in less than 5 years which will also favor the big rich players as well.
OCR seems to be mostly solved for 'normal' text laid out according to Latin alphabet norms (left to right, normal spacing etc.), but would love to see more adversarial examples. We've seen lots of regressions around faxed or scanned documents where the text boxes may be slightly rotated (e.g. https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...) not to mention handwriting and poorly scanned docs. Then there's contextually dependent information like X-axis labels that are implicit from a legend somewhere, so its not clear even with the bounding boxes what the numbers refer to. This is where VLMs really shine: they can extract text then use similar examples from the page to map them into their output values when the bounding box doesn't provide this for free.
What is the best solution for recognizing handwritten text that combines multiple languages, especially in cases where certain letters look the same but represent different sounds? For example, the letter 'p' in English versus 'р' in Cyrillic languages, which sounds more like the English 'r'.
Looking at the sample documents, this seems more focused on tables and structured data extraction and not long-form texts. The ground truth JSON has so much less information than the original document image. I would love to see a similar benchmark for full contents including long-form text and tables.
Anyone have tried comparing with Qwen VL based model?
I heard good things about its performance on ocr compared to other self hostable model, but haven't really tried benchmarking its performance
I fine-tuned a Llama 3.2 Vision on a small dataset I created for extracting text without heavy cropping. Results are simply amazing in comparison with OCR-based approaches. It can be tried here: https://news.ycombinator.com/item?id=43192417
Show HN: Benchmarking VLMs vs. Traditional OCR
(getomni.ai)146 points by themanmaran 20 February 2025 | 40 comments
Comments
I'm interested in the same thing for audio transcription too, for models like Gemini or GPT-4o audio accepting audio input.
I wrote in a previous post about how NLP services were dead because of LLMs and obviously people in NLP took great offense to that. But I was able to use the NLP abilities of an LLM without needing to know anything about the intricacies of NLP or any APIs and it worked great. This post on OCR pretty much shows exactly what I meant. Gemini does OCR almost as good as OmniAI (granted I've never heard of it), but at 1/10th the cost. OpenAI will only get better very quickly. Kudos to OmniAI for releasing honest data, though.
Sure you might get an additional 5% accuracy from OmniAI vs Gemini but a generalized LLM can do so much more than just OCR. I've been playing with OpenAI this entire weekend and literally the sky's the limit. Not only can you OCR images, you can ask the LLM to summarize it, transform it into HTML, classify it, give a rating based on whatever parameters you want, get a lexile score, all in a single API call. Plus it will even spit out the code to do all of the above for you to use as well. And if it doesn't do what you need it to do right now, it will pretty soon.
I think the future of AI is going to be pretty bleak for everyone except the extremely big players that can afford to invest hundreds of billions of dollars. I also think there's going to be a real battle of copyright in less than 5 years which will also favor the big rich players as well.
It also has a 1M token context window, though from personal experience it seems to work better the smaller the context window is.
Seems like Google models have been slowly improving. It wasn't so long ago I completely dismissed them.
I fine-tuned a Llama 3.2 Vision on a small dataset I created for extracting text without heavy cropping. Results are simply amazing in comparison with OCR-based approaches. It can be tried here: https://news.ycombinator.com/item?id=43192417