The big complicated segmentation pipeline is a legacy from the time you had to do that, a few years ago. It's error-prone, and even at its best it robs the model of valuable context. You need that context if you want to take the step to handwriting. If you go to a group of human experts to help you decipher historical handwriting, the first thing they will tell you is that they need the whole document for context, not just the line or word you're interested in.
We need to do end-to-end text recognition. Not "character recognition"; it's not the characters we care about. Evaluating models with CER is also a bad idea. It frustrates me so much that text recognition is remaking all the mistakes machine translation made 15+ years ago.
> OCR4all is a software which is primarily geared towards the digital text recovery and recognition of early modern prints, whose elaborate printing types and mostly uneven layout challenge the abilities of most standard text recognition software.
Looks like a great project, and I don't want to nitpick, but...
https://www.ocr4all.org/about/ocr4all
> Due to its comprehensible and intuitive handling OCR4all explicitly addresses the needs of non-technical users.
A little secret: Apple’s Vision framework has an absurdly fast text recognition library with accuracy that beats Tesseract. It consumes almost any image format you can think of, including PDFs.
> How is this different from tesseract and friends?
The workflow is for digitizing historical printed documents. Think conserving old announcements in blackletter typesetting, not extracting info from typewritten business documents.
I'm sorry. I suppose this is great, but an .exe file is designed for usability. A Docker container may be nice for techy people, but it is not "4all" this way. I do understand that the usability starts after you've gone through all the command-line parts, but those are just extra steps compared to other OCR programs, which work out of the box.
I think the current sweet-spot for speed/efficiency/accuracy is to use Tesseract in combination with an LLM to fix any errors and to improve formatting, as in my open source project which has been shared before as a Show HN:
This process also makes it extremely easy to tweak/customize simply by editing the English language prompt texts to prioritize aspects specific to your set of input documents.
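A minimal sketch of that kind of pipeline (the function names and prompt wording here are my own illustration, not the linked project's actual code): shell out to Tesseract for the raw pass, then build a plain-English correction prompt to send to whatever LLM you use. Customizing the pipeline then really is just editing the prompt text.

```python
import shutil
import subprocess

def ocr_with_tesseract(image_path: str) -> str:
    """Run Tesseract on an image and return the raw recognized text.
    Assumes the `tesseract` binary is on PATH."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    # Passing "stdout" as the output base makes tesseract print to stdout.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def build_correction_prompt(raw_ocr_text: str, priorities: str = "") -> str:
    """Build the plain-English prompt sent to the LLM for post-correction.
    `priorities` is where document-specific instructions go, which is what
    makes the approach easy to tweak by editing prompt text alone."""
    return (
        "Correct the OCR errors in the text below. Fix obvious character "
        "confusions (e.g. 'rn' vs 'm', '0' vs 'O'), rejoin hyphenated line "
        "breaks, and restore paragraph structure. Do not paraphrase.\n"
        + (f"Additional priorities: {priorities}\n" if priorities else "")
        + "\n---\n" + raw_ocr_text
    )
```

The corrected text then comes back from a single LLM call with that prompt; swapping in a different model or document type touches nothing but the prompt string.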
What is this? A new SOTA OCR engine (which would be very interesting to me) or just a tool that uses other known engines (which would be much less interesting to me).
A movement? A socio-political statement?
If only landing pages could be clearer about wtf it actually is ...
> OCR4all combines various open-source solutions to provide a fully automated workflow for automatic text recognition of historical printed (OCR) and handwritten (HTR) material.
OCR is all well and good; I thought it was mostly solved with Tesseract, so what does this bring? What I'm looking for is a reasonable library or usable implementation of MRC compression for the resulting PDFs. Nothing I have tried comes anywhere near the commercial offerings, which cost $$$$. It seems to be a tricky problem to solve: detecting and separating the layers of the image to compress separately, and then binding them back together into a compatible PDF.
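For intuition, here is a toy, pure-Python sketch of the layer-separation step being described (a global threshold; real MRC encoders use adaptive segmentation, then encode the binary mask with JBIG2/CCITT at full resolution and the two color layers lossily at lower resolution before binding them into the PDF):

```python
def split_mrc_layers(gray, threshold=128):
    """Toy MRC-style layer separation on a 2-D grayscale image
    (a list of rows of 0-255 ints).

    Returns (mask, foreground, background):
      mask:       1 where a pixel is "ink" (dark), 0 elsewhere
      foreground: ink pixel values, non-ink pixels zeroed out
      background: the page with ink pixels filled from a local
                  estimate of the paper color
    """
    mask, fg, bg = [], [], []
    for row in gray:
        # Simple global threshold; real separators adapt per region.
        m = [1 if px < threshold else 0 for px in row]
        # Background estimate: mean of the non-ink pixels in this row.
        paper = [px for px, mk in zip(row, m) if mk == 0]
        fill = sum(paper) // len(paper) if paper else 255
        mask.append(m)
        fg.append([px if mk else 0 for px, mk in zip(row, m)])
        bg.append([fill if mk else px for px, mk in zip(row, m)])
    return mask, fg, bg
```

The hard part the commercial tools get right is exactly what this glosses over: deciding per region what is text, line art, or photo, so that each layer compresses well without ringing artifacts around the glyphs.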
Wow. Setup took 12 GB of my disk. First impression: nice UI, but no idea what to do with it or how to create a project. Tells me "session expired" no matter what I try to do. Definitely not batteries-included kind of stuff, will need to explore later.
I've been looking for a project that would have an easy free/extremely cheap way to do OCR/image recognition for generating ALT text automatically for social media. Some sort of embedded implementation that looks at an image and is either able to transcribe the text, or (preferably) transcribe the text AND do some brief image recognition.
I generally do this manually with Claude and it's able to do it lightning fast, but a small dev making a third party Bluesky/Mastodon/etc client doesn't have the resources to pay for an AI API.
They lost me when they suggested I install docker.
Now, I wouldn't mind if they suggested that as an _option_ for people whose system might exhibit compatibility problems, but - come on! How lazy can you get? You can't be bothered to cater to anything other than your own development environment, which you want us to reproduce? Then maybe call yourself "OCR4me", not "OCR4all".
I don't wish to speak out of turn, but it doesn't look like this project has been active for about 1 year. I checked GitHub and the last update was in Feb 2024. Their last post to X was 25 OCT 2023. :(
This looks promising, though I'm not sure how it stacks up to Transkribus, which seems to be the leader in the space since it supports handwritten text and ML that is trainable on your own dataset.
I've been using Tesseract for a few years on a personal project, and I'd be interested to know how they compare in terms of system resources, given that I'm running it on a Dell OptiPlex Micro with 8 GB of RAM and a 6th-gen i5. Tesseract is barely noticeable, so it's just curiosity at this point; I don't have any reason to even consider switching. I do, however, have a large dataset of several hundred GBs of scanned PDFs which would be worth digitizing when I find some time to spare.
OCR4all
(ocr4all.org) | 434 points by LorenDB | 14 February 2025 | 124 comments
Comments
Looks like it's built on https://github.com/Calamari-OCR/calamari
https://www.ocr4all.org/guide/setup-guide/quickstart > Quickstart > Open a terminal of your choice and enter the following command if you're running Linux (followed by a 6 line docker command).
How is that addressing the needs of non-technical users?
I wrote a simple CLI tool and more featured Python wrapper for it: https://github.com/fny/swiftocr
It combines Tesseract (for images) and Poppler-utils (for PDFs). Local open-source LLMs extract document segments intelligently.
It can also easily be extended to use one or more vision LLM models.
And finally, it packages the entire AI agent API into a Docker container.
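As a rough sketch of that kind of dispatch (choosing the external tool by file type; the function names are illustrative, not the project's actual API):

```python
import os
import shutil
import subprocess

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

def pick_extractor(path: str) -> list[str]:
    """Choose the external tool for a file: Poppler's pdftotext for PDFs,
    Tesseract for raster images. Returns the command to run."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        # "-" sends the extracted text to stdout.
        return ["pdftotext", path, "-"]
    if ext in IMAGE_EXTS:
        return ["tesseract", path, "stdout"]
    raise ValueError(f"unsupported file type: {ext}")

def extract_text(path: str) -> str:
    """Run the chosen tool and return its text output."""
    cmd = pick_extractor(path)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} is not installed")
    return subprocess.run(
        cmd, capture_output=True, text=True, check=True
    ).stdout
```

The LLM-based segment extraction would then run over whatever text this dispatch step produces.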
Create complex OCR workflows through the UI without needing to interact with code or command-line interfaces.
[...] https://www.ocr4all.org/guide/setup-guide/windows
------------------
https://github.com/Dicklesworthstone/llm_aided_ocr
It seems to be based on OCR-D, which itself is based on
- https://github.com/tesseract-ocr/tesseract
- https://kraken.re/main/index.html
- https://github.com/ocropus-archive/DUP-ocropy
- https://github.com/Calamari-OCR/calamari
See
- https://ocr-d.de/en/models
It seems to be an open-source alternative to https://www.transkribus.org/ ( which uses amongst others https://atr.pages.teklia.com/pylaia/pylaia/ )
Another alternative is https://escriptorium.inria.fr/ ( which uses kraken)
(It looks like the project started in 2022, so maybe it wasn't obvious at the time.)