If you need this sort of thing in any other language, there's a free, no-auth, no-api-key-required, no-strings-attached API that can do this at https://jina.ai/reader/
I've actually used this, and it's not perfect: there are websites (mostly those behind Cloudflare and other such proxies) that it can't handle. But it does 90% of the job, and it's a one-liner in most languages with a decent HTTP requests library.
Great work. I thank you for it. I've used your library for a few years in a Lambda function which takes a URL and converts it to Markdown for storage in S3. I hooked it into every "bookmark" app I use as a webhook so I save a Markdown copy of everything I bookmark, which makes it very handy for importing into Obsidian.
I wonder if it is feasible to use this as a replacement for p2k, Instapaper, etc. for the purpose of reading on Kindle. One annoyance with these services is that the rendering is way off -- h elements not showing up as headers, elements missing randomly, source code incorrectly rendered in 10 different ways. Some are better than others, but generally they are disappointing. (Yet they expect you to pay a subscription fee.) If this is an actively maintained project that welcomes contributions, I could test it out with various articles and report/fix issues. Although I wonder how much work it would take to handle the edge cases of all the websites out there.
This is probably out of scope for your tool, but it’d be nice to have built-in n-gram deduplication, where the tool strips any identical content from the header and footer (like navigation) when pointed at a few of these Markdown files.
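To make the idea concrete, here is a rough sketch of what such a post-processing step could look like. It is not part of the tool, and it swaps true n-gram matching for a simpler line-based comparison: it drops the leading and trailing lines that are identical across every document. The function name `strip_shared_edges` is made up for illustration.

```python
def strip_shared_edges(docs: list[str]) -> list[str]:
    """Remove leading/trailing lines that are identical in every document.

    Hypothetical helper: a line-based simplification of header/footer
    deduplication, not real n-gram matching.
    """
    split = [d.splitlines() for d in docs]
    if len(split) < 2:
        # Nothing to compare against, so leave the input untouched.
        return list(docs)

    shortest = min(len(lines) for lines in split)

    # Count how many leading lines are shared by all documents.
    head = 0
    while head < shortest and all(
        lines[head] == split[0][head] for lines in split
    ):
        head += 1

    # Count shared trailing lines, without overlapping the shared head.
    tail = 0
    while tail < shortest - head and all(
        lines[-1 - tail] == split[0][-1 - tail] for lines in split
    ):
        tail += 1

    return ["\n".join(lines[head:len(lines) - tail]) for lines in split]
```

This only catches boilerplate that matches line for line at the very top and bottom. Navigation that varies per page (for example, a highlighted "current" link) would need fuzzier matching.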
I have been looking for a similar lib to use in a Kotlin/Spring app - any recommendations? My specific use-case does not need to support sanitizing during the HTML -> MD conversion, as the HTML doc strings that I will be converting are sanitized during the extraction phase (using JSoup).
Is there a plugin to convert the AST to JSON, similar to the mistune package in Python? I'm using this as part of a RAG ingestion pipeline, and working with a Markdown AST provides a flatter structure than raw HTML.
Show HN: HTML-to-Markdown – convert entire websites to Markdown with Golang/CLI
(github.com) 357 points by JohannesKauf, 9 November 2024 | 47 comments
Comments
You just fetch a URL like `https://r.jina.ai/https://www.asimov.press/p/mitochondria`, and get a markdown document for the "inner" URL.
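For instance, a minimal Python sketch using only the standard library might look like the following. The `https://r.jina.ai/` prefix comes from the example above; the function names, the User-Agent string, the timeout, and the assumption that the response is UTF-8 text are illustrative choices, not anything documented here.

```python
from urllib.request import Request, urlopen

# Prefix taken from the example URL above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url: str) -> str:
    """Build the reader URL by prepending the prefix to the "inner" URL."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url: str, timeout: float = 30.0) -> str:
    """Fetch the Markdown rendering of target_url.

    Assumptions: the service answers a plain GET without auth (as the
    comment above says) and returns UTF-8 text.
    """
    req = Request(
        reader_url(target_url),
        headers={"User-Agent": "reader-example/0.1"},  # arbitrary UA string
    )
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `fetch_markdown("https://www.asimov.press/p/mitochondria")` should then return the Markdown text, subject to the caveats about sites it can't handle.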
http://www.cantoni.org/2019/01/27/converting-html-markdown-u...
I built something similar for the Elixir world but it’s much more limited (I might borrow some of your ideas):
https://github.com/agoodway/html2markdown
If you also want to grab an accurate screenshot with the markdown of a webpage you can get both with Urlbox.
We have a couple of free tools that use this feature:
https://screenshotof.com https://url2text.com
https://farnots.github.io/RedditToMarkdown/
https://urltomarkdown.com/
Incredibly useful for leveraging LLMs and building AI apps.
Fresh tools and more choice is very welcome, thanks for your work!
Are you using a separate library for that, or did I miss something in this?