Show HN: HTML-to-Markdown – convert entire websites to Markdown with Golang/CLI

357 points by JohannesKauf 9 November 2024 | 47 comments

Comments

miki123211 9 November 2024

If you need this sort of thing in any other language, there's a free, no-auth, no-api-key-required, no-strings-attached API that can do this at https://jina.ai/reader/

You just fetch a URL like `https://r.jina.ai/https://www.asimov.press/p/mitochondria`, and get a markdown document for the "inner" URL.

I've actually used this, and it's not perfect: there are websites (mostly those behind Cloudflare and other such proxies) that it can't handle. But it does 90% of the job, and it's a one-liner in most languages with a decent HTTP requests library.
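For illustration, the one-liner might look like this in Python. This is only a minimal sketch: `reader_url` and `fetch_markdown` are hypothetical helper names, the prefix is copied from the example URL above, and the UTF-8 decoding and 30-second timeout are assumptions rather than anything the service documents here.

```python
import urllib.request

# Prefix taken from the example URL in the comment above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Build the reader URL by prepending the prefix to the full target URL."""
    return READER_PREFIX + target


def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch the Markdown rendering of `target` (performs a network request).

    Assumption: the response body is UTF-8 text.
    """
    with urllib.request.urlopen(reader_url(target), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_markdown("https://www.asimov.press/p/mitochondria")` would then request the same URL as in the example above. As noted, sites behind Cloudflare and similar proxies may still fail, so real code would want error handling around the request.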

NotACracker 9 November 2024

Pandoc

http://www.cantoni.org/2019/01/27/converting-html-markdown-u...

plaidwombat 9 November 2024

Great work. I thank you for it. I've used your library for a few years in a Lambda function which takes a URL and converts it to Markdown for storage in S3. I hooked it into every "bookmark" app I use as a webhook so I save a Markdown copy of everything I bookmark, which makes it very handy for importing into Obsidian.

rty32 9 November 2024

Nice! And glad to see it's MIT licensed.

I wonder if it is feasible to use this as a replacement for p2k, instapaper etc for the purpose of reading on Kindle. One annoyance with these services is that the rendering is way off -- h elements not showing up as headers, elements missing randomly, source code incorrectly rendered in 10 different ways. Some are better than others, but generally they are disappointing. (Yet they expect you to pay a subscription fee.) If this is an actively maintained project that welcomes contributions, I could test it out with various articles and report/fix issues. Although I wonder how much work it will take to handle the edge cases of all the websites out there.

cpursley 9 November 2024

This is really nice, especially for feeding LLMs web page data (they generally understand markdown well).

I built something similar for the Elixir world but it’s much more limited (I might borrow some of your ideas):

https://github.com/agoodway/html2markdown

throwup238 9 November 2024

This is probably out of scope for your tool, but it’d be nice to have built-in n-gram deduplication, where the tool strips any identical content from the header and footer, like navigation, when pointed at a few of these markdown files.
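A rough sketch of the idea, in Python. Note that this is a simpler stand-in for n-gram deduplication: it only compares whole lines and only strips the leading and trailing lines that every page shares. The function names are hypothetical, and nothing here reflects how html-to-markdown works.

```python
from typing import List


def _shared_prefix_len(docs: List[List[str]]) -> int:
    """Count the leading lines that are identical in every document."""
    n = 0
    for lines in zip(*docs):  # stops at the shortest document
        if len(set(lines)) != 1:
            break
        n += 1
    return n


def strip_shared_boilerplate(pages: List[str]) -> List[str]:
    """Remove the lines that every page shares at its start and at its end."""
    if len(pages) < 2:
        return list(pages)  # nothing to compare against
    docs = [p.splitlines() for p in pages]
    head = _shared_prefix_len(docs)
    # Cut the shared header first so header and footer can never overlap.
    docs = [d[head:] for d in docs]
    tail = _shared_prefix_len([d[::-1] for d in docs])
    return ["\n".join(d[: len(d) - tail]) for d in docs]
```

Given two pages that both start with the same navigation line and end with the same copyright line, only the differing middle sections are returned. Boilerplate that varies slightly between pages (an "active" menu item, for example) would slip through, which is where a fuzzier n-gram comparison would help.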

jot 9 November 2024

This is great!

If you also want to grab an accurate screenshot with the markdown of a webpage you can get both with Urlbox.

We have a couple of free tools that use this feature:

https://screenshotof.com https://url2text.com

paradite 9 November 2024

I have been using these two:

https://farnots.github.io/RedditToMarkdown/

https://urltomarkdown.com/

Incredibly useful for leveraging LLMs and building AI apps.

ssousa666 9 November 2024

I have been looking for a similar lib to use in a Kotlin/Spring app - any recommendations? My specific use-case does not need to support sanitizing during the HTML -> MD conversion, as the HTML doc strings that I will be converting are sanitized during the extraction phase (using JSoup).

sureglymop 9 November 2024

Reminds me of Aaron Swartz' html2text that I think serves the same purpose: http://www.aaronsw.com/2002/html2text/

juliuskiesian 9 November 2024

One of the pain points of using this kind of tool is handling syntax-highlighted code blocks. How does html-to-markdown perform in such scenarios?
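To make the pain point concrete: highlighters typically wrap each token in its own `<span>`, so a converter has to flatten those spans back into plain text and recover the language name. Below is a minimal sketch using only Python's standard library. It is not html-to-markdown's approach; the class and function names are hypothetical, and it assumes the common `class="language-xyz"` convention on the `<code>` element.

```python
from html.parser import HTMLParser

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example


class PreToFence(HTMLParser):
    """Collect the text of the first <pre> block, ignoring highlighter spans."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # turns &lt; into < etc.
        self.in_pre = False
        self.done = False
        self.lang = ""
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self.done:
            self.in_pre = True
        elif tag == "code" and self.in_pre:
            # Assumed convention: <code class="language-xyz">
            for cls in (dict(attrs).get("class") or "").split():
                if cls.startswith("language-"):
                    self.lang = cls[len("language-"):]

    def handle_endtag(self, tag):
        if tag == "pre" and self.in_pre:
            self.in_pre = False
            self.done = True

    def handle_data(self, data):
        # Span tags are simply skipped; only their text content is kept.
        if self.in_pre:
            self.parts.append(data)


def highlighted_html_to_fence(html: str) -> str:
    """Return a fenced Markdown code block for the first <pre> in `html`."""
    parser = PreToFence()
    parser.feed(html)
    parser.close()
    code = "".join(parser.parts).strip("\n")
    return f"{FENCE}{parser.lang}\n{code}\n{FENCE}"
```

Highlighters that render line numbers in table cells, or that split lines into separate `<div>`s, would need extra handling beyond this sketch.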

vergessenmir 10 November 2024

Is there a plugin to convert the AST to JSON, similar to the mistune package in Python? I'm using this as part of a RAG ingestion pipeline, and working with a Markdown AST provides a flatter structure than raw HTML.

Savageman 9 November 2024

I remember a long time ago I used Pandoc for this.

Fresh tools and more choice are very welcome, thanks for your work!

lollobomb 9 November 2024

This is nice. I tried a plugin for pandoc in the past, but it didn't really work well.

oezi 9 November 2024

Does it also include logic to download JS-driven sites properly or is this out of scope?

inhumantsar 9 November 2024

I've made some modest contributions to Mozilla's Readability library and didn't see anything like their heuristics in this.

Are you using a separate library for that, or did I miss something in this?

yayoohooyahoo 9 November 2024

Turndown works quite well too: https://github.com/mixmark-io/turndown

hello_computer 9 November 2024

This is honorable work. Thank you.

linhns 10 November 2024

Very neat tool. Well done!

lakomen 10 November 2024

Why?