From text to token: How tokenization pipelines work

(paradedb.com)

Comments

heikkilevanto 12 December 2025
Good explanation on tokenizing English text for regular search. But it is far from universal, and will not work well in Finnish, for example.

Folding diacritics makes "vähä" (little) into "vaha" (wax).
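
This collision can be reproduced with standard Unicode machinery: NFD decomposition splits "ä" into "a" plus a combining umlaut, and diacritic folding then drops the combining mark. A minimal sketch (function name is mine):

```python
import unicodedata

def fold_diacritics(word):
    # NFD splits each accented letter into a base letter plus
    # combining marks; we keep only the non-combining characters.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(fold_diacritics("vähä"))  # 'vaha' — now colliding with the word for wax
```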

Dropping stop words like "the" misses the word for "tea" (in rather old-fashioned Finnish, but also in current Danish).

Stemming Finnish words is also much more complex, as we tend to append suffixes to words instead of putting small words in front of them. "talo" is "house", "talosta" is "from the house", "talostani" is "from my house", and "talostaniko" makes it a question: "from my house?"
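
The suffix stacking above can be sketched as repeated stripping. This is a toy illustration only, not a real Finnish stemmer: the suffix list covers just this one example, and genuine Finnish morphology needs far more than suffix chopping.

```python
# Toy suffix stripper for the example in the comment above.
# "ko" = question particle, "ni" = "my", "sta" = "from the".
SUFFIXES = ["ko", "ni", "sta"]

def strip_suffixes(word):
    steps = [word]
    changed = True
    while changed:
        changed = False
        for suf in SUFFIXES:
            # Keep a minimal stem so we don't strip the word to nothing.
            if word.endswith(suf) and len(word) > len(suf) + 2:
                word = word[: -len(suf)]
                steps.append(word)
                changed = True
                break
    return steps

print(strip_suffixes("talostaniko"))
# ['talostaniko', 'talostani', 'talosta', 'talo']
```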

If that sounds too easy, consider Japanese. From what little I know they don't use whitespace to separate words, mix two phonetic alphabets with Chinese ideograms, etc.

wongarsu 12 December 2025
Notably, this is tokenization for traditional search. LLMs use very different tokenization, with very different goals.
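
A rough contrast between the two (the stop-word set and subword vocabulary here are invented for the example): a search pipeline normalizes and drops whole words, while LLM tokenizers split text into subword pieces by greedy longest-match against a learned vocabulary.

```python
def search_tokenize(text, stopwords={"the", "a", "on"}):
    # Search-style: lowercase, split on whitespace, drop stop words.
    return [w for w in text.lower().split() if w not in stopwords]

def subword_tokenize(text, vocab):
    # LLM-style: greedy longest-match subword segmentation,
    # falling back to single characters not in the vocabulary.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

vocab = {"token", "iz", "ation", "s"}
print(search_tokenize("the tokenizations"))      # ['tokenizations']
print(subword_tokenize("tokenizations", vocab))  # ['token', 'iz', 'ation', 's']
```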
6r17 23 hours ago
I'm wondering if the English stop words are the remnants of a declension system that was lost from the language. OK, so I had to check this out, though I didn't have time to check with more than Gemini. Apparently, the word "the" is basically the sole survivor of a massive, complex table of declensions. In Old English you could not just say "the": you had to choose the correct word based on gender, case, and number, exactly as you do in Polish today with ten, ta, to, tego, temu, tej, etc.

The Old English "the" (definite article), with rough Polish parallels in parentheses:

Case         | Masculine (ten) | Neuter (to) | Feminine (ta) | Plural (te)
Nominative   | Se              | Þæt         | Sēo           | Þā
Accusative   | Þone            | Þæt         | Þā            | Þā
Genitive     | Þæs             | Þæs         | Þære          | Þāra
Dative       | Þæm             | Þæm         | Þære          | Þæm
Instrumental | Þy              | Þy          | —             | —

I have read somewhere that Polish is actually a more precise language for use with AI. I'm wondering if the idea of dropping words that apparently carry no meaning isn't actually hurting more, as the article itself notes.

So I'm left to wonder at this point: wouldn't it be worth exploring a terser version of the language that might bridge that gap? Completely exploratory, though; I don't even know whether it would be a helpful idea or just a toy.

gortok 12 December 2025
My biggest complaints about search come from day-to-day uses:

I use search in my email pretty heavily, and I'm most interested in specific words in the email, and in when those emails are from specific people or a specific domain. But the mobile version of Gmail produces different results than the mobile Outlook app, which produces different results than the desktop version of Gmail, and all of them are pretty terrible at searching email.

I have a hard time getting them to pull up emails in search that I know exist, that I know contain certain words, and that I know have certain email addresses in the body.

I recognize that a generalized search mechanism is going to get domain-specific nuances wrong, but is it really so hard to make a search engine that works on email and email-based attachments that no one cares enough to try?

flakiness 23 hours ago
Oh, it's good old tokenization vs. LLM tokenizers like SentencePiece or tiktoken. We shouldn't forget there are simple non-ML approaches like this one, which don't ask you to buy more GPUs.
nawazgafar 23 hours ago
You beat me to the punch. I wrote a blog post[1] with the exact same title last week! Though, I went into a bit more detail with regard to embedding layers, so maybe my title is not accurate.

1. https://gafar.org/blog/text-to-tokens

semicognitive 12 December 2025
ParadeDB is a great team; highly recommend using them.
the_arun 12 December 2025
Just curious: if we remove stop words from prompts before sending them to an LLM, wouldn't that reduce the token count? And would it keep the LLM's response the same (original vs. without stop words)?
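
On the first half of the question, a rough word-level sketch is easy (the stop-word set and prompt are invented for the example). Real LLM tokenizers count subword tokens rather than words, so the exact savings differ, but each stop word does cost at least one token. Whether the response stays the same is another matter, since the model conditions on those words too.

```python
# Word-level count before and after stop-word removal.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "in", "on"}

prompt = "what is the capital of the country to the north of France"
kept = [w for w in prompt.split() if w not in STOPWORDS]

print(len(prompt.split()), len(kept))  # 12 5
print(" ".join(kept))                  # 'what capital country north France'
```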
zk0 1 hour ago
tl;dr with a match statement