I see that you're looking for clusters within PCA projections -- you should look for deeper structure with hot new dimensionality reduction algorithms like PaCMAP or LocalMAP!
I've been working on a project related to a sensemaking tool called Pol.is [1], but reprojecting its wiki survey data with these new algorithms instead of PCA, and it's amazing what new insight they uncover!
A point of note is that the text embedding model used here is paraphrase-multilingual-MiniLM-L12-v2 (https://huggingface.co/sentence-transformers/paraphrase-mult...), which is about 4 years old. In the NLP world that's effectively ancient: thanks to global LLM improvements, even small embedding models have become dramatically more robust, both in information representation and in distinctiveness in the embedding space. Even modern text embedding models not explicitly trained for multilingual support still do extremely well on that type of data, so they may work better for the Voynich Manuscript, whose language is unknown.
The traditional NLP techniques of stripping suffixes and POS identification may actually harm embedding quality rather than improve it, since they remove relevant contextual data from the global embedding.
Does it make sense to check the process with a control group?
E.g. if we ask a human to write something that resembles a language but isn’t, then conduct this process (remove suffixes, attempt grouping, etc), are we likely to get similar results?
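A control run along these lines is cheap to sketch. Below is a minimal, stdlib-only illustration with toy data (not the author's SBERT pipeline): before trusting any metric on the manuscript, check that it at least separates a structured control text from pure gibberish.

```python
import math
import random
from collections import Counter

def unigram_entropy(words):
    """Shannon entropy (bits) of the word-frequency distribution."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def repetition_rate(words):
    """Fraction of tokens that repeat an earlier token."""
    return 1 - len(set(words)) / len(words)

random.seed(0)
# Structured control: a small vocabulary reused with skewed frequencies,
# the way real language reuses words. (Invented words, for illustration.)
structured = random.choices(["okeeo", "daiin", "chy", "aiin", "dy"],
                            weights=[10, 6, 3, 2, 1], k=500)
# Gibberish control: every token drawn fresh from random letters.
gibberish = ["".join(random.choices("abcdefgh", k=5)) for _ in range(500)]

# A useful metric should separate the two controls: structured text shows
# lower entropy and far higher word reuse than the gibberish does.
```

The same framing applies to the clustering step itself: run it on the gibberish control, and if it still produces tidy-looking clusters, the clusters alone prove little.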
I had a look at the manuscript for a while and found it suspicious how tightly packed the writing was against the illustrations on some pages. In common language, words and letters vary in width, so as you approach the end of a line you naturally break early and begin a new word on the next line to avoid overrunning. The manuscript is missing these kinds of breaks - in many places it looked like whatever letter might squeeze in had been written at the end of the line.
I wanted to do an analysis of what letters occur just before/after a line break to see if there is a difference from the rest of the text, but couldn't find a transcribed version.
My completely amateur take is that it's an elaborate piece of art or hoax.
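For what it's worth, machine-readable transcriptions do exist (e.g. the EVA transcriptions collected at voynich.nu), and the line-edge analysis described above is only a few lines once you have one. A sketch on stand-in text (the words below are made up, not real EVA):

```python
from collections import Counter

def line_edge_stats(lines):
    """Tally line-final letters separately from letters elsewhere."""
    line_final = Counter()
    overall = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue
        line_final[words[-1][-1]] += 1  # last letter of the last word
        for w in words:
            overall.update(w)
    return line_final, overall

# Stand-in lines; with a real transcription each string would be one
# manuscript line.
lines = ["okeeodair chy daiin", "qokedy shedy dal", "chol daiin okaiin"]
line_final, overall = line_edge_stats(lines)
# Compare line_final's distribution against overall's: a marked skew at
# line ends would support the "squeeze in whatever fits" hypothesis,
# while matching distributions would argue against it.
```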
I'm not familiar with SBERT, or with modern statistical NLP in general, but SBERT works on sentences, and there are no obvious sentence delimiters in the Voynich Manuscript (only word and paragraph delimiters). One concern I have is "Strips common suffixes from Voynich words". Words in the Voynich Manuscript appear to be prefix + suffix, and since the prefixes are quite short, stripping the suffixes loses roughly half the information before the analysis even begins.
You might want to verify that your method works for meaningful text in a natural language, and also for meaningless gibberish (encrypted text is somewhere in between, with simpler encryption methods closer to natural language and more complex ones closer to meaningless gibberish). Gordon Rugg, Torsten Timm, and I have produced text which closely resembles the Voynich Manuscript by different methods. Mine is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.h... and the equivalent EVA is here: https://fmjlang.co.uk/voynich/generated-voynich-manuscript.t...
Maybe I missed it in the README, but how did you do the initial encoding for the "words"? For example, if you have "okeeodair" as a word, how do you map that back to the original symbols?
The author made an assumption that Voynichese is a Germanic language, and it looks like he was able to make some progress with it.
I’ve also come across accounts that it might be an Uralic or Finno-Ugric language. I think your approach is great, and I wonder if tweaking it for specific language families could go even further.
The manuscript being from the 15th century, the obvious reason to encrypt text was to avoid religious persecution during the Inquisition (and other religiously motivated violence of that time). So it would be interesting to run the same NLP against the Gospels and look for correlations with that. You'd want to first do a 'word'-based comparison, and then a 'character'-based comparison - I mean, compare the graphs from the Bible to the graphs from the Voynich.
Also, there might be some characters that are in there just to confuse. For example, that bizarre capital "P"-like thing with multiple variations sometimes seems to appear far too often to represent real language, so it might be just an obfuscator that's removed prior to decryption. There may be other characters that are abnormally "frequent" and are maybe also unused dummy characters. But I realize the "too many Ps" problem is also consistent with pure fiction.
What I'd expect from a handwritten book like that, if it is just gibberish and not a cipher of any sort, is that the style, calligraphy, the words used, even the letters themselves should evolve from page 1 to the last page. Pages could have been reordered, of course, but it should still be noticeable.
Unless the author had written tens of books exactly like that before, which didn't survive, of course.
I don't think it's a very novel idea, but I wonder if there's any analysis of patterns like that. I haven't seen mentions of page-to-page consistency anywhere.
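The drift idea above is testable with nothing fancier than per-page character distributions. A stdlib sketch with toy pages (real input would be one string of transcription per manuscript page): if the hand or vocabulary evolved over the writing, Jensen-Shannon divergence between pages should tend to grow with page distance.

```python
import math
from collections import Counter

def char_dist(page):
    """Normalised character frequencies of one page's text."""
    counts = Counter(page.replace(" ", ""))
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (bits) between two distributions."""
    keys = set(p) | set(q)
    def kl(a, b):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / b[k])
                   for k in keys if a.get(k, 0) > 0)
    m = {k: (p.get(k, 0) + q.get(k, 0)) / 2 for k in keys}
    return (kl(p, m) + kl(q, m)) / 2

# Toy "manuscript" whose style drifts: later pages favour new letters.
pages = ["ababab cdcd", "abab cdcdcd", "cdcd efefef", "efef ghghgh"]
dists = [char_dist(p) for p in pages]
adjacent = js_divergence(dists[0], dists[1])
distant = js_divergence(dists[0], dists[3])
# With drift, far-apart pages diverge more than neighbouring ones.
```

Plotting divergence against page distance for every page pair would make any drift (or suspicious uniformity) visible at a glance.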
> Traditional analyses often fall into two camps: statistical entropy checks or wild guesswork.
I'd argue that these are just the camps that non-traditional, amateur analysis efforts fall into. I've only briefly skimmed Voynich work, but my impression is that, traditionally, more academic analyses rely on a combination of linguistic and cryptological analysis. This does happen to be informed by some statistical analysis, but goes way beyond that.
For example, as I recall the strongest argument that Voynichese probably isn't just an alternative alphabet for a well-known language relies on comparing Voynichese to the general patterns for how writing systems map symbols to sounds. That permits the development of more specific hypotheses about how it could possibly function, including how likely it is to be an alphabet or abjad, which characters could plausibly represent more than one sound, possible digraphs, etc. All of that work casts severe doubt on the likelihood of it representing a language from the area, because it just can't plausibly represent a language with the kinds of phonological inventories we see in the language families that existed in that place and time.
There's also been some pretty interesting work on identifying individual scribes based on a confluence of factors including, but not limited to, analysis of the text itself. Some of the inferred scribes exclusively wrote in the A language (oh yeah, Voynichese seems to contain two distinct "languages"), some exclusively wrote in the B language, I think they've even hypothesized that there's one who actually used both languages.
There isn't a lot of popular awareness of this work because it's not terribly sexy to anyone but a linguistics nerd. But I'd guess that any attempt to poke at the Voynich manuscript that isn't informed by it is operating at a severe disadvantage. You want to be standing on the shoulders of the tallest giants, not the ones with the best social media presence.
Would analysis of a similar body of text in a known language yield similar patterns? Put another way, could using this type of analysis on different types of text help us understand what this script describes?
Really cool work here. Have you considered applying these same techniques to the Rohonc Codex? As far as I know, it's the only other book similar to the Voynich Manuscript.
Although I skimmed the methodology out of curiosity, what really drew my eye was the transcription of the manuscript in the repository. This led me down a rabbit hole ending here [1], about historic efforts to transcribe or transliterate the manuscript.
How expensive would a "brute force" approach to decoding it be? I mean, how about mapping each unknown word to a known word in a known language and improving this mapping until a 'high score' is reached?
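On cost: the space of word-level mappings is astronomically large (every assignment of the Voynich vocabulary onto a target vocabulary), so literal brute force is out; heuristic search is the usual workaround. A toy sketch of the "improve the mapping until the score rises" loop, where a made-up four-word bigram table stands in for real known-language statistics and the cipher words are invented:

```python
import random

# Tiny stand-in for a known language's statistics (made up; higher = better).
BIGRAM = {("the", "herb"): 3.0, ("herb", "cures"): 2.5,
          ("cures", "the"): 2.0, ("the", "fever"): 2.5}
ENGLISH = ["the", "herb", "cures", "fever"]
CIPHER = ["okeo", "daiin", "chedy", "qokal", "okeo", "shdy"]  # invented words

def score(mapping, text):
    """Plausibility of the decoded text under the bigram table."""
    decoded = [mapping[w] for w in text]
    return sum(BIGRAM.get(pair, -1.0) for pair in zip(decoded, decoded[1:]))

def hill_climb(text, steps=2000, seed=0):
    rng = random.Random(seed)
    vocab = sorted(set(text))
    mapping = {w: rng.choice(ENGLISH) for w in vocab}
    best = score(mapping, text)
    for _ in range(steps):
        w = rng.choice(vocab)              # mutate one word's translation
        old = mapping[w]
        mapping[w] = rng.choice(ENGLISH)
        s = score(mapping, text)
        if s >= best:
            best = s                       # keep improvements (and ties)
        else:
            mapping[w] = old               # revert worsening moves
    return mapping, best

mapping, best = hill_climb(CIPHER)
```

The catch, as other commenters note, is that a hill climber will happily find a "high-scoring" mapping for gibberish too, so the score of the best mapping needs to be compared against controls before it means anything.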
Sorry if I missed it, but what about keeping the suffixes and doing some fine-tuning on the source, then clustering sentences - or at least pages, which given the medium should be consistent-ish?
I strongly believe the manuscript is undecipherable in the sense that it's all gibberish. I can't prove it, but at this point I think it's more likely than not to be a hoax.
I have no background in NLP or linguistics, but I do have a question about this:
> I stripped a set of recurring suffix-like endings from each word — things like aiin, dy, chy, and similar variants
This seems to imply stripping the right-hand edges of words, with the assumption that the text was written left to right? Or did you try both possibilities?
The best work on Voynich has been done by Emma Smith, Coons and Patrick Feaster, about loops and QOKEDAR and CHOLDAIIN cycles. Here's a good presentation: https://www.youtube.com/watch?v=SCWJzTX6y9M Zattera and Roe have also done good work on the "slot alphabet". That so many are making progress in the same direction is quite encouraging!
> RU SSUK UKIA UK SSIAKRAINE IARAIN RA AINE RUK UKRU KRIA UKUSSIA IARUK RUSSUK RUSSAINE RUAINERU RUKIA
That is, there may be two "word types" with different statistical properties (as Feaster's video above describes) - perhaps, e.g., two different ciphers used "randomly" next to each other. Figuring out how to imitate the MS's statistical properties would let us determine the cipher system and take steps towards determining its language, etc., so most credible work has gone in this direction over the last 10+ years.
In short, the manuscript looks like a genuine text, not like a random bunch of characters pretending to be a text.
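The "slot" behaviour mentioned above can be imitated with a toy generator: build each word from ordered slots and interleave two word types at random. The slot contents below are invented for illustration; they are not the actual slot alphabet or the Currier A/B languages.

```python
import random

# Invented slot grammars for two "word types" (not the real slot alphabet).
TYPE_A = [["q", ""], ["ok", "ot"], ["e", "ee"], ["dy", "aiin"]]
TYPE_B = [["ch", "sh"], ["ol", "or"], ["d"], ["y", "aiin"]]

def make_word(rng, slots):
    """Fill each slot with one of its glyph groups; skip a slot sometimes."""
    return "".join(rng.choice(s) for s in slots if rng.random() < 0.8)

def generate(n, seed=0):
    """Interleave the two word types 'randomly', as the comment suggests."""
    rng = random.Random(seed)
    return [make_word(rng, TYPE_A if rng.random() < 0.5 else TYPE_B)
            for _ in range(n)]

words = generate(40)
# The output shows Voynich-like rigidity: each glyph group can appear only
# in its fixed position within a word, yet word frequencies still look
# "language-like" at a glance.
```

Comparing the character-position statistics of text like this against the manuscript is exactly the kind of test that distinguishes "generated by a mechanical procedure" from "encodes a language".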
<quote>
Key Findings
* Cluster 8 exhibits high frequency, low diversity, and frequent line-starts — likely a function word group
* Cluster 3 has high diversity and flexible positioning — likely a root content class
* Transition matrix shows strong internal structure, far from random
* Cluster usage and POS patterns differ by manuscript section (e.g., Biological vs Botanical)
Hypothesis
The manuscript encodes a structured constructed or mnemonic language using syllabic padding and positional repetition. It exhibits syntax, function/content separation, and section-aware linguistic shifts — even in the absence of direct translation.
</quote>
Show HN: I modeled the Voynich Manuscript with SBERT to test for structure
(github.com) | 381 points by brig90 | 18 May 2025 | 131 comments
Comments
https://patcon.github.io/polislike-opinion-map-painting/
Painted groups: https://t.co/734qNlMdeh
(Sorry, only really works on desktop)
[1]: https://www.technologyreview.com/2025/04/15/1115125/a-small-...
Cross-referencing each cluster against all the others would be a nice way to show that there's no unexplained variability left in your analysis.
[1] https://www.voynich.nu/transcr.html
https://arstechnica.com/science/2024/09/new-multispectral-an...
but imagine if it was just a (wealthy) child's coloring book or practice book for learning to write lol
https://www.youtube.com/watch?v=p6keMgLmFEk&t=1s
Once again, nice work.
https://www.voynich.ninja/thread-4327-post-60796.html#pid607... is the main forum discussing precisely this. I quite liked this explanation of the apparent structure: https://www.voynich.ninja/thread-4286.html
This site is a great introduction/deep dive: https://www.voynich.nu/
https://www.researchgate.net/publication/368991190_The_Voyni...