TOON – Token Oriented Object Notation (github.com)
155 points by royosherove | 26 October 2025 | 57 comments

Comments

JSON unmarshalling often has to consider separately whether an attribute is absent, false, zero, null, or the empty string, but this was never quite semantically ambiguous enough for my tastes, so adding that void-ish values may also now be serialised as a tuple of length [0] seems to me an excellent additional obfuscation.
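A quick illustration of that existing five-way split (my sketch, in Python):

  import json

  # absent, false, zero, null, and empty string all decode differently,
  # and a consumer may have to handle each case separately:
  for doc in ('{}', '{"a": false}', '{"a": 0}', '{"a": null}', '{"a": ""}'):
      print(repr(json.loads(doc).get("a", "<absent>")))
  # prints: '<absent>', False, 0, None, ''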
I’ll be interested to see benchmarks. My expectation is that accuracy will take a hit on mid-length or longer-context prompts: I’d bet that the heavy use of JSON in fine-tuning will end up hurting the quality of a terser (less reasoning space) novel encoding.
> users[2]{id,name,role}:
>   1,Alice,admin
>   2,Bob,user

Differently than me, I guess. I would read that as "at index value of two, i.e. the third element of an array, the values 1aliceadmin and 2bobuser are stored". Or not, since we want to destructure these values and a pair value out of a tuple of three is given. I would be confused and think: wtf is that, dear user, did you omit or misformat values?
I have this: https://github.com/7mind/sick , a binary deduplicating storage for JSON-like data structures with efficient direct access (no parsing required). It's even more efficient in terms of space and access speed (but not manually editable).
If you instead put parentheses around the lexical sequences, then you wouldn't need syntax like `[3]` to denote length.
You also wouldn't need indentation levels to be syntactically meaningful.
You could also get rid of LLM tokens like square brackets, curly braces, colons, and commas.
And you could have objects nested to arbitrary depth.
All in about the same character count as TOON (sometimes more, sometimes less).
(I was telling someone over the weekend that there are only a few small wins for Lisps in most AI work right now. I hadn't considered that the printed syntax itself might have a use with these huge LLM black boxes.)
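For illustration, the users example might print as S-expressions something like this (my sketch, not from the comment):

  (users (id name role)
    (1 Alice admin)
    (2 Bob user))

No brackets, braces, colons, or commas, no [2] length marker, and the indentation is cosmetic rather than syntactic.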
Similar to YAML? Also do consider ancient formats like fixed width - in which case you don’t even need delimiter characters. Are LLMs clever enough to parse these if given a code book or old-school INPUT statement? Cheers
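For the same users example, fixed-width might look like this (a sketch; the widths are arbitrary, and the "code book" would just declare them, e.g. id in columns 1-3, name in 4-9, role from 10):

  id name  role
  1  Alice admin
  2  Bob   user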
I don’t know what I’m talking about (pure fantasy), but what if you train a model on compressed data and then perform inference on compressed data as well? Could this work? With the output also being compressed and then decompressed by the client?
Neat. I did a similar thing with CSV (instead of JSON) a year back.
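For reference, the running users example as plain CSV (the obvious mapping, not the commenter's actual tool):

  id,name,role
  1,Alice,admin
  2,Bob,user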
Great that there are measurements, but I think the really interesting measure would be to run it against the actual "Structured Output Format" endpoints of LLM providers, e.g. those fine-tuned to return valid JSON.
I wonder how many tokens will be saved compared to real JSON if we use a special version where property names don't require quotes, like in JavaScript.
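For the running example it would look something like this (JSON5-style, my sketch; whether it saves much depends on the tokenizer, since `"name":` often tokenizes about as compactly as `name:`):

  {users: [{id: 1, name: "Alice", role: "admin"},
           {id: 2, name: "Bob", role: "user"}]}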
Indentation-based sounds pretty brittle for a serialization format. I imagine a tabular format that factors out repeating keys could be expressed fairly compactly in JSON itself.
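Something like this, as a sketch (the cols/rows key names are made up):

  {"users": {"cols": ["id", "name", "role"],
             "rows": [[1, "Alice", "admin"],
                      [2, "Bob", "user"]]}}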
The thing is, the game with LLMs is not what's shortest, but what's:
1. Mainstream, so they understand it.
2. What they're tuned for, and they're tuned for what's mainstream (JSON).
Let's take the example. We can keep it JSON, but use more compact list expressions, as tuples when pragmatic.
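A sketch of what that could look like (my reconstruction; the comment's original snippet wasn't preserved):

  {"users": [["id", "name", "role"],
             [1, "Alice", "admin"],
             [2, "Bob", "user"]]}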
If you want to go extreme compression you can shove it all in JSON strings too and keep the larger structure JSON:
["users",
"1:admin:Alice",
"2:user:Bob",
]
You may say "how is this better". Well it's better because it's still JSON, there's less to explain to the LLM, and to your other devs. Even if we use a weird compact format like "id:role:name" this is still shorter to explain than a completely different syntax with its whole world of rules.
Since the whole point of this is to limit LLM token consumption, it'd be interesting to see the results of prompts that use it.
I've seen a ton of people who just paste a CSV into a prompt and expect it to work well because they don't know any better, but the results are typically hot garbage. It's too repetitive, it can't memorize and/or process such a big chunk of data. Asking an LLM to use pandas to iteratively analyze some CSV works great, though.
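A minimal sketch of that iterative approach (the file name and column are assumptions for illustration):

  import pandas as pd

  # Load the data once, then have the LLM ask for small, checkable
  # aggregates instead of pasting the raw CSV into the prompt.
  df = pd.read_csv("users.csv")
  print(df["role"].value_counts())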
I'm sorry I don't see this adding value over various other formats. I don't really want a new object serialization format, I just want the existing ones to have the features I need. YAML but with static typing and schema. XML but without crazy internet features. TOML but with an object format that doesn't hurt my brain. JSON but with decent multiline strings and comments. NestedText but with a sub-standard that provides static-typing and schema and whatnot.
That said: I like the idea!