TOON – Token Oriented Object Notation

(github.com)

Comments

inopinatus 22 hours ago
JSON unmarshalling often has to consider separately whether an attribute is absent, false, zero, null, or the empty string, but this was never quite semantically ambiguous enough for my tastes, so adding that void-ish values may also now be serialised as a tuple of length [0] seems to me an excellent additional obfuscation.
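The absent/false/zero/null/empty-string distinction described above can be sketched in a few lines of Python; `classify` is a hypothetical helper, not part of any library:

```python
import json

def classify(obj: dict, key: str) -> str:
    """Name which void-ish case a decoded JSON attribute falls into."""
    if key not in obj:
        return "absent"
    value = obj[key]
    if value is None:
        return "null"
    if value is False:          # check False before 0, since False == 0 in Python
        return "false"
    if value == 0:
        return "zero"
    if value == "":
        return "empty string"
    if value == []:
        return "empty array"
    return "present"

doc = json.loads('{"a": null, "b": false, "c": 0, "d": "", "e": []}')
```

Each branch is a case a consumer may need to treat differently; TOON's empty-tuple serialisation would add one more.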
vessenes 27 October 2025
I’ll be interested to see benchmarks. My expectation is that accuracy will take a hit on mid or longer context prompts: I’d bet that the heavy use of JSON in fine tuning will end up impacting quality of a more terse (less reasoning space) novel encoding.

That said: I like the idea!

yohbho 3 hours ago
LLMs read

> users[2]{id,name,role}: 1,Alice,admin 2,Bob,user

differently than me, I guess. I would read that as "at index value of two, i.e. the third element of an array, the values 1,Alice,admin and 2,Bob,user are stored", or not, since we want to destructure these values and a pair of values is given against a tuple of three keys. I would be confused and think: wtf is that, dear user, did you omit or misformat values?

pshirshov 5 hours ago
I have this: https://github.com/7mind/sick , a binary deduplicating storage for JSON-like data structures with efficient direct access (no parsing required). It's even more efficient in terms of space and access speed (but not manually editable).
neilv 12 hours ago
If you instead put parentheses around the lexical sequences, then you wouldn't need syntax like `[3]` to denote length.

You also wouldn't need indentation levels to be syntactically meaningful.

You could also get rid of LLM tokens like square brackets, curly braces, colons, and commas.

And you could have objects nested to arbitrary depth.

At nearly the same character count as TOON (sometimes more, sometimes less).

(I was telling someone over the weekend that there are only a few small wins for Lisps in most AI work right now. I hadn't considered that the printed syntax itself might have a use with these LLM huge black boxes.)
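A sketch of what the parent suggests: rendering the users example as plain parenthesized sequences, with no length marker, no meaningful indentation, and none of the bracket/brace/colon/comma tokens. This is a hypothetical encoding, not a published spec:

```python
def to_sexp(value) -> str:
    """Render Python data as a Lisp-style parenthesized sequence."""
    if isinstance(value, dict):
        # each key/value pair becomes its own parenthesized pair
        items = " ".join(f"({k} {to_sexp(v)})" for k, v in value.items())
        return f"({items})"
    if isinstance(value, list):
        return "(" + " ".join(to_sexp(v) for v in value) + ")"
    return str(value)

data = {"users": [{"id": 1, "name": "Alice", "role": "admin"},
                  {"id": 2, "name": "Bob", "role": "user"}]}
print(to_sexp(data))
```

Nesting to arbitrary depth falls out for free, since every level uses the same two delimiters.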

AvAn12 8 hours ago
Similar to YAML? Also do consider ancient formats like fixed width - in which case you don’t even need delimiter characters. Are LLMs clever enough to parse these if given a code book or old-school INPUT statement? Cheers
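The fixed-width idea mentioned above needs only string slices once the "code book" (column layout) is known; the layout here is hypothetical:

```python
# Hypothetical fixed-width layout: id occupies 3 chars, name 8, role 8.
ROWS = ["1  Alice   admin",
        "2  Bob     user"]

def parse_row(line: str) -> dict:
    """Slice a fixed-width record into fields; no delimiter characters needed."""
    return {"id": int(line[0:3]),
            "name": line[3:11].strip(),
            "role": line[11:19].strip()}

records = [parse_row(r) for r in ROWS]
```

This is essentially what an old-school INPUT statement or COBOL picture clause encoded.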
andreygrehov 18 hours ago
I don’t know what I’m talking about (pure fantasy), but what if you train a model on compressed data and then perform inference on compressed data as well? Could this work? With the output also being compressed and then decompressed by the client?
mentalgear 19 hours ago
Neat. I did a similar thing with CSV (instead of JSON) a year back. Great that there are measurements, but I think the really interesting measure would be to run it against the actual "Structured Output" endpoints of LLM providers, e.g. those fine-tuned to return valid JSON.
rs186 15 hours ago
I wonder how many tokens will be saved compared to real JSON if we use a special version where property names don't require quotes, like in JavaScript.
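One rough way to see the saving: strip the quotes from identifier-like keys, as JavaScript object literals (and JSON5) allow. The regex below is a sketch, not a full JSON5 writer; it would misfire on string *values* that happen to end in a colon pattern:

```python
import json
import re

data = {"users": [{"id": 1, "name": "Alice", "role": "admin"}]}
strict = json.dumps(data, separators=(",", ":"))
# Drop quotes around identifier-like keys only (sketch; not a full JSON5 emitter).
loose = re.sub(r'"([A-Za-z_$][A-Za-z0-9_$]*)":', r'\1:', strict)
print(len(strict), len(loose))
```

Two characters saved per key, which adds up fast on arrays of uniform objects.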
anonymoushn 27 October 2025
Hello, it's probably better to add leading spaces before all of the words rather than none of them.
hedgehog 20 hours ago
It would be interesting to compare this to BAML and TOML.
chuckadams 17 hours ago
Indentation-based sounds pretty brittle for a serialization format. I imagine a tabular format that factors out repeating keys could be expressed fairly compactly in JSON itself.
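The factored tabular form imagined above might look like this (hypothetical shape, still plain JSON):

```python
import json

rows = [{"id": 1, "name": "Alice", "role": "admin"},
        {"id": 2, "name": "Bob", "role": "user"}]

# Factor the repeating keys out once; each row becomes a positional tuple.
keys = list(rows[0])
table = {"keys": keys, "rows": [[r[k] for k in keys] for r in rows]}
packed = json.dumps(table, separators=(",", ":"))
```

Any JSON parser can read `packed`, and the key names appear once regardless of row count.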
awaseem 17 hours ago
This is awesome, I saw it on Twitter and gave it a star.
metalliqaz 17 hours ago
What is the font used on that README image?
3cats-in-a-coat 20 hours ago
I'll say the obvious. A lot of this you can just do in JSON.

Let's take the example:

    {
      "users": [
        { "id": 1, "name": "Alice", "role": "admin" },
        { "id": 2, "name": "Bob", "role": "user" }
      ]
    }

    users[2]{id,name,role}:
      1,Alice,admin
      2,Bob,user
We can keep it JSON, but use more compact list expressions, as tuples when pragmatic:

    ["users",
       [1, "Alice", "admin"],
       [2, "Bob", "user"]
    ]
The thing is the game with LLMs is not what's shortest, but what's:

1. Mainstream, so they understand it.

2. What they're tuned for, and they're tuned for what's mainstream (JSON).

If you want to go extreme compression you can shove it all in JSON strings too and keep the larger structure JSON:

    ["users",
       "1:admin:Alice",
       "2:user:Bob"
    ]
You may say "how is this better?" Well, it's better because it's still JSON: there's less to explain to the LLM, and to your other devs. Even if we use a weird compact format like "id:role:name", it's still shorter to explain than a completely different syntax with its whole world of rules.
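Packing and unpacking that "id:role:name" string scheme takes only a couple of lines; the helpers below are hypothetical:

```python
def pack_user(u: dict) -> str:
    """Pack a user record into the "id:role:name" string form from above."""
    return f'{u["id"]}:{u["role"]}:{u["name"]}'

def unpack_user(s: str) -> dict:
    """Reverse pack_user; maxsplit=2 lets names contain further colons."""
    uid, role, name = s.split(":", 2)
    return {"id": int(uid), "role": role, "name": name}
```

The outer structure stays ordinary JSON; only the leaf strings carry the compact convention.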
viggity 4 hours ago
Since the whole point of this is to limit LLM token consumption, it'd be interesting to see the results of prompts that use it.

I've seen a ton of people who just paste a CSV into a prompt and expect it to work well because they don't know any better, but the results are typically hot garbage. The format is too repetitive, and the model can't memorize and/or process such a big chunk of data. Asking an LLM to use pandas to iteratively analyze some CSV works great, though.

meander_water 27 October 2025
I don't get it; can't you just use YAML instead of inventing another DSL?
Pxtl 21 hours ago
I'm sorry I don't see this adding value over various other formats. I don't really want a new object serialization format, I just want the existing ones to have the features I need. YAML but with static typing and schema. XML but without crazy internet features. TOML but with an object format that doesn't hurt my brain. JSON but with decent multiline strings and comments. NestedText but with a sub-standard that provides static-typing and schema and whatnot.
s1mon 18 hours ago
Obligatory XKCD: https://xkcd.com/927/