> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.
> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.
Yeah this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.
> This was one of the most successful product launches of all time. They signed up 100 million new user accounts in a week! They had a single hour where they signed up a million new accounts, as this thing kept on going viral again and again and again.
Awkwardly, I had never heard of it until now. I was aware that at some point they added the ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline Stable Diffusion app on my phone, so it felt like less of an upgrade to me personally). With so much AI news each week, it feels like unless you're really invested in the space, it's almost impossible not to miss or dismiss some big release.
My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
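The averaging approach is easy to sketch. Here is a minimal, hypothetical scoring harness (the model names and judge scores are made up for illustration) showing why a single draw can mislead:

```python
import statistics

def rank_models(scores_by_model):
    """Rank models by their mean score over many samples,
    not by a single draw."""
    means = {model: statistics.mean(scores)
             for model, scores in scores_by_model.items()}
    return sorted(means, key=means.get, reverse=True), means

# Hypothetical judge scores (0-10) for ten generations per model.
scores = {
    "model-a": [7, 5, 6, 8, 4, 6, 7, 5, 6, 6],
    "model-b": [9, 2, 3, 4, 3, 2, 4, 3, 2, 3],
}
ranking, means = rank_models(scores)
# model-b happens to produce the single highest-scoring image (9),
# but model-a is clearly better on average.
```

A single lucky sample from model-b would have "won" the comparison, which is exactly the random-number-generator fallacy described above.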
This measure of LLM capability could be extended by taking it into the 3D domain.
That is, having the model write Python code for Blender, then running Blender in headless mode behind an API.
The talk hints at this, but one-shot prompting likely won't be a broad enough measurement of capability by this time next year (or perhaps even now).
So the test could also include an agentic portion: consulting the latest Blender documentation, or even using a search engine to find blog entries detailing syntax and technique.
For multimodal input processing, it could take a particular photo of a pelican as the test subject.
For usability, the objects could be converted to iOS's native 3D format (USDZ), which can be viewed in Mobile Safari.
I built this workflow, including a service for Blender, as an initial test of what was possible back in October 2022. It took post-processing for common syntax errors then, but I'd imagine the newer LLMs make those mistakes less often now.
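The headless step can be driven along these lines. This is a rough sketch, not the commenter's actual service: the script and output paths are hypothetical, and `blender` is assumed to be on the PATH.

```python
import subprocess

def blender_headless_cmd(script_path, output_path):
    """Build the command line to run a generated Blender Python script
    without the GUI. `--background` suppresses the UI, `--python` runs
    the script, and arguments after `--` are passed through to the
    script via sys.argv."""
    return ["blender", "--background", "--python", script_path,
            "--", output_path]

cmd = blender_headless_cmd("generated_scene.py", "out/scene.glb")
# To actually run it (requires Blender installed):
# subprocess.run(cmd, check=True, timeout=300)
```

Wrapping this in a small HTTP service gives the "Blender behind an API" setup the comment describes; post-processing the LLM output for syntax errors would happen before the command is run.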
Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure representing the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.
I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).
I really enjoy Simon’s work in this space. I’ve read almost every blog post they’ve posted on this and I love seeing them poke and prod the models to see what pops out. The CLI tools are all very easy to use and complement each other nicely all without trying to do too much by themselves.
And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a really strong release, especially the fine-grained MoE which is unlike anything that’s come before (in terms of capability and speed on consumer hardware).
Interesting timeline, though the most relevant part was at the end, where Simon mentions that Google is now aware of the "pelican on bicycle" question, so it is no longer useful as a benchmark. FWIW, many things outside of the training data will pants these models. I just tried this query, which probably has no examples online, and Gemini gave me the standard puzzle answer, which is wrong:
"Say I have a wolf, a goat, and some cabbage, and I want to get them across a river. The wolf will eat the goat if they're left alone, which is bad. The goat will eat some cabbage, and will starve otherwise. How do I get them all across the river in the fewest trips?"
A child would pick up that you have plenty of cabbage, but can't leave the goat without it, lest it starve. Also, there's no mention of boat capacity, so you could just bring them all over at once. Useful? Sometimes. Intelligent? No.
If you calculate Elo based on a round-robin tournament with all participants starting on the same score, the resulting ratings should simply correspond to the win count. I guess the algorithm in use takes the order of the matches into account, but order only matters when competitors are expected to develop significantly over time; otherwise it just adds noise, so we never want it in competitions between bots.
I also can't help but notice that the competition is exactly one match short, for some reason exactly one of the 561 possible pairings has not been included.
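The order-dependence is easy to demonstrate with the standard Elo update rule: the same set of results produces different final ratings depending on the sequence in which they are processed.

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update; score_a is 1.0 if A wins, 0.0 if A loses."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def final_rating_a(results, r_a=1000.0, r_b=1000.0):
    """Play out a sequence of results for player A and return A's rating."""
    for score_a in results:
        r_a, r_b = elo_update(r_a, r_b, score_a)
    return r_a

# One win and one loss for A, processed in both orders:
win_then_loss = final_rating_a([1.0, 0.0])
loss_then_win = final_rating_a([0.0, 1.0])
# The two orderings give different final ratings even though the
# results are identical - the "added noise" described above.
```

With k=32 the gap between the two orderings is about three rating points, despite both sequences containing exactly one win and one loss.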
Does anyone have any thoughts on privacy/safety regarding what he said about GPT memory?
I had heard of prompt injection already, but this seems different, completely out of humans' control. Even when you consider web search functionality, he's right: more and more, users are losing control over context.
Is this dangerous atm? Do you think it will become more dangerous in the future when we chuck even more data into context?
> most people find it difficult to remember the exact orientation of the frame.
Isn't it Δ∇Λ welded together? The bottom left and right vertices are where the wheels are attached to, the middle bottom point is where the big gear with the pedals is. The lambda is for the front wheel because you wouldn't be able to turn it if it was attached to a delta. Right?
I guess having my first bicycle be a cheap Soviet-era one paid off: I spent loads of time fiddling with the chain tension and pulling the chain back onto the gears, so I had to stare at the frame far too much to have forgotten, even today, the way it looks.
The hilarious bit is that this page will soon be scraped by AI bots as learning material, and they'll all learn to draw pelicans on bicycles using this as their primary example material, since these will be the only examples.
So the only bird with a beak even slightly resembling a pelican's was drawn by Gemini 2.5 Pro. In general, none of the outputs resembles a pelican enough that you could tell it apart from just "a bird".
OP seems to ignore that a pelican has a distinct look when evaluating these doodles.
It's not so great at bicycles, either. None of those are close to rideable.
But bicycles are famously hard for artists as well. Cyclists can identify all of the parts, but if you don't ride a lot it can be surprisingly difficult to get all of the major bits of geometry right.
> If you lost interest in local models—like I did eight months ago—it’s worth paying attention to them again. They’ve got good now!
> As a power user of these tools, I want to stay in complete control of what the inputs are. Features like ChatGPT memory are taking that control away from me.
You reap what you sow....
> I already have a tool I built called shot-scraper, a CLI app that lets me take screenshots of web pages and save them as images. I had Claude build me a web page that accepts ?left= and ?right= parameters pointing to image URLs and then embeds them side-by-side on a page. Then I could take screenshots of those two images side-by-side. I generated one of those for every possible match-up of my 34 pelican pictures—560 matches in total.
Surely it would have been easier to use a local tool like ImageMagick? You could even have the AI write a Bash script for you.
> ... but prompt injection is still a thing.
...Why wouldn't it always be? There's no quoting or escaping mechanism that's actually out-of-band.
> There’s this thing I’m calling the lethal trifecta, which is when you have an AI system that has access to private data, and potential exposure to malicious instructions—so other people can trick it into doing things... and there’s a mechanism to exfiltrate stuff.
People in 2025 actually need to be told this. Franklin missed the mark - people today will trip over themselves to give up both their security and their liberty for mere convenience.
> If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it’ll rat you out.
Thank you, Simon! I really enjoyed your PyBay 2023 talk on embeddings and this is great too! I like the personalized benchmark. Hopefully the big LLM providers don't start gaming the pelican index!
I think it's hilarious how humans can make mistakes interpreting the crazy drawings. He says "I like how it solved the problem of pelicans not fitting on bicycles by adding a second smaller bicycle to the stack."
No... that is an attempt at actually drawing the pedals, and putting the pelican's feet right on the pedals!
The last six months in LLMs, illustrated by pelicans on bicycles
(simonwillison.net) | 943 points by swyx | 8 June 2025 | 232 comments
Comments
```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
]
```
https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo
https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7
https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro
Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.
(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m
Thank you, Simon!
The top results (click on the top Solutions) were pretty impressive: https://www.kaggle.com/competitions/drawing-with-llms/leader...
https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...
GIGO in motion :-)
> Claude 4 will rat you out to the feds!
Say what you want about Facebook but at least they released their flagship model fully open.