The last six months in LLMs, illustrated by pelicans on bicycles

(simonwillison.net)

Comments

isx726552 8 June 2025
> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.

> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.

Yeah, this touches on an issue that makes it very difficult to have a public discussion about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.

adrian17 8 June 2025
> This was one of the most successful product launches of all time. They signed up 100 million new user accounts in a week! They had a single hour where they signed up a million new accounts, as this thing kept on going viral again and again and again.

Awkwardly, I had never heard of it until now. I was aware that at some point they added the ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline Stable Diffusion app on my phone, so it felt like less of an upgrade to me personally). With so much AI news each week, it feels like unless you're really invested in the space, it's almost impossible not to accidentally miss or dismiss some big release.

nathan_phoenix 8 June 2025
My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.

You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...

It would be nicer to run the comparison with 10 images (or more) for each LLM and then average the results.
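
Something like this minimal sketch would do it (the generate and score callables here are hypothetical stand-ins for whatever image generation and judging you plug in):

```
import statistics

def compare_models(models, prompt, generate, score, n_samples=10):
    """Rank models by average judge score over n_samples generations each,
    instead of judging a single draw from each probabilistic model."""
    results = {}
    for model in models:
        scores = [score(generate(model, prompt)) for _ in range(n_samples)]
        results[model] = (statistics.mean(scores), statistics.stdev(scores))
    # Sort best-first by mean score; the stdev shows how noisy one sample is.
    return dict(sorted(results.items(), key=lambda kv: kv[1][0], reverse=True))
```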

bredren 8 June 2025
Great writeup.

This measure of LLM capability could be extended by taking it into the 3D domain.

That is, having the model write Python code for Blender, then running Blender in headless mode behind an API.
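
A minimal sketch of that harness, assuming Blender is on the PATH (--background --python is Blender's standard headless invocation; everything else, including the appended render step, is illustrative):

```
import subprocess, tempfile

def render_with_blender(generated_code, out_png="render.png"):
    """Run LLM-generated Blender Python headlessly and save a still image."""
    # Append a render step so whatever scene the generated code builds
    # actually gets written out as an image.
    script = generated_code + (
        "\nimport bpy\n"
        f"bpy.context.scene.render.filepath = {out_png!r}\n"
        "bpy.ops.render.render(write_still=True)\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        script_path = f.name
    # --background: run without a GUI; --python: execute the script, then exit.
    subprocess.run(["blender", "--background", "--python", script_path], check=True)
```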

The talk hints at this, but one-shot prompting likely won’t be a broad enough measurement of capability by this time next year (or perhaps even now).

So the test could also include an agentic portion that involves consulting the latest Blender documentation, or even using a search engine to find blog posts detailing syntax and technique.

For multimodal input processing, it could use a particular photo of a pelican as the test subject.

For usability, the objects could be converted to iOS’s native 3D format (USDZ), which can be viewed in mobile Safari.

I built this workflow, including a service for Blender, as an initial test of what was possible in October 2022. It took post-processing to fix common syntax errors back then, but I’d imagine the newer LLMs make those mistakes less often now.

zurichisstained 8 June 2025
Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:

```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0, duration: 'triplet' },      // triplet rest
  { freq: 329.63, duration: 'half' },    // E4
]
```

But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
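
(For an out-of-browser equivalent, here's a rough stdlib-only Python sketch that renders that kind of structure to a WAV file; the tempo mapping and amplitude are made up:)

```
import math, struct, wave

RATE = 44100
SECONDS = {'quarter': 0.5, 'half': 1.0, 'triplet': 0.5 / 3}  # ~120 bpm

melody = [(261.63, 'quarter'), (0, 'triplet'), (293.66, 'triplet'),
          (0, 'triplet'), (329.63, 'half')]  # freq 0 = rest

frames = bytearray()
for freq, dur in melody:
    for i in range(int(RATE * SECONDS[dur])):
        sample = math.sin(2 * math.pi * freq * i / RATE) if freq else 0.0
        frames += struct.pack('<h', int(sample * 20000))  # 16-bit mono

with wave.open('melody.wav', 'wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(RATE)
    w.writeframes(bytes(frames))
```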

It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.

I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).

https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo

https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7

https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro

Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.

(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m

joshstrange 8 June 2025
I really enjoy Simon’s work in this space. I’ve read almost every blog post he’s published on this, and I love seeing him poke and prod the models to see what pops out. The CLI tools are all very easy to use and complement each other nicely, all without trying to do too much by themselves.

And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.

Thank you Simon!

anon373839 8 June 2025
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a really strong release, especially the fine-grained MoE, which is unlike anything that’s come before (in terms of capability and speed on consumer hardware).

franze 8 June 2025
Here's Claude Opus Extended Thinking: https://claude.ai/public/artifacts/707c2459-05a1-4a32-b393-c...

username223 8 June 2025
Interesting timeline, though the most relevant part was at the end, where Simon mentions that Google is now aware of the "pelican on bicycle" question, so it is no longer useful as a benchmark. FWIW, many things outside of the training data will pants these models. I just tried this query, which probably has no examples online, and Gemini gave me the standard puzzle answer, which is wrong:

"Say I have a wolf, a goat, and some cabbage, and I want to get them across a river. The wolf will eat the goat if they're left alone, which is bad. The goat will eat some cabbage, and will starve otherwise. How do I get them all across the river in the fewest trips?"

A child would pick up that you have plenty of cabbage, but can't leave the goat without it, lest it starve. Also, there's no mention of boat capacity, so you could just bring them all over at once. Useful? Sometimes. Intelligent? No.

NohatCoder 8 June 2025
If you calculate Elo based on a round-robin tournament with all participants starting out on the same score, then the resulting ratings should simply correspond to the win counts. I guess the algorithm in use takes the order of the matches into account, but taking order into account is only meaningful when competitors are expected to develop significantly; otherwise it is just added noise, so we never want to do that in competitions between bots.
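
For reference, a minimal sketch of the standard sequential Elo update (usual K = 32 and 400-point scale; the names and 1000 starting score are made up). Each result shifts the ratings that feed into the next expected score, so replaying the same results in a different order changes the final numbers even though every win count is identical:

```
def play_elo(matches, k=32):
    """Sequential Elo over (winner, loser) pairs; the order of matches matters."""
    ratings = {}
    for winner, loser in matches:
        rw = ratings.setdefault(winner, 1000.0)
        rl = ratings.setdefault(loser, 1000.0)
        expected_w = 1 / (1 + 10 ** ((rl - rw) / 400))  # P(winner beats loser)
        ratings[winner] = rw + k * (1 - expected_w)
        ratings[loser] = rl - k * (1 - expected_w)
    return ratings

# The same round-robin results in two different orders:
print(play_elo([("a", "b"), ("b", "c"), ("a", "c")]))
print(play_elo([("a", "c"), ("a", "b"), ("b", "c")]))  # same win counts, different ratings
```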

I also can't help but notice that the competition is exactly one match short: for some reason, exactly one of the 561 possible pairings (34 choose 2) has not been included.

qwertytyyuu 8 June 2025
Here are a few I tried with the models (https://imgur.com/a/mzZ77xI). It looks like the newer version of Gemini is another improvement?

landgenoot 8 June 2025
If you gave a human the SVG documentation and asked them to write an SVG, I think the results would be quite similar.

joshuajooste05 8 June 2025
Does anyone have any thoughts on privacy/safety regarding what he said about GPT memory?

I had heard of prompt injection already, but this seems different: completely out of humans' control. Even when you consider web search functionality, he is right that, more and more, users are losing control over context.

Is this dangerous atm? Do you think it will become more dangerous in the future when we chuck even more data into context?

Joker_vD 8 June 2025
> most people find it difficult to remember the exact orientation of the frame.

Isn't it Δ∇Λ welded together? The bottom left and right vertices are where the wheels are attached; the middle bottom point is where the big gear with the pedals is. The lambda is for the front wheel, because you wouldn't be able to turn it if it were attached to a delta. Right?

I guess having a cheap Soviet-era bicycle as my first one paid off: I spent loads of time fiddling with the chain tension and pulling the chain back onto the gears, so I had to stare at the frame way too much to have forgotten, even today, the way it looks.

irthomasthomas 8 June 2025
The best pelicans come from running a consortium of models. I use pelicans as evals now: https://x.com/xundecidability/status/1921009133077053462 Test it using VibeLab (WIP): https://x.com/xundecidability/status/1926779393633857715

nowayno583 8 June 2025
That was a very fun recap, thanks for sharing. It's easy to forget how much better these things have gotten. And this was in just six months! Crazy!

djherbis 8 June 2025
Kaggle recently ran a competition to do just this (draw SVGs from prompts, using fairly small models under the hood).

The top results (click on the top Solutions) were pretty impressive: https://www.kaggle.com/competitions/drawing-with-llms/leader...

JimDabell 8 June 2025
See also: The recent history of AI in 32 otters

https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...

buserror 9 June 2025
The hilarious bit is that this page will soon be scraped by AI bots as learning material, and they'll all learn to draw pelicans on bicycles using this as their primary example material, as these will be the only examples.

GIGO in motion :-)

0points 9 June 2025
So the only bird with anything slightly resembling a pelican's beak was drawn by Gemini 2.5 Pro. In general, none of the outputs resemble a pelican closely enough that you could tell it apart from "a bird".

OP seems to ignore, when evaluating these doodles, that a pelican has a distinct look.

nine_k 8 June 2025
Am I the only one who can't help but see these attempts as being much like those of a kid learning to draw?
jfengel 8 June 2025
It's not so great at bicycles, either. None of those are close to rideable.

But bicycles are famously hard for artists as well. Cyclists can identify all of the parts, but if you don't ride a lot it can be surprisingly difficult to get all of the major bits of geometry right.

zahlman 8 June 2025
> If you lost interest in local models—like I did eight months ago—it’s worth paying attention to them again. They’ve got good now!

> As a power user of these tools, I want to stay in complete control of what the inputs are. Features like ChatGPT memory are taking that control away from me.

You reap what you sow....

> I already have a tool I built called shot-scraper, a CLI app that lets me take screenshots of web pages and save them as images. I had Claude build me a web page that accepts ?left= and ?right= parameters pointing to image URLs and then embeds them side-by-side on a page. Then I could take screenshots of those two images side-by-side. I generated one of those for every possible match-up of my 34 pelican pictures—560 matches in total.

Surely it would have been easier to use a local tool like ImageMagick? You could even have the AI write a Bash script for you.
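
A minimal sketch of that, assuming ImageMagick's convert is on the PATH (+append joins images horizontally; the file names are made up):

```
import itertools, subprocess

# 34 pictures -> 34 choose 2 = 561 side-by-side match-ups, no browser needed.
images = [f"pelican_{i:02d}.png" for i in range(34)]  # hypothetical names

for left, right in itertools.combinations(images, 2):
    out = f"{left[:-4]}_vs_{right[:-4]}.png"
    subprocess.run(["convert", left, right, "+append", out], check=True)
```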

> ... but prompt injection is still a thing.

...Why wouldn't it always be? There's no quoting or escaping mechanism that's actually out-of-band.

> There’s this thing I’m calling the lethal trifecta, which is when you have an AI system that has access to private data, and potential exposure to malicious instructions—so other people can trick it into doing things... and there’s a mechanism to exfiltrate stuff.

People in 2025 actually need to be told this. Franklin missed the mark - people today will trip over themselves to give up both their security and their liberty for mere convenience.

pier25 8 June 2025
Definitely getting better, but even the best result is not very impressive.

darkoob12 9 June 2025
Should we be that excited about AI, and should we be calling a fraud-and-plagiarism machine "ChatGPT Mischief Buddy" without any moral deliberation?

spaceman_2020 8 June 2025
I don’t know what secret sauce Anthropic has, but in real-world use, Sonnet is somehow still the best model around. Better than Opus and Gemini Pro.

deadbabe 8 June 2025
As a control, he should go on Fiverr and have a human generate a pelican riding a bicycle, just to see what the eventual goal is.

wohoef 8 June 2025
Quite a detailed image using Claude Sonnet 4: https://ibb.co/39RbRm5W

dirtyhippiefree 8 June 2025
Here’s the spot where we see who’s TL;DR…

> Claude 4 will rat you out to the feds!

> If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it’ll rat you out.

mromanuk 8 June 2025
The last animation is hilarious; it represents the AI hype cycle vs. reality very well.

big_hacker 8 June 2025
Honestly, the metric that has increased the most is the marketing and astroturfing budgets of the major players (OpenAI, Anthropic, Google, and DeepSeek).

Say what you want about Facebook, but at least they released their flagship model fully open.

bravesoul2 8 June 2025
Out of interest, is there a good model (any architecture) for vector graphics?

neepi 8 June 2025
My only take-home is that they are all terrible and I should hire a professional.

atxtechbro 8 June 2025
Thank you, Simon! I really enjoyed your PyBay 2023 talk on embeddings and this is great too! I like the personalized benchmark. Hopefully the big LLM providers don't start gaming the pelican index!

beefnugs 9 June 2025
I think it's hilarious how humans can make mistakes interpreting the crazy drawings: he says "I like how it solved the problem of pelicans not fitting on bicycles by adding a second smaller bicycle to the stack."

No... that is an attempt at actually drawing the pedals and putting the pelican's feet right on the pedals!

NicoSchwandner 8 June 2025
Nice post, thanks!

m3047 8 June 2025
TIL: Snitchbench!