One of the biggest problems with hands-off LLM writing (for long-horizon work like novels) is that you can't really give them any details of your story, because they get absolutely neurotic with it.
Imagine, for instance, that you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meet them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No amount of telling it otherwise changes anything.
This is the first model that, after 19 pages generated so far, resembles anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.
Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
with pastebin - https://pastebin.com/aiWuYcrF
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.
Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.
For those curious the riddle is: There are three people in a circle. Each person has a positive integer floating above their heads, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
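For what it's worth, the statements can be checked mechanically by modelling what each perfect logician can rule out. Below is a rough Python sketch (mine, not the commenter's); the search bound LIMIT and the "report only small solutions" filter SAFE are assumptions to keep a finite search honest, since pruning decisions for worlds near the cap are unreliable (a world's alternative may simply fall outside the enumerated universe).

    # Brute-force epistemic check of the riddle (sketch, assumptions as above).
    from collections import defaultdict

    LIMIT, SAFE = 1000, 100

    # Possible worlds: triples of positive integers <= LIMIT where one number
    # is the sum of the other two. Index 0/1/2 = person 1/2/3.
    worlds = []
    for x in range(1, LIMIT):
        for y in range(1, LIMIT + 1 - x):
            s = x + y
            worlds += [(s, x, y), (x, s, y), (x, y, s)]

    def prune_dont_know(speaker, worlds):
        # The speaker sees the other two numbers. They can deduce their own
        # number only if every surviving world with that view agrees on it,
        # so saying "I don't know" eliminates exactly those worlds.
        others = [j for j in range(3) if j != speaker]
        views = defaultdict(set)
        for w in worlds:
            views[tuple(w[j] for j in others)].add(w[speaker])
        return [w for w in worlds if len(views[tuple(w[j] for j in others)]) > 1]

    for speaker in (0, 1, 2):            # persons 1, 2, 3 each say "I don't know"
        worlds = prune_dont_know(speaker, worlds)

    # Now person 1 *can* deduce his number, and he announces 65.
    views = defaultdict(set)
    for w in worlds:
        views[(w[1], w[2])].add(w[0])
    for a, b, c in worlds:
        if a == 65 and max(a, b, c) <= SAFE and len(views[(b, c)]) == 1:
            print((a, b, c), "product:", a * b * c)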
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.
Plus it drew me a very decent pelican riding a bicycle: https://www.svgviewer.dev/s/FImn7kAo
Notes here: https://simonwillison.net/2025/Mar/25/gemini/
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.
This is well ahead of thinking/reasoning models. A huge jump from prior Gemini models. The first Gemini model to effectively use efficient diff-like editing formats.
[0] https://aider.chat/docs/leaderboards/
If you plan to use Gemini, be warned, here are the usual Big Tech dragons:
Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract from the terms of use:
How human reviewers improve Google AI
To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.
Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.
I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.
> with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.
Been playing around with it and it feels intelligent and up to date. Plus it's connected to the internet, and it acts as a reasoning model by default when it needs to.
I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.
I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?
I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time, they were just tweaking 3 to be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.
I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.
Just a couple of days ago I wrote on reddit about how long context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.
I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).
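As a rough illustration of this kind of long-context test (not the poster's actual setup), here is a sketch using the google-generativeai Python client; the poems/*.txt directory layout and the question are assumptions, and the whole corpus goes into a single prompt with no RAG or chunking.

    import pathlib
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")          # assumed: an AI Studio key
    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

    # Concatenate the chronologically ordered corpus into one prompt.
    corpus = "\n\n".join(
        p.read_text() for p in sorted(pathlib.Path("poems").glob("*.txt"))
    )
    print("tokens:", model.count_tokens(corpus).total_tokens)

    question = "Identify key writing periods across these poems and how they differ."
    response = model.generate_content(corpus + "\n\n" + question)
    print(response.text)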
Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.
The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis to do the same.
Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over a large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.
From https://x.com/OfficialLoganK/status/1904583353954882046:
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!
The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.
Since the 2.0 line, the Gemini models have been far better than GPT and Claude at engineering-type questions (fluids etc.), especially questions with images that require more than just grabbing text. This one is even better.
Impressive model - but I'm confused by the knowledge cutoff. AI Studio says it is January 2025 (which would be impressive), but query it about anything from early 2025 or mid/late 2024 and it self-reports that its cutoff is in 2023 (which can't be right).
This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024) and bun workspaces (from around that time?), but does know how to install bun on Windows (April 2024).
You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.
Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.
I was recently trying to replicate ClaudePlaysPokemon (which uses Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly getting confused and hallucinating significantly more than Claude, making it unviable (although some of that might be caused by my different setup). I wonder if this new model will do better. But I can't easily test it: for now, even paid users are apparently limited to 50 requests per day [1], which is not really enough when every step in the game is a request. Maybe I'll try it anyway, but really I need to wait for them to "introduce pricing in the coming weeks".
Edit: I did try it anyway and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.
Edit 2: Here's the log if anyone is curious: https://a.qoid.us/20250325/ — for some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.
[1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1
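To make "every step in the game is a request" concrete, one step of such a loop might look roughly like this with the google-generativeai Python client; the screenshot path, prompt, and button vocabulary are placeholders, not the harness behind the log above.

    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key="YOUR_API_KEY")          # assumed: an AI Studio key
    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")

    # One "step" of the loop: screenshot in, a single button press out.
    frame = Image.open("frame.png")                  # hypothetical screenshot
    prompt = ("You are playing Pokemon Red. Briefly describe what you see, "
              "then answer on the last line with exactly one button: "
              "A, B, UP, DOWN, LEFT, RIGHT, START or SELECT.")
    response = model.generate_content([prompt, frame])
    action = response.text.strip().splitlines()[-1]
    print(action)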
"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)
About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.
I've been using Gemini Pro for my University of Waterloo capstone engineering project. Really good understanding of PDF documents and good reasoning as well as structured output
Recommend trying it out at aistudio dot google dot com
A model that is better on Aider than Sonnet 3.7? For free, right now? I think I'll give it a spin this weekend on a couple of projects, seems too good to be true.
This looks like the first model where Google seriously comes back into the frontier competition? 2.0 Flash was nice for the price, but it's more focused on efficiency, not performance.
On initial thoughts, I think this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations)
This model is quite impressive. Not just useful for math/research with great reasoning, it also maintained a very low hallucination rate of 1.1% on Vectara Hallucination Leaderboard:
https://github.com/vectara/hallucination-leaderboard
This is part of the code output (after several interactions of it not returning actual code). I'm not very impressed. I pointed out that part of the code, and it answered:
You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.
But then it follows with even worse code. It does not even compile!
With the recent pace of model updates, I wonder which factor is more important: hardware assets, software/talent, or data access. Google is clearly in the lead in terms of data access, in my view. If I were a top talent in AI, I’d go where I can work with the best data, no?
While I'm sure the new Gemini model has made improvements, I feel like the user experience outside of the model itself is stagnating. I think OpenAI's interfaces, both web app and mobile app, are quite a bit more polished currently.
For example, Gemini's speech recognition struggles with longer pauses and often enough cuts me off mid-sentence. Also, OpenAI's Whisper model understands more context (for instance, saying “[...] plex, emby and Jellyfin [...]” is usually understood by Whisper, but less often by Gemini).
The Gemini web app lacks keyboard shortcuts for basic actions like opening a new chat or toggling the sidebar (good for privacy friendly pair programming). Last point off the top of my head would be the ability to edit messages beyond just the last one. That's possible in ChatGPT, but not in Gemini.
Googlers are spending so much money on model training; I would appreciate them spending some on making it fun to use :)
Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.
I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.
Are there benchmarks showing o3-mini performing better than o1?
Gemini refuses to answer any questions on proportional swing models or anything related to psephology on the grounds that it has to do with elections. Neither Claude nor ChatGPT nor Mistral/Le Chat are that neutered.
I ran this command to create it:
curl -s "https://hn.algolia.com/api/v1/items/43473489" | \
jq -r 'recurse(.children[]) | .author + ": " + .text' | \
llm -m "gemini-2.5-pro-exp-03-25" -s \
'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct "quotations" (with author attribution) where appropriate.
You MUST quote directly from users when crediting them, with double quotes.
Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes
I do not intend to take anything away from the technical achievement of the team. However, as Satya opined some weeks back, these benchmarks do not mean a lot if we do not see a comparable increase in productivity.
But then there are two questions. First, are the white-collar workers responsible for the increase in productivity specifically consultants and engineers? Or is it the white-collar workers at the very right tail, e.g., scientists?
I think consultants and engineers are using these technologies a lot. I think biologists at least are using these models a lot. But then where are the productivity increases?
I tested out Gemini 2.5 and it failed miserably at calling into tools that we had defined for it. Also, it got into an infinite loop a number of times where it would just spit out the exact same line of text continuously until we hard killed the process. I really don't know how others are getting these amazing results. We had no problems using Claude or OpenAI models in the same scenario. Even Deepseek R1 works just fine.
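For reference, "calling into tools" here means function calling; a minimal sketch with the google-generativeai Python client's automatic function calling is below. The weather tool is purely hypothetical and not the commenter's actual setup.

    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")          # assumed: an AI Studio key

    def get_weather(city: str) -> str:
        """Return a short weather description for a city (hypothetical tool)."""
        return f"It is sunny in {city}."

    # The client derives a tool schema from the function signature; with
    # automatic function calling it runs the model's tool calls and feeds
    # the results back into the conversation.
    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25", tools=[get_weather])
    chat = model.start_chat(enable_automatic_function_calling=True)
    reply = chat.send_message("What's the weather like in Lisbon right now?")
    print(reply.text)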
Why not enable Canvas for this model on gemini.google.com? Arguably the weakest link of Canvas is the terrible code that Gemini 2.0 Flash writes for Canvas to run.
Can anyone share what they're doing with reasoning models? They seem to only make a difference with novel programming problems, like Advent of Code. So this model will help solve slightly harder advent of codes.
By extension it should also be slightly more helpful for research, R&D?
It will be huge achievement if models can get to the point where so much selection effort isn't required: gemini.google.com currently lists 2.0 Flash, 2.0 Flash Thinking (experimental), Deep Research, Personalization (experimental), and 2.5 Pro (experimental) for me.
I love to see this competition between companies trying to get the best LLM, and also, the fact that they’re trying to make them useful as tools, focusing on math, science, coding, and so on
Is this the first model announcement where they show Aider's Polyglot benchmark in the performance comparison table? That's huge for Aider and anotherpaulg!
I asked for the direction of friction on a ball rolling either up or down an inclined plane - it gave the wrong answer and was adamant about it. Surprisingly, similar to o1.
I gave it a problem which sounds like the Monty Hall problem but is just a simple probability question, and it nailed it.
Asked it to tell a joke - the most horrible joke ever.
Much better than o1, but still nowhere near AGI. It has been optimized for logic and reasoning at best.
I know next to nothing about AI, but I just experienced an extraordinary hallucination in a google AI search (presumably an older Gemini model right?) as I elaborated in detail in another HN thread. It might be a good test question. https://news.ycombinator.com/item?id=43477710
> Developers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio now, and Gemini Advanced users can select it in the model dropdown on desktop and mobile. It will be available on Vertex AI in the coming weeks.
I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp. I also see Gemini 2.0 Pro has been replaced completely in AI Studio.
"Hi, here is our new AI model, it performs task A x% better than competitor 1 and task B y% better than competitor 2" seems to be the new hot AI announcement template in town.
"My info, the stuff I was trained on, cuts off around early 2023." - Gemini 2.5 to me. Appears that they did a not-so-recent knowledge cutoff in order to use the best possible base model.
I tried the beta version of this model to write a business plan (long story).
I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.
I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.
I'm curious to try again, but wary of spending too much time "playing" here.
The Gemini 2.5 model is truly impressive, especially with its multimodal capability. Its ability to understand audio and video content is amazing—truly groundbreaking.
I spent some time experimenting with Gemini 2.5, and its reasoning abilities blew me away. Here are few standout use cases that showcase its potential:
1. Counting Occurrences in a Video
In one experiment, I tested Gemini 2.5 with a video of an assassination attempt on then-candidate Donald Trump. Could the model accurately count the number of shots fired? This task might sound trivial, but earlier AI models often struggled with simple counting tasks (like identifying the number of "R"s in the word "strawberry").
Gemini 2.5 nailed it! It correctly identified each sound, outputted the timestamps where they appeared, and counted eight shots, providing both visual and audio analysis to back up its answer. This demonstrates not only its ability to process multimodal inputs but also its capacity for precise reasoning—a major leap forward for AI systems. (A minimal API sketch of this kind of video prompt follows at the end of this list.)
2. Identifying Background Music and Movie Name
Have you ever heard a song playing in the background of a video and wished you could identify it? Gemini 2.5 can do just that! Acting like an advanced version of Shazam, it analyzes audio tracks embedded in videos and identifies background music. I am also not a big fan of people posting shorts without specifying the movie name. Gemini 2.5 solves that problem for you - no more searching for the movie name!
3. OCR Text Recognition
Gemini 2.5 excels at Optical Character Recognition (OCR), making it capable of extracting text from images or videos with precision. I asked the model to output one of Khan Academy's handwritten visuals into a nice table format - and the text was precisely copied from video into a neat little table!
4. Listen to Foreign News Media
The model can translate text from one language to another and give a good translation. I tested the recent official statement from Thai officials about an earthquake in Bangkok, and the latest news from a Marathi news channel. The model was correctly able to translate and output the news synopsis in the language of your choice.
5. Cricket Fans?
Sports fans and analysts alike will appreciate this use case! I tested Gemini 2.5 on an ICC T20 World Cup cricket match video to see how well it could analyze gameplay data. The results were incredible: the model accurately calculated scores, identified the number of fours and sixes, and even pinpointed key moments—all while providing timestamps for each event.
6. Webinar - Generate Slides from Video
Now this blew my mind - video webinars are generated from slide decks and a person talking about the slides. Can we reverse the process? Given a video, can we ask AI to output the slide deck? Google Gemini 2.5 outputted 41 slides for a Stanford webinar!
Bonus: Humor Test
Finally, I put Gemini 2.5 through a humor test using a PG-13 joke from one of my favorite YouTube channels, Mike and Joelle. I wanted to see if the model could understand adult humor and infer punchlines.
At first, the model hesitated to spell out the punchline (perhaps trying to stay appropriate?), but eventually, it got there—and yes, it understood the joke perfectly!
https://videotobe.com/blog/googles-gemini-25
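For anyone wanting to reproduce the kind of video analysis in use case 1, here is a minimal sketch using the google-generativeai Python client's File API; the file name and prompt are placeholder assumptions, not the exact setup used above.

    import time
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")          # assumed: an AI Studio key

    # Upload the clip via the File API and wait for server-side processing.
    video = genai.upload_file("clip.mp4")            # hypothetical file
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)

    model = genai.GenerativeModel("gemini-2.5-pro-exp-03-25")
    prompt = ("Count the gunshot sounds in this clip and give a timestamp "
              "(mm:ss) for each, noting what is visible on screen at that moment.")
    response = model.generate_content([video, prompt])
    print(response.text)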
I can imagine that it's not so interesting to most of us until we can try it with cursor.
I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.
More generally, it would be nice for these kinds of releases to also report speed and context window as separate benchmarks, or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.
These might be hard to mix to an overall score but they're critical for understanding usefulness.
https://help.kagi.com/kagi/ai/llm-benchmark.html
High quality, to the point. Bit on the slow side. Indeed a very strong model.
Google is back in the game big time.
- Our state-of-the-art model.
- Benchmarks comparing to X,Y,Z.
- "Better" reasoning.
It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.
I'll be looking to see whether Google would be able to use this model (or an adapted version) to tackle ARC-AGI 2.
1 o1-pro (medium reasoning) 82.3
2 o1 (medium reasoning) 70.8
3 o3-mini-high 61.4
4 Gemini 2.5 Pro Exp 03-25 54.1
5 o3-mini (medium reasoning) 53.6
6 DeepSeek R1 38.6
7 GPT-4.5 Preview 34.2
8 Claude 3.7 Sonnet Thinking 16K 33.6
9 Qwen QwQ-32B 16K 31.4
10 o1-mini 27.0
https://github.com/lechmazur/nyt-connections/
I thought memory requirement grows exponentially with context size?
Nobody is going to say "Announcing Foobar 7.1 - not our best!"
Granted, Gemini answers it now; however, this one left me shaking my head.
https://cdn.horizon.pics/PzkqfxGLqU.jpg
Or generate images of the founding fathers of the US that at least to some degree resemble the actual ones?
ChatGPT 4.5, I was excited.
Deepseek, I was excited. (then later disappointed)
I know Gemini probably won't answer any medical question, even if you are a doctor. ChatGPT will.
I know I've been disappointed at the quality of Google's AI products. They are backup at best.
https://news.ycombinator.com/item?id=43474112
I don't see it on the API price list:
https://ai.google.dev/gemini-api/docs/pricing