Gemini 2.5

(blog.google)

Comments

og_kalu 25 March 2025
One of the biggest problems with hands off LLM writing (for long horizon stuff like novels) is that you can't really give them any details of your story because they get absolutely neurotic with it.

Imagine, for instance, that you give the LLM the profile of the love interest for your epic fantasy: it will almost always have the main character meet them within 3 pages (usually page 1), which is of course absolutely nonsensical pacing. No attempt to tell it otherwise changes anything.

This is the first model whose output (19 pages generated so far) resembles anything like normal pacing, even with a TON of details. I've never felt the need to generate anywhere near this much. Extremely impressed.

Edit: Sharing it - https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

with pastebin - https://pastebin.com/aiWuYcrF

malisper 25 March 2025
I've been using a math puzzle as a way to benchmark the different models. The math puzzle took me ~3 days to solve with a computer. A math major I know took about a day to solve it by hand.

Gemini 2.5 is the first model I tested that was able to solve it and it one-shotted it. I think it's not an exaggeration to say LLMs are now better than 95+% of the population at mathematical reasoning.

For those curious, the riddle is: There are three people in a circle. Each person has a positive integer floating above their head, such that each person can see the other two numbers but not his own. The sum of two of the numbers is equal to the third. The first person is asked for his number, and he says that he doesn't know. The second person is asked for his number, and he says that he doesn't know. The third person is asked for his number, and he says that he doesn't know. Then, the first person is asked for his number again, and he says: 65. What is the product of the three numbers?
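For what it's worth, the riddle yields to a brute-force possible-worlds search: each public "I don't know" eliminates every world in which that speaker could already have deduced their number. A minimal sketch in Python (the epistemic checks are exact; the 200-value search window is my own assumption):

```python
def is_world(w):
    """A 'world' is a triple of positive integers where one equals the sum
    of the other two."""
    a, b, c = w
    return min(w) > 0 and (a == b + c or b == a + c or c == a + b)

def candidates(w, i, statements):
    """Values person i could have: they see the other two numbers, so their
    own is either their sum or their difference, and the candidate world
    must be consistent with every public statement made so far."""
    x, y = [w[j] for j in range(3) if j != i]
    out = set()
    for v in (x + y, abs(x - y)):
        cand = tuple(v if j == i else w[j] for j in range(3))
        if is_world(cand) and all(s(cand) for s in statements):
            out.add(v)
    return out

# Each "I don't know" is public: it rules out every world in which that
# speaker could have deduced their number from what was said before them.
def p1_unsure(w): return len(candidates(w, 0, [])) > 1
def p2_unsure(w): return len(candidates(w, 1, [p1_unsure])) > 1
def p3_unsure(w): return len(candidates(w, 2, [p1_unsure, p2_unsure])) > 1
def p1_knows(w):  return len(candidates(w, 0, [p1_unsure, p2_unsure, p3_unsure])) == 1

solutions = [
    (65, b, c)
    for b in range(1, 201) for c in range(1, 201)   # assumed search window
    if is_world((65, b, c))
    and p1_unsure((65, b, c)) and p2_unsure((65, b, c))
    and p3_unsure((65, b, c)) and p1_knows((65, b, c))
]
print(solutions)  # [(65, 26, 39)] -> product 65 * 26 * 39 = 65910
```

The only world that survives all four statements with the first person holding 65 is (65, 26, 39), so the product is 65,910.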

simonw 25 March 2025
I'm impressed by this one. I tried it on audio transcription with timestamps and speaker identification (over a 10 minute MP3) and drawing bounding boxes around creatures in a complex photograph and it did extremely well on both of those.

Plus it drew me a very decent pelican riding a bicycle.

Notes here: https://simonwillison.net/2025/Mar/25/gemini/

freediver 25 March 2025
Tops our benchmark in an unprecedented way.

https://help.kagi.com/kagi/ai/llm-benchmark.html

High quality, to the point. Bit on the slow side. Indeed a very strong model.

Google is back in the game big time.

anotherpaulg 25 March 2025
Gemini 2.5 Pro set the SOTA on the aider polyglot coding leaderboard [0] with a score of 73%.

This is well ahead of other thinking/reasoning models, and a huge jump from prior Gemini models. It is the first Gemini model to effectively use efficient diff-like editing formats.

[0] https://aider.chat/docs/leaderboards/

Oras 25 March 2025
These announcements have started to look like a template.

- Our state-of-the-art model.

- Benchmarks comparing to X,Y,Z.

- "Better" reasoning.

It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.

greatgib 25 March 2025
If you plan to use Gemini, be warned, here are the usual Big Tech dragons:

   Please don’t enter ...confidential info or any data... you wouldn’t want a reviewer to see or Google to use ...
The full extract of the terms of usage:

   How human reviewers improve Google AI

   To help with quality and improve our products (such as the generative machine-learning models that power Gemini Apps), human reviewers (including third parties) read, annotate, and process your Gemini Apps conversations. We take steps to protect your privacy as part of this process. This includes disconnecting your conversations with Gemini Apps from your Google Account before reviewers see or annotate them. Please don’t enter confidential information in your conversations or any data you wouldn’t want a reviewer to see or Google to use to improve our products, services, and machine-learning technologies.
mindwok 26 March 2025
Just adding to the praise: I have a little test case I've used lately which was to identify the cause of a bug in a Dart library I was encountering by providing the LLM with the entire codebase and description of the bug. It's about 360,000 tokens.

I tried it a month ago on all the major frontier models and none of them correctly identified the fix. This is the first model to identify it correctly.

jnd0 25 March 2025
> with Gemini 2.5, we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training. Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.

Been playing around with it and it feels intelligent and up to date. Plus, it is connected to the internet. It behaves as a reasoning model by default when it needs to.

I hope they enable support for the recently released canvas mode for this model soon; it would be a good match.

vineyardmike 25 March 2025
I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?

I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time; they were just tweaking 3 to be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.

I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.

jorl17 25 March 2025
Just a couple of days ago I wrote on reddit about how long context models are mostly useless to me, because they start making too many mistakes very fast. They are vaguely helpful for "needle in a haystack" problems, not much more.

I have a "test" which consists in sending it a collection of almost 1000 poems, which currently sit at around ~230k tokens, and then asking a bunch of stuff which requires reasoning over them. Sometimes, it's something as simple as "identify key writing periods and their differences" (the poems are ordered chronologically). Previous models don't usually "see" the final poems — they get lost, hallucinate and are pretty much worthless. I have tried several workaround techniques with varying degrees of success (e.g. randomizing the poems).

Having just tried this model (I have spent the last 3 hours probing it), I can say that, to me, this is a breakthrough moment. Truly a leap. This is the first model that can consistently comb through these poems (200k+ tokens) and analyse them as a whole, without significant issues or problems. I have no idea how they did it, but they did it.

The analysis of this poetic corpus has few mistakes and is very, very, very good. Certainly very good in terms of how quickly it produces an answer — it would take someone days or weeks of thorough analysis.

Of course, this isn't about poetry — it's about passing in huge amounts of information, without RAG, and having a high degree of confidence in whatever reasoning tasks this model performs. It is the first time that I feel confident that I could offload the task of "reasoning" over a large corpus of data to an LLM. The mistakes it makes are minute, it hasn't hallucinated, and the analysis is, frankly, better than what I would expect of most people.

Breakthrough moment.

nickandbro 25 March 2025
Wow, was able to nail the pelican riding on a bicycle test:

https://www.svgviewer.dev/s/FImn7kAo

falcor84 25 March 2025
I'm most impressed by the improvement on Aider Polyglot; I wasn't expecting it to get saturated so quickly.

I'll be looking to see whether Google would be able to use this model (or an adapted version) to tackle ARC-AGI 2.

ekojs 25 March 2025
> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!

From https://x.com/OfficialLoganK/status/1904583353954882046

The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.

zone411 25 March 2025
Scores 54.1 on the Extended NYT Connections Benchmark, a large improvement over Gemini 2.0 Flash Thinking Experimental 01-21 (23.1).

1. o1-pro (medium reasoning): 82.3
2. o1 (medium reasoning): 70.8
3. o3-mini-high: 61.4
4. Gemini 2.5 Pro Exp 03-25: 54.1
5. o3-mini (medium reasoning): 53.6
6. DeepSeek R1: 38.6
7. GPT-4.5 Preview: 34.2
8. Claude 3.7 Sonnet Thinking 16K: 33.6
9. Qwen QwQ-32B 16K: 31.4
10. o1-mini: 27.0

https://github.com/lechmazur/nyt-connections/

og_kalu 25 March 2025
From the 2.0 line on, the Gemini models have been far better than GPT, and especially Claude, at engineering-type questions (fluids etc.), particularly questions with images that require more than just grabbing text. This one is even better.

M4v3R 25 March 2025
The long-context benchmark numbers seem super impressive: 91% vs 49% for GPT-4.5 at 128k context length.

nikcub 25 March 2025
Impressive model - but I'm confused by the knowledge cutoff. AI Studio says it is January 2025 (which would be impressive), but when I query it about anything from early 2025 or mid/late 2024, it self-reports that its cutoff is in 2023 (which can't be right).

This is most evident when querying about fast-moving dev tools like uv or bun. It seems to only know the original uv options like pip and tools, while with bun it is unfamiliar with bun outdated (from Aug 2024) and bun workspaces (from around that time?), but does know how to install bun on Windows (April 2024).

You'll still need to provide this model with a lot of context to use it with any tooling or libraries with breaking changes or new features from the past ~year - which seems to contradict the AI Studio reported knowledge cutoff.

Were I developing models - I'd prioritise squeezing in the most recent knowledge of popular tools and libraries since development is such a popular (and revenue generating) use case.

Dowwie 25 March 2025
This model is a fucking beast. I am so excited about the opportunities this presents.
comex 25 March 2025
I was recently trying to replicate ClaudePlaysPokemon (which uses Claude 3.7) using Gemini 2.0 Flash Thinking, but it was seemingly getting confused and hallucinating significantly more than Claude, making it unviable (although some of that might be caused by my different setup). I wonder if this new model will do better. But I can't easily test it: for now, even paid users are apparently limited to 50 requests per day [1], which is not really enough when every step in the game is a request. Maybe I'll try it anyway, but really I need to wait for them to "introduce pricing in the coming weeks".

Edit: I did try it anyway, and so far the new model is having similar hallucinations. I really need to test my code with Claude 3.7 as a control, to see if it approaches the real ClaudePlaysPokemon's semi-competence.

Edit 2: Here's the log if anyone is curious. For some reason it's letting me make more requests than the stated rate limit. Note how at 11:27:11 it hallucinates on-screen text, and earlier it thinks some random offscreen tile is the stairs. Yes, I'm sure this is the right model: gemini-2.5-pro-exp-03-25.

https://a.qoid.us/20250325/

[1] https://ai.google.dev/gemini-api/docs/rate-limits#tier-1

ascorbic 26 March 2025
It can answer my favourite riddle for LLMs:

"Anna, Becca and Clare go to the play park. There is nobody else there. Anna is playing on the see-saw, Becca is playing on the swings. What is Clare doing?" (Sometimes I ask similar questions with the same structure and assumptions but different activities)

About a year ago none of them could answer it. All the latest models can pass it if I tell them to think hard, but previously Gemini could rarely answer it without that extra hint. Gemini 2.5 caveats its answer a bit, but does get it correct. Interestingly GPT-4o initially suggests it will give a wrong answer without thinking, but recognises it's a riddle, so decides to think harder and gets it right.

andai 25 March 2025
How does Gemini have such a big context window?

I thought memory requirement grows exponentially with context size?
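For what it's worth, the naive attention score matrix grows quadratically, not exponentially, with context length (and the KV cache only linearly); long-context systems avoid materializing it at all with blockwise techniques like FlashAttention. A back-of-the-envelope sketch, where the head count and fp16 precision are illustrative assumptions rather than Gemini's actual (unpublished) architecture:

```python
# Why naive long context is hard: the attention score matrix is
# (tokens x tokens) per head -- quadratic, not exponential, growth.
BYTES_PER_SCORE = 2   # fp16, assumed
HEADS = 32            # assumed head count

def naive_score_matrix_gib(tokens: int) -> float:
    """Memory to materialize one layer's attention scores, in GiB."""
    return tokens * tokens * HEADS * BYTES_PER_SCORE / 2**30

for n in (8_192, 128_000, 1_000_000):
    print(f"{n:>9} tokens -> {naive_score_matrix_gib(n):12,.1f} GiB per layer")
```

At 8k tokens this is a manageable 4 GiB per layer under these assumptions, but at 1M tokens it would be tens of terabytes, which is why the matrix is computed tile by tile instead of stored.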

batata_frita 25 March 2025
Why do I get the feeling that nobody is as excited about Google's models as about those from other companies?

arjun_krishna1 25 March 2025
I've been using Gemini Pro for my University of Waterloo capstone engineering project. Really good understanding of PDF documents and good reasoning, as well as structured output. Recommend trying it out at aistudio dot google dot com.

d3nj4l 25 March 2025
A model that is better on Aider than Sonnet 3.7? For free, right now? I think I'll give it a spin this weekend on a couple of projects, seems too good to be true.
summerlight 25 March 2025
This looks like the first model where Google seriously comes back into the frontier competition? 2.0 flash was nice for the price but it's more focused on efficiency, not the performance.
Davidzheng 25 March 2025
On initial thoughts, I think this might be the first AI model to be reliably helpful as a research assistant in pure mathematics (o3-mini-high can be helpful but is more prone to hallucinations)
serjester 25 March 2025
I wish they’d mention pricing - it’s hard to seriously benchmark models when you have no idea what putting it in production would actually cost.
strstr 25 March 2025
It's a lot better at my standard benchmark "Magic: The Gathering" rules puzzles. Gets the answers right (both the outcome and rationale).
asah 25 March 2025
It nailed my two hard reasoning+linguistic+math questions in one shot, both the kinds of things that LLM struggle but humans do well.

(DM me for the questions)

ofermend 26 March 2025
This model is quite impressive. Not just useful for math/research with great reasoning, it also maintained a very low hallucination rate of 1.1% on Vectara Hallucination Leaderboard: https://github.com/vectara/hallucination-leaderboard
f1shy 26 March 2025
One test I always do is ask for an absolutely minimal language interpreter with TCO.

This is part of the code output (after several interactions of it not returning actual code):

        // Tail Call Optimization (very basic)
        if(func->type == VAL_FUNCTION){
            return apply(func, args, env); //no stack growth.
        }
        else{
            return apply(func, args, env);
        }
I'm not very impressed.

I pointed out that part of the code, and it answered:

You've correctly pointed out that the TCO implementation in the provided C code snippet is essentially a no-op. The if and else blocks do the same thing: they both call apply(func, args, env). This means there's no actual tail call optimization happening; it's just a regular function call.

But it then follows up with even worse code, which does not even compile!
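For contrast, genuine TCO means the interpreter's eval loop re-binds the expression and environment and iterates, rather than recursing, so tail calls add no stack frames. A minimal sketch in Python (a toy Lisp-ish evaluator of my own, not the code the model produced):

```python
def eval_tco(expr, env):
    while True:                        # the loop IS the optimization
        if isinstance(expr, str):      # variable lookup
            return env[expr]
        if not isinstance(expr, list): # literal
            return expr
        op, *rest = expr
        if op == "if":
            cond, then, alt = rest
            expr = then if eval_tco(cond, env) else alt
            continue                   # tail position: loop, don't recurse
        if op == "lambda":
            params, body = rest
            return ("closure", params, body, env)
        # function application
        func = eval_tco(op, env)
        vals = [eval_tco(a, env) for a in rest]
        if callable(func):             # builtin
            return func(*vals)
        _tag, params, body, closure_env = func
        env = {**closure_env, **dict(zip(params, vals))}
        expr = body                    # tail call: re-bind and loop

env = {"<=": lambda a, b: a <= b, "-": lambda a, b: a - b}
# countdown(n) recurses n times in tail position; with real TCO this
# runs in constant stack instead of overflowing at ~1000 frames.
countdown = ["lambda", ["n"],
             ["if", ["<=", "n", 0], 0, ["countdown", ["-", "n", 1]]]]
env["countdown"] = eval_tco(countdown, env)
print(eval_tco(["countdown", 100_000], env))  # 0, despite 100k tail calls
```

The key move is that the tail call never re-enters `eval_tco`; it just replaces `expr` and `env` and lets the `while` loop continue, which is exactly what the model's if/else version fails to do.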

jasonpeacock 25 March 2025
Isn't every new AI model the "most <adjective>"?

Nobody is going to say "Announcing Foobar 7.1 - not our best!"

lvl155 25 March 2025
With the recent pace of model updates, I wonder which factor is more important: hardware assets, software/talent, or data access. Google is clearly in the lead in terms of data access, in my view. If I were a top talent in AI, I'd go where I can work with the best data, no?

Medicineguy 26 March 2025
While I'm sure the new Gemini model has made improvements, I feel like the user experience outside of the model itself is stagnating. I think OpenAI's interfaces, both web app and mobile app, are currently quite a bit more polished.

For example, Gemini's speech recognition struggles with longer pauses and often cuts me off mid-sentence. OpenAI's Whisper model also understands more context (for instance, saying "[...] plex, emby and Jellyfin [...]" is usually understood by Whisper, but less often by Gemini). The Gemini web app lacks keyboard shortcuts for basic actions like opening a new chat or toggling the sidebar (good for privacy-friendly pair programming). Last point off the top of my head: the ability to edit messages beyond just the last one exists in ChatGPT, but not in Gemini.

Googlers are spending so much money on model training; I would appreciate spending some on making it fun to use :)

barrenko 25 March 2025
The incumbent has awoken.
cj 25 March 2025
Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.

I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.

Are there benchmarks showing o3-mini performing better than o1?

jszymborski 26 March 2025
Gemini refuses to answer any questions on proportional swing models or anything related to psephology on the grounds that it has to do with elections. Neither Claude nor ChatGPT nor Mistral/Le Chat is that neutered.

simonw 26 March 2025
Here's a Gemini 2.5 provided summary of this Hacker News thread as of the moment when it had 269 comments: https://gist.github.com/simonw/3efa62d917370c5038b7acc24b7c7...

I ran this command to create it:

  curl -s "https://hn.algolia.com/api/v1/items/43473489" | \
    jq -r 'recurse(.children[]) | .author + ": " + .text' | \
    llm -m "gemini-2.5-pro-exp-03-25" -s \
    'Summarize the themes of the opinions expressed here.
    For each theme, output a markdown header.
    Include direct "quotations" (with author attribution) where appropriate.
    You MUST quote directly from users when crediting them, with double quotes.
    Fix HTML entities. Output markdown. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece'
Using this script: https://til.simonwillison.net/llms/claude-hacker-news-themes
WasimBhai 25 March 2025
I do not intend to take anything away from the technical achievement of the team. However, as Satya opined some weeks back, these benchmarks do not mean a lot if we do not see a comparable increase in productivity.

But then there are two questions. First, who would be responsible for that increase in productivity: white-collar workers broadly, specifically consultants and engineers, or only white-collar workers at the very right tail, e.g., scientists?

I think consultants and engineers are using these technologies a lot. I think biologists, at least, are using these models a lot.

But then where are the productivity increases?

TheMagicHorsey 25 March 2025
I tested out Gemini 2.5 and it failed miserably at calling into tools that we had defined for it. Also, it got into an infinite loop a number of times where it would just spit out the exact same line of text continuously until we hard killed the process. I really don't know how others are getting these amazing results. We had no problems using Claude or OpenAI models in the same scenario. Even Deepseek R1 works just fine.
jharohit 25 March 2025
why not enable Canvas for this model on Gemini.google.com? Arguably the weakest link of Canvas is the terrible code that Gemini 2.0 Flash writes for Canvas to run..
rodolphoarruda 25 March 2025
I've been trying to use Gemini 2.0 Flash, but I don't think it's possible. The model still thinks it's running the 1.5 Pro model.

Reference: https://rodolphoarruda.pro.br/wp-content/uploads/image-14.pn...

joshdavham 25 March 2025
When these companies release a model “2.5”, are they using some form of semver? Where are these numbers coming from?
daquisu 25 March 2025
Weird, they released Gemini 2.5 but I still can't use 2.0 pro with a reasonable rate limit (5 RPM currently).
andai 25 March 2025
Can anyone share what they're doing with reasoning models? They seem to only make a difference with novel programming problems, like Advent of Code. So this model will help solve slightly harder advent of codes.

By extension it should also be slightly more helpful for research, R&D?

xnx 25 March 2025
It will be a huge achievement if models can get to the point where so much selection effort isn't required: gemini.google.com currently lists 2.0 Flash, 2.0 Flash Thinking (experimental), Deep Research, Personalization (experimental), and 2.5 Pro (experimental) for me.

DaveMcMartin 26 March 2025
I love to see this competition between companies trying to build the best LLM, and also the fact that they’re trying to make them useful as tools, focusing on math, science, coding, and so on.

andai 26 March 2025
I asked it for suggestions for a project, and it was the only model that correctly pointed out serious flaws in the existing proposal. So far so good!
afro88 25 March 2025
Is this the first model announcement where they show Aider's Polyglot benchmark in the performance comparison table? That's huge for Aider and anotherpaulg!
t_minus_40 26 March 2025
I asked for the direction of friction on a ball rolling either up or down an inclined plane. It gave the wrong answer and was adamant about it. Surprisingly, o1 does the same.

I gave it a problem which sounds like the Monty Hall problem but is a simple probability question, and it nailed it.

Asked it to tell a joke: the most horrible joke ever.

Much better than o1, but still nowhere near AGI. It has been optimized for logic and reasoning at best.

theragra 26 March 2025
Yeah, and then it says that Call of Duty is pronounced "call of dah-tee" when I speak in Russian.

ChatGPT pronounced it correctly.

mclau156 25 March 2025
Generated 1000 lines of turn based combat with shop, skills, stats, elements, enemy types, etc. with this one
slama 25 March 2025
Interestingly, the model hallucinated the ability to use a search tool when I was playing around with it
billforsternz 26 March 2025
I know next to nothing about AI, but I just experienced an extraordinary hallucination in a google AI search (presumably an older Gemini model right?) as I elaborated in detail in another HN thread. It might be a good test question. https://news.ycombinator.com/item?id=43477710
dcchambers 25 March 2025
> Developers and enterprises can start experimenting with Gemini 2.5 Pro in Google AI Studio now, and Gemini Advanced users can select it in the model dropdown on desktop and mobile. It will be available on Vertex AI in the coming weeks.

I'm a Gemini Advanced subscriber, still don't have this in the drop-down model selection in the phone app, though I do see it on the desktop webapp.

testycool 26 March 2025
It feels like Gemini 2.0 Pro + Reasoning.

I also see Gemini 2.0 Pro has been replaced completely in AI Studio.

honeybadger1 25 March 2025
Claude is still the king right now for me. Grok is 2nd in line, but sometimes it's better.
Alifatisk 26 March 2025
Can't wait for the benchmark at artificialanalysis.ai
vivzkestrel 25 March 2025
"Hi, here is our new AI model; it performs task A x% better than competitor 1 and task B y% better than competitor 2" seems to be the new hot AI template in town.

eenchev 25 March 2025
"My info, the stuff I was trained on, cuts off around early 2023." - Gemini 2.5 to me. Appears that they did a not-so-recent knowledge cutoff in order to use the best possible base model.
joelthelion 25 March 2025
Is this model going to be restricted to paying users?
marcus_holmes 26 March 2025
I tried the beta version of this model to write a business plan (long story).

I was impressed at first. Then it got really hung up on the financial model, and I had to forcibly move it on. After that it wrote a whole section in Indonesian, which I don't speak, and then it crashed. I'd not saved for a while (ever since the financial model thing), and ended up with an outline and a couple of usable sections.

I mean, yes, this is better than nothing. It's impressive that we made a pile of sand do this. And I'm aware that my prompt engineering could improve a lot. But also, this isn't a usable tool yet.

I'm curious to try again, but wary of spending too much time "playing" here.

pachico 25 March 2025
It really surprises me that Google and Amazon, considering their infrastructure and the urge to excel at this, aren't leading the industry.
andrewinardeer 25 March 2025
Google is overly cautious with their guardrails.

Granted, Gemini answers it now, however, this one left me shaking my head.

https://cdn.horizon.pics/PzkqfxGLqU.jpg

fourseventy 25 March 2025
Does it think the founding fathers were a diverse group of mixed races and genders like the last model did?
noisy_boy 25 March 2025
Are Gemini and Bard the same? I asked it a question and it said "... areas where I, as Bard, have..."

skinkestek 26 March 2025
Can it now generate images of soldiers in typical uniforms from 1940s Germany without having to throw in a few token ethnicities?

Or generate images of the founding fathers of US that at least to some degree resemble the actual ones?

meerab 1 April 2025
The Gemini 2.5 model is truly impressive, especially with its multimodal capability. Its ability to understand audio and video content is amazing—truly groundbreaking.

I spent some time experimenting with Gemini 2.5, and its reasoning abilities blew me away. Here are a few standout use cases that showcase its potential:

1. Counting Occurrences in a Video

In one experiment, I tested Gemini 2.5 with a video of an assassination attempt on then-candidate Donald Trump. Could the model accurately count the number of shots fired? This task might sound trivial, but earlier AI models often struggled with simple counting tasks (like identifying the number of "R"s in the word "strawberry").

Gemini 2.5 nailed it! It correctly identified each sound, outputted the timestamps where they appeared, and counted eight shots, providing both visual and audio analysis to back up its answer. This demonstrates not only its ability to process multimodal inputs but also its capacity for precise reasoning—a major leap forward for AI systems.

2. Identifying Background Music and Movie Name

Have you ever heard a song playing in the background of a video and wished you could identify it? Gemini 2.5 can do just that! Acting like an advanced version of Shazam, it analyzes audio tracks embedded in videos and identifies background music. I am also not a big fan of people posting shorts without specifying the movie name. Gemini 2.5 solves that problem for you - no more searching for movie name!

3. OCR Text Recognition

Gemini 2.5 excels at Optical Character Recognition (OCR), making it capable of extracting text from images or videos with precision. I asked the model to output one of Khan Academy's handwritten visuals into a nice table format - and the text was precisely copied from video into a neat little table!

4. Listen to Foreign News Media

The model can translate text from one language to another and give a good translation. I tested the recent official statement from Thai officials about an earthquake in Bangkok, and the latest news from a Marathi news channel. The model was correctly able to translate and output the news synopsis in the language of your choice.

5. Cricket Fans?

Sports fans and analysts alike will appreciate this use case! I tested Gemini 2.5 on an ICC T20 World Cup cricket match video to see how well it could analyze gameplay data. The results were incredible: the model accurately calculated scores, identified the number of fours and sixes, and even pinpointed key moments—all while providing timestamps for each event.

6. Webinar - Generate Slides from Video

Now this blew my mind: video webinars are made from a slide deck plus a person talking over it. Can we reverse the process? Given a video, can we ask AI to output the slide deck? Google Gemini 2.5 outputted 41 slides for a Stanford webinar!

Bonus: Humor Test

Finally, I put Gemini 2.5 through a humor test using a PG-13 joke from one of my favorite YouTube channels, Mike and Joelle. I wanted to see if the model could understand adult humor and infer punchlines.

At first, the model hesitated to spell out the punchline (perhaps trying to stay appropriate?), but eventually, it got there—and yes, it understood the joke perfectly!

https://videotobe.com/blog/googles-gemini-25

cp9 25 March 2025
does it still suggest glue on pizza
resource_waste 25 March 2025
I'll try it tonight, but I'm not excited; it's just work.

ChatGPT4.5, I was excited.

Deepseek, I was excited. (then later disappointed)

I know Gemini probably won't answer any medical question, even if you are a doctor. ChatGPT will.

I know I've been disappointed at the quality of Google's AI products. They are backup at best.

ototot 25 March 2025
And OpenAI is announcing their ImageGen in 4o

https://news.ycombinator.com/item?id=43474112

throwaway13337 25 March 2025
Google has this habit of 'releasing' AI models without actually releasing them. This looks to be the same?

I don't see it on the API price list:

https://ai.google.dev/gemini-api/docs/pricing

I can imagine that it's not so interesting to most of us until we can try it with cursor.

I look forward to doing so when it's out. That Aider bench mixed with the speed and a long context window that their other models are known for could be a great mix. But we'll have to wait and see.

More generally, it would be nice for these kinds of releases to also report speed and context window as separate benchmarks, or somehow include them in the score. A model that is 90% as good as the best but 10x faster is quite a bit more useful.

These might be hard to mix to an overall score but they're critical for understanding usefulness.