Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison

(composio.dev)

Comments

phkahler 31 March 2025
Here is a real coding problem that I might be willing to make a cash-prize contest for. We'd need to nail down some rules. I'd be shocked if any LLM can do this:

https://github.com/solvespace/solvespace/issues/1414

Make a GTK 4 version of Solvespace. We have a single C++ file for each platform - Windows, Mac, and Linux-GTK3. There is also a QT version on an unmerged branch for reference. The GTK3 file is under 2KLOC. You do not need to create a new version, just rewrite the GTK3 Linux version to GTK4. You may either ask it to port what's there or create the new one from scratch.

If you want to do this for free to prove how great the AI is, please document the entire session. Heck, make a YouTube video of it. The final test is whether I accept the PR or not - and I WANT this ticket done.

I'm not going to hold my breath.

qwertox 31 March 2025
Gemini is the only model that tells me when it's a good time to stop chatting, either because it can't find a solution or because it dislikes my solution (when I actively want to neglect security).

And the context length is just amazing. When ChatGPT's context is full, it totally forgets what we were chatting about, as if it had started an entirely new chat.

Gemini lacks the tooling; ChatGPT is far ahead there. But at its core, Gemini feels like a better model.

neal_ 31 March 2025
I was using Gemini 2.5 Pro yesterday and it does seem decent. I still think Claude 3.5 is better at following instructions than the new 3.7 model, which just goes ham messing stuff up. Really disappointed by Cursor and the Claude CLI tool; for me they create more problems than they fix. I can't figure out how to use them on any of my projects without them ruining the project and creating terrible tech debt. I really like the way Gemini shows how much context window is left; I think every company should have this.

To be honest, I think there has been no major improvement beyond the original models which gained popularity first. It's just marginal improvements, 10% better or something, and the free models like DeepSeek are actually better IMO than anything OpenAI has. I don't think the market can withstand the valuations of the big AI companies. They have no advantage, their models suck worse than free open-source ones, and they charge money??? Where is the benefit to their product?? People originally said the models are the moat and the methods are top secret, but it turns out it's pretty easy to reproduce these models, and it's the application layer built on top of the models that is much more specific and has the real moat. People said the models would engulf these applications built on top and just integrate natively.
thicTurtlLverXX 31 March 2025
In the Rubik's cube example, to solve the cube Gemini 2.5 just reverses the stored scramble sequence:

// --- Solve Function ---

function solveCube() {
  if (isAnimating || scrambleSequence.length === 0) return;

  // Reverse the scramble sequence
  const solveSequence = scrambleSequence
    .slice()
    .reverse()
    .map((move) => {
      if (move.endsWith("'")) return move.slice(0, 1); // U' -> U
      if (move.endsWith("2")) return move; // U2 -> U2
      return move + "'"; // U -> U'
    });

  let promiseChain = Promise.resolve();
  solveSequence.forEach((move) => {
    promiseChain = promiseChain.then(() => applyMove(move));
  });

  // Clear scramble sequence and disable solve button after solving
  promiseChain.then(() => {
    scrambleSequence = []; // Cube is now solved (theoretically)
    solveBtn.disabled = true;
    console.log("Solve complete.");
  });
}
breadwinner 31 March 2025
The loser in the AI model competition appears to be... Microsoft.

When ChatGPT was the only game in town, Microsoft was seen as a leader, thanks to their wise investment in OpenAI. They relied on OpenAI's models and didn't develop their own. As a result, Microsoft has no interesting AI products. Copilot is a flop, and Bing failed to take advantage of AI; Perplexity ate their lunch.

Satya Nadella last year: “Google should have been the default winner in the world of big tech’s AI race”.

Sundar Pichai's response: “I would love to do a side-by-side comparison of Microsoft’s own models and our models any day, any time. They are using someone else's model.”

See: https://www.msn.com/en-in/money/news/sundar-pichai-vs-satya-...

anotherpaulg 31 March 2025
Gemini 2.5 Pro set a new SOTA by a wide margin on the aider polyglot coding leaderboard [0]: it scored 73%, well ahead of the previous 65% SOTA from Sonnet 3.7.

I use LLMs to improve aider, which is >30k lines of python. So not a toy codebase, not greenfield.

I used Gemini 2.5 Pro for the majority of the work on the latest aider release [1]. This is the first release in a very long time which wasn't predominantly written using Sonnet.

The biggest challenge with Gemini right now is the very tight rate limits. Most of my Sonnet usage lately is just when I am waiting for Gemini’s rate limits to cool down.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...

overgard 31 March 2025
I remember back in the day when I did Visual Basic in the 90s there were a lot of cool "New Project from Template" things in Visual Studio, especially when you installed new frameworks, SDKs, and the like. With a click of a button you had something that kind of looked like a professional app! Even now, the various create-whatever-app tooling in npm and Node carries on that legacy.

Anyway, AI "coding" makes me think of that but on steroids. It's fine, but the hype around it is silly, it's like declaring you can replace Microsoft Word because "New Project From Template" you got a little rich text widget in a window with a toolbar.

One of the things mentioned in the article is that the writer was confused that Claude's airplane was sideways. But it makes perfect sense: Claude doesn't really care about or understand airplanes, and as soon as you try to refine these New Project From Template things, the AI quickly stops being useful.

bratao 31 March 2025
For my use case, Gemini 2.5 is terrible. I have complex Cython code in a single file (1500 lines) for sequence labeling. Claude and o3 are very good at improving this code and following commands. Gemini always tries to make unrelated changes. For example, I asked, separately, for small changes such as removing an unused function or caching array indexes. Every time, it completely refactored the code and was obsessed with removing the GIL. The output code is always broken, because removing the GIL is not easy.
kingkongjaffa 31 March 2025
Is there a less biased discussion?

The OP link is a thinly veiled advert for something called Composio, and a biased, overly flowery view of Gemini 2.5 Pro.

Example:

“Everyone’s talking about this model on Twitter (X) and YouTube. It’s trending everywhere, like seriously. The first model from Google to receive such fanfare.

And it is #1 in the LMArena just like that. But what does this mean? It means that this model is killing all the other models in coding, math, Science, Image understanding, and other areas.”

antirez 31 March 2025
In complicated code I'm developing (Redis Vector Sets) I use both Claude 3.7 and Gemini 2.5 Pro to perform code reviews. Gemini 2.5 Pro can find things that are outside Claude's abilities, even though Gemini, as a general-purpose model, is worse overall. It's inherently more powerful at reasoning about complicated code: threading, logical errors, ...
sfjailbird 31 March 2025
Every test task, including the coding test, is a greenfield project. Everything I would consider using LLMs for is not. Like, I would always need it to do some change or fix on a (large) existing project. Hell, even the examples that were generated would likely need subsequent alterations (ten times more effort goes into maintaining a line of code than writing it).

So these tests are meaningless to me, as a measure of how useful these models are. Great for comparison with each other, but would be interesting to include some tests with more realistic work.

anonzzzies 31 March 2025
For Gemini: play around with the temperature. The default is terrible; we had much better results with (much) lower values.
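
With the official @google/generative-ai Node SDK that looks roughly like the sketch below; the model name and the 0.2 are just placeholders, tune to taste:

    import { GoogleGenerativeAI } from "@google/generative-ai";

    // Assumes GEMINI_API_KEY is set in the environment; the model name is a placeholder.
    const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
    const model = genAI.getGenerativeModel({
      model: "gemini-2.5-pro-exp-03-25",
      generationConfig: { temperature: 0.2 }, // well below the default
    });

    const result = await model.generateContent("Refactor this function ...");
    console.log(result.response.text());
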
MrScruff 31 March 2025
The evidence given really doesn't justify the conclusion. Maybe it suggests 2.5 Pro might be better if you're asking it to build Javascript apps from scratch, but that hardly equates to "It's better at coding". Feels like a lot of LLM articles follow this pattern, someone running their own toy benchmarks and confidently extrapolating broad conclusions from a handful of data points. The SWE-Bench result carries a bit more weight but even that should be taken with a pinch of salt.
amazingamazing 31 March 2025
In before people post contradictory anecdotes.

It would be more helpful if people posted the prompt and the entire context, or better yet the whole conversation, so we can all judge for ourselves.

HarHarVeryFunny 31 March 2025
I'd like to see an honest attempt by someone to use one of these SOTA models to code an entire non-trivial app. Not a "vibe coding" flappy bird clone or minimal iOS app (call an API to count calories in a photo), but something real - say 10K LOC type of complexity - using best practices to give the AI all the context and guidance necessary. I'm not expecting the AI to replace the programmer - just to be a useful productivity tool when we move past demos and function writing to tackling real world projects.

It seems to me that where we are today, AI is only useful for coding for very localized tasks, and even there mostly where it's something commonplace and where the user knows enough to guide the AI when it's failing. I'm not at all convinced it's going to get much better until we have models that can actually learn (vs pre-trained) and are motivated to do so.

raffkede 31 March 2025
I had huge success letting Gemini 2.5 one-shot whole codebases in a single text file format and then splitting them up with a script. It puts in work for like 5 minutes and spits out a working codebase. I also asked it to show off a little bit, and it almost one-shotted a Java cloud service to generate PDF invoices from API calls (it made some minor mistakes, but after feeding them back it fixed them).

I basically use two scripts: one to flatten the whole codebase into one text file, and one to split it back up. Give it a shot, it's amazing...
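A rough Node sketch of the idea; the "//// FILE:" marker format is an arbitrary choice, and a real version would skip node_modules, binaries, and so on:

    // flatten: concatenate every file under src/ into one text file,
    // prefixing each with a marker line carrying its path
    import { readFileSync, writeFileSync, readdirSync, statSync, mkdirSync } from "fs";
    import { join, dirname } from "path";

    function walk(dir, files = []) {
      for (const name of readdirSync(dir)) {
        const p = join(dir, name);
        if (statSync(p).isDirectory()) walk(p, files);
        else files.push(p);
      }
      return files;
    }

    const flatText = walk("src")
      .map((p) => `//// FILE: ${p}\n` + readFileSync(p, "utf8"))
      .join("\n");
    writeFileSync("flat.txt", flatText);

    // split: the reverse - recreate the files from the marker lines
    // (run on the model's output after it has edited/generated flat.txt)
    const flat = readFileSync("flat.txt", "utf8");
    for (const chunk of flat.split("//// FILE: ").slice(1)) {
      const nl = chunk.indexOf("\n");
      const path = chunk.slice(0, nl).trim();
      mkdirSync(dirname(path), { recursive: true });
      writeFileSync(path, chunk.slice(nl + 1));
    }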

iammrpayments 31 March 2025
Theo video detected = opinion rejected

Also, I generally dislike thinking models for coding and prefer faster models, so if you have something easy, Gemini 2.0 is good.

Sol- 31 March 2025
Maybe I don't feel the AI FOMO strongly enough, and obviously these performance comparisons can be interesting in their own right to keep track of AI progress, but ultimately it feels like as long as you have a pro subscription to one of the leading providers (OpenAI, Anthropic, or Google), you're fine.

Sure, your provider of choice might fall behind for a few months, but they'll just release a new version eventually and might come out on top again. Intelligence seems commodified enough already that I don't care as much whether I have the best or second best.

jascha_eng 31 March 2025
This is an incredibly bad test for real-world use. Everything the author tested was a clean-slate project; any LLM is going to excel on those.
veselin 31 March 2025
I noticed a similar trend in selling on X. Make a claim, peg it to some product A with good sales - Cursor, Claude, Gemini, etc. Then say the best way to use A is with our own product or guide, be it an MCP or something else.

For some of these I see something like 15k followers on X, but then no LinkedIn page, for example. The website is always a company you cannot contact, and they do everything.

skerit 31 March 2025
I've been using Gemini 2.5 Pro with Roo-Code a lot these past few days. It has really helped me a lot. I managed to get it to implement entire features (with some manual cleaning up at the end).

The fact that it's free for now (I know they use it for training, that's OK) is a big plus, because I've had to restart a task from scratch quite a few times. If I calculate what this would have cost me using Claude, it would have been 200-300 euros.

I've noticed that as soon as it makes a mistake (messing up the diff format is a classic), the current task is basically a total loss. For some reason, most coding tools just inform the model that it made a mistake and should try again... but at that point, its broken response is part of the history, and it's basically multi-shotting itself into making more mistakes. They should really just filter these out.
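Conceptually something like the sketch below, where a malformed turn never enters the context; sendToModel and diffIsValid are placeholders, not any particular tool's API:

    // Retry a coding request without keeping broken replies in the history.
    async function askWithRetry(history, userMsg, maxTries = 3) {
      for (let attempt = 0; attempt < maxTries; attempt++) {
        const reply = await sendToModel([...history, userMsg]);
        if (diffIsValid(reply)) {
          // Only well-formed turns are ever appended to the history.
          history.push(userMsg, reply);
          return reply;
        }
        // A malformed reply is simply dropped and the request is retried with a
        // clean context, instead of appending it plus a "you made a mistake" nudge.
      }
      throw new Error("model kept producing malformed diffs");
    }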

lherron 31 March 2025
These one-shot prompts aren't at all how most engineers use these models for coding. In my experience so far, Gemini 2.5 Pro is great at generating code but not so great at instruction following or tool usage, which are key for any iterative coding tasks. Claude is still king for that reason.
dysoco 31 March 2025
Useful article but I would rather see comparisons where it takes a codebase and tries to modify it given a series of instructions rather than attempting to zero-shot implementations of games or solving problems. I feel like it fits better the real use cases of these tools.
dsign 31 March 2025
I guess it depends on the task? I have very low expectations for Gemini, but I gave it a run with an easy signal-processing problem and it did well. It took 30 seconds to reason through a problem that would have taken me 5 to 10 minutes. Gemini's reasoning was sound (though it took me a couple of minutes to decide that), and it also wrote the functions with the changes (which took me an extra minute to verify). It's not a definitive win in time, but at least there was an extra pair of "eyes"--or whatever that's called with a system like this one.

All in all, I think we humans are well on our way to become legal flesh[].

[] The part of the system to whip or throw in jail when a human+LLM commit a mistake.

phforms 31 March 2025
I like using LLMs more as coding assistants than having them write the actual code. When I am thinking through problems of code organization, API design, naming things, performance optimization, etc., I've found that Claude 3.7 often gives me great suggestions, points me in the right direction, and helps me weigh up the pros and cons of different approaches.

Sometimes I have it write functions that are very boilerplate to save time, but I mostly like to use it as a tool to think through problems, among other tools like writing in a notebook or drawing diagrams. I enjoy programming too much to want an AI to do it all for me (it also helps that I don't do it as a job, though).

paradite 31 March 2025
This is not a good comparison for real world coding tasks.

Based on my own experience and anecdotes, it's worse than Claude 3.5 and 3.7 Sonnet for actual coding tasks on existing projects. It is very difficult to control the model's behavior.

I will probably make a blog post on real world usage.

Extropy_ 31 March 2025
Why is Grok not in their benchmarks? I don't see comparisons to Grok in any recent announcements about models. In fact, I see practically no discussion of Grok on HN or anywhere except Twitter in general.
superkuh 31 March 2025
What is most apparent to me (pasting in existing code and asking for changes) is Gemini 2.5 Pro's tendency to refuse to actually type out subroutines, routinely replacing them with either stubs or comments that say "put the subroutines back here". So even if Gemini's results are good, they're still broken and require lots of manual work/thinking to get the subroutines back into the code and hooked up properly.

With a 1 million token context you'd think they'd let the LLM actually use it but all the tricks to save token count just make it... not useful.

mvkel 1 April 2025
I really wish people would stop evaluating a model's coding capability with one-shots.

The vast majority of coding energy is what comes next.

Even today, Sonnet 3.5 is still the best "existing code base" model. Which is gratifying (to Anthropic) and/or alarming to everyone else.

evantbyrne 31 March 2025
The common issue I run into with all LLMs is that they don't seem to be able to complete the same coding tasks where googling around also fails to provide working solutions. In particular, they seem to struggle with libraries/APIs that are less mainstream.
asdf6969 31 March 2025
Does anyone know of guides for integrating this with any kind of big-co production application? The examples are all small toy projects. My biggest problems look more like: there are 4 packages I need to change, and 3 teams and half a dozen microservices are involved.

Does any LLM do this yet? I want to throw it at a project that's in package and microservice hell and get a useful response. Some weeks I spend almost all my time cutting tickets to other teams, writing documents, and playing politics when the other teams don't want me to touch their stuff. I know my organization is broken, but this is the world I live in.

stared 31 March 2025
At this level, it is very contextual - depending on your tools, prompts, language, libraries, and the whole code base. For example, for one project, I am generating ggplot2 code in R; Claude 3.5 gives way better results than the newer Claude 3.7.

Compare and contrast https://aider.chat/docs/leaderboards/, https://web.lmarena.ai/leaderboard, https://livebench.ai/#/.

eugenekolo 31 March 2025
It's definitely an attempt to compare models, and Gemini clearly won in the tests. But I don't think the tests are particularly good or revealing. It's generally an easy ask to have AI give you greenfield JS code for common tasks, and LeetCode has been done 1000 times over on GitHub and Stack Overflow, so the solutions are all right there.

I'd like to see tests that are more complicated for AI: things like refactoring an existing codebase, writing a program to auto-play God of War for you, improving the response time of a keyboard driver, and so on.

mvdtnz 31 March 2025
I must be missing something about Gemini. When I use the web UI it won't even let me upload source code files directly. If I manually copy some code into a directory and upload that I do get it to work, but the coding output is hilariously bad. It produces ludicrously verbose code that so far for me has been 200% wrong every time.

This is on a Gemini 2.5 Pro free trial. Also - god damn is it slow.

For context this is on a 15k LOC project built about 75% using Claude.

nprateem 31 March 2025
Sometimes these models get tripped up by a mistake. They'll add a comment to the code saying "this is now changed to [whatever]" without actually making the replacement. I tell it it hasn't made the fix; it apologises and does the same thing again. Subsequent responses lead to more profuse apologies, with assertions that it's definitely fixed it this time when it hasn't.

I've seen this occasionally with older Claude models, but Gemini did this to me very recently. Pretty annoying.

ldjkfkdsjnv 31 March 2025
I've been coding with both non-stop the last few days; Gemini 2.5 Pro is not even close. For complicated bug solving, o1 pro is still far ahead of both. Sonnet 3.7 is best overall.
benbojangles 31 March 2025
Don't know what the fuss is about over a dino jump game; Claude made me a Flappy Bird ESP32 game last month in one go: https://www.instagram.com/reel/DGcgYlrI_NK/?utm_source=ig_we...
jstummbillig 31 March 2025
This has not been my experience using it with Windsurf, which touches on an interesting point: When a tool has been optimized around one model, how much is it inhibiting another (newly released) model and how much adjustment is required to take advantage of the new model? Increasingly, as tools get better, we will not directly interact with the models. I wonder how the tool makers handle this.
cadamsdotcom 31 March 2025
Very nice comparison but constrained to greenfield.

Would love to see a similar article that uses LLMs to add a feature to Gimp, or Blender.

larodi 31 March 2025
Funny how the "give me a Dinosaur game" single prompt translates into FF's dinosaur 404-not-found game.
uxx 31 March 2025
Gemini takes parts of the code and just writes "(same as before)" even when I ask it to provide the full code, which for me is a deal breaker.
nisten 31 March 2025
They nerfed it as of Sunday, March 30; a lot of people noticed a performance drop and rambling.

https://x.com/nisten/status/1906141823631769983

Would be nice if this review actually stated exactly when they conducted their tests.

thedangler 31 March 2025
I still can't get any LLM to use my niche API and build out REST requests for all the endpoints. It just makes stuff up, even though it knows the API documentation. As soon as one can do that, I'll be sold. Until then I feel like it's all coding problems it has seen on GitHub or in source code somewhere.
InTheArena 31 March 2025
The amazing bit about Claude Code is its ability to read code and fit into the existing code base. I tried Visual Studio Code with Roo, and it blew up my 50-request daily limit immediately. Any suggestions on better tooling for a Claude Code-like experience with Gemini 2.5 Pro?
0x1ceb00da 31 March 2025
I tried the exact prompt and model from the blog post, but my outputs were way off. Anyone else see this? This is the best-of-3 output for the flight simulator prompt (Gemini 2.5 Pro (experimental)):

https://imgur.com/0uwRbMp

stared 31 March 2025
Just a moment ago I tried to use Gemini 2.5 (in Cursor) to work with the Python Gemini SDK. It failed, even after a few iterations.

Then I ran Claude 3.7 - it worked fine.

So yeah, it depends on the case. But I am surprised that model creators don't put extra effort into handling their own tools and SDKs.

charcircuit 31 March 2025
>Minecraft-styled block buildings

The buildings weren't Minecraft-style in either case. They weren't formed on a voxel grid, and the textures weren't 16x16, but rather a rectangle, or at least stretched onto one. Also, buildings typically are not just built as a cuboid.

ionwake 31 March 2025
Sorry for the noob question, but Claude has Claude Code; does Gemini Pro work with any software in the same way Claude Code works? If so, what software would I use with it? Thank you.
siliconc0w 31 March 2025
This is interesting but too greenfield, someone should do one with an existing OSS project and try to add a feature or fix a bug.
gatienboquet 31 March 2025
Model is insane but the RPM limit is insane too.
willsmith72 31 March 2025
What I love with Claude is MCP with the file system. Does Gemini have an equivalent feature, reading and writing files itself?
simion314 31 March 2025
Yesterday Gemini refused to write a DELETE SQL query because it is dangerous!

So I am feeling super safe. /sarcasm

theonething 31 March 2025
Anybody use Claude, Gemini, ChatGPT, etc. for fixing CSS issues? I've tried with Claude 3.7, with lackluster results. I provided a screenshot and asked it to fix an unwanted artifact.

Wondering about other people's experiences.

sxp 31 March 2025
One prompt I use for testing is: "Using three.js, render a spinning donut with gl.TRIANGLE_STRIP". The catch here is that three.js doesn't support TRIANGLE_STRIP for architectural reasons[1]. Before I knew this, I got confused as to why all the AIs kept failing and gaslighting me about using TRIANGLE_STRIP. If the AI fails to tell the user that this is an impossible task, then it has failed the test. So far, I haven't found an AI that can determine that the request isn't valid.

[1] https://discourse.threejs.org/t/is-there-really-no-way-to-us...
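
For reference, the answer I'd accept is one that points out the limitation and falls back to the usual indexed-geometry route; a minimal sketch of that (plain TorusGeometry, no raw TRIANGLE_STRIP):

    import * as THREE from "three";

    // Standard three.js approach: an indexed TorusGeometry. The renderer issues
    // gl.TRIANGLES draws internally and does not expose gl.TRIANGLE_STRIP.
    const scene = new THREE.Scene();
    const camera = new THREE.PerspectiveCamera(60, innerWidth / innerHeight, 0.1, 100);
    camera.position.z = 3;

    const renderer = new THREE.WebGLRenderer({ antialias: true });
    renderer.setSize(innerWidth, innerHeight);
    document.body.appendChild(renderer.domElement);

    const donut = new THREE.Mesh(
      new THREE.TorusGeometry(1, 0.4, 16, 64),
      new THREE.MeshNormalMaterial()
    );
    scene.add(donut);

    renderer.setAnimationLoop(() => {
      donut.rotation.x += 0.01;
      donut.rotation.y += 0.02;
      renderer.render(scene, camera);
    });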

occamschainsaw 1 April 2025
Is it just me or does Gemini fail the 4D tesseract spinning challenge? That solution looks like a 3D object spinning in 3D space. It seems Claude's solution is better (still difficult to interpret). For reference, this is what a 4D rotation projected to 3D should look like: https://en.wikipedia.org/wiki/Tesseract
mraniki 31 March 2025
TL;DR

If you want to jump straight to the conclusion, I'd say go for Gemini 2.5 Pro: it's better at coding, has a one-million-token context window compared to Claude's 200k, and you can get it for free (a big plus). Claude 3.7 Sonnet is not that far behind, but at this point there's no real reason to use it over Gemini 2.5 Pro.

claudiug 31 March 2025
That Theo-t3 guy is a bit too strange for my taste :)
igorguerrero 31 March 2025

    consistently 1-shots entire tickets
Uhh, no? First off, that's a huge exaggeration even for human coders; second, I think for this to be true your project is probably a blog.