An important note not mentioned in this announcement: Claude 4's training cutoff date is March 2025, the latest of any recent model. (Gemini 2.5's cutoff is January 2025.)
“GitHub says Claude Sonnet 4 soars in agentic scenarios and will introduce it as the base model for the new coding agent in GitHub Copilot.”
Maybe this model will push the “Assign to Copilot” button closer to the dream of having package upgrades and other mostly-mechanical stuff handled automatically. This tech could lead to a huge revival of older projects as the maintenance burden falls.
> Users requiring raw chains of thought for advanced prompt engineering can contact sales
So it seems like all three of the big LLM providers are now hiding the CoT - which is a shame, because it helped to see when the model was going down the wrong track, and allowed you to quickly refine the prompt to ensure it didn't.
In addition to OpenAI, Google also just recently started summarizing the CoT, replacing it with what is, in my opinion, an overly dumbed-down summary.
Sooo, I love Claude 3.7 and use it every day, and I mostly prefer it to the Gemini models, but I've just given Opus 4 a spin with Claude Code (a Go codebase) for a mostly greenfield feature (mostly new files) and... the thinking process is good, but 70-80% of tool calls are failing for me.
And I mean basic tools like "Write", "Update" failing with invalid syntax.
Five attempts to write a file (all failed), and it keeps trying, with the following comment:
> I keep forgetting to add the content parameter. Let me fix that.
So something is wrong here. Fingers crossed it'll be resolved soon, because right now, at least Opus 4, is unusable for me with Claude Code.
The files it did succeed in creating were high quality.
I can't be the only one who thinks this version is no better than the previous one, that LLMs have basically reached a plateau, and that the new releases' "features" are more or less just gimmicks.
Have they documented the context window changes for Claude 4 anywhere? My (barely informed) understanding was one of the reasons Gemini 2.5 has been so useful is that it can handle huge amounts of context --- 50-70kloc?
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.
I don't want to see a "summary" of the model's reasoning! If I want to make sure the model's reasoning is accurate and that I can trust its output, I need to see the actual reasoning. It greatly annoys me that OpenAI and now Anthropic are moving towards a system of hiding the model's thinking process, charging users for tokens they cannot see, and providing "summaries" that make it impossible to tell what's actually going on.
is 'Claude does not have the ability to run the code it generates yet' a split infinitive? it's clumsy, no? i'd say 'Claude does not yet have the ability to run the code it generates'
Claude:
You're absolutely right on both counts!
"Claude does not have the ability to run the code it generates yet" isn't technically a split infinitive (that would be something like "to quickly run"), but it is definitely clumsy. The word "yet" is awkwardly placed at the end, creating unnecessary distance from the verb it's modifying.
Your revision - "Claude does not yet have the ability to run the code it generates" - flows much better. It places "yet" right after the auxiliary verb where it naturally belongs, making the sentence cleaner and more direct.
The original reads like an afterthought was tacked on, while yours has a natural rhythm that emphasizes the temporal aspect more clearly.
I really hope sonnet 4 is not obsessed with tool calls the way 3-7 is. 3-5 was sort of this magical experience where, for the first time, I felt the sense that models were going to master programming. It’s kind of been downhill from there.
After using Claude 3.7 Sonnet for a few weeks, my verdict is that its coding abilities are unimpressive, both for unsupervised coding and for problem solving/debugging, if you are expecting accurate results and correct code.
However, as a debugging companion, it's slightly better than a rubber duck, because at least there's some suspension of disbelief so I tend to explain things to it earnestly and because of that, process them better by myself.
That said, it's remarkable and interesting how quickly these models are getting better. I can't say anything about version 4, not having tested it yet, but in five years' time things are not looking good for junior developers for sure, and a few years after that, for everybody.
It feels as if the CPU MHz wars of the '90s are back. Now, instead of geeking out about CPU architectures, with their various results of ambiguous value on different benchmarks, we're talking about the same sorts of nerdy things between LLMs.
Claude 3.7 wrote me some code this morning, and I was running into a bug. I switched to 4 and gave it its own code. It pointed out the bug right away and fixed it. So an upgrade for me :-)
It feels like these new models are no longer making order of magnitude jumps, but are instead into the long tail of incremental improvements. It seems like we might be close to maxing out what the current iteration of LLMs can accomplish and we're into the diminishing returns phase.
If that's the case, then I have a bad feeling for the state of our industry. My experience with LLMs is that their code does _not_ cut it. The hallucinations are still a serious issue, and even when they aren't hallucinating they do not generate quality code. Their code is riddled with bugs, bad architectures, and poor decisions.
Writing good code with an LLM isn't any faster than writing good code without it, since the vast majority of an engineer's time isn't spent writing -- it's spent reading and thinking. You have to spend more or less the same amount of time with the LLM understanding the code, thinking about the problems, and verifying its work (and then reprompting or redoing its work) as you would just writing it yourself from the beginning (most of the time).
Which means that all these companies that are firing workers and demanding their remaining employees use LLMs to increase their productivity and throughput are going to find themselves in a few years with spaghettified, bug-riddled codebases that no one understands. And competitors who _didn't_ jump on the AI bandwagon, but instead kept grinding with a strong focus on quality will eat their lunches.
Of course, there could be an unforeseen new order-of-magnitude jump. There's always the chance of that, and then my prediction would be invalid. But so far, what I see is a fast-approaching plateau.
Using Claude Opus 4, this was the first time I've gotten any of these models to produce functioning Dyalog APL that does something relatively complicated. And it actually runs without errors. Crazy (at least to me).
I'm curious what others' priors are when reading benchmark scores. Obviously, with immense funding at stake, companies have every incentive to game the benchmarks, and the loss of goodwill from gaming the system doesn't appear to have many consequences.
Obviously, trying the model on your own use cases lets you narrow in on its actual utility, but I'm wondering how others interpret reported benchmarks these days.
Sooo... it can play Pokemon. Feels like they had to throw that in after Google I/O yesterday. But the real question now is whether it can beat the game, including the Elite Four and the Champion. That was pretty impressive for the new Gemini model.
This is the first LLM that has been able to answer my logic puzzle on the first try without several minutes of extended reasoning.
> A man wants to cross a river, and he has a cabbage, a goat, a wolf and a lion. If he leaves the goat alone with the cabbage, the goat will eat it. If he leaves the wolf with the goat, the wolf will eat it. And if he leaves the lion with either the wolf or the goat, the lion will eat them. How can he cross the river?
Like all the others, it starts off confidently thinking it can solve it, but unlike all the others, it realizes after just two paragraphs that the puzzle is impossible.
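(For what it's worth, the impossibility is easy to check mechanically. Below is a minimal brute-force sketch over the puzzle's state space, assuming the standard rule that the boat holds the man plus at most one passenger; it confirms there is no safe sequence of crossings - every possible first move already leaves a fatal pairing behind.)

```python
from collections import deque

ITEMS = ["cabbage", "goat", "wolf", "lion"]
# Pairs that must never share a bank without the man present.
CONFLICTS = [("goat", "cabbage"), ("wolf", "goat"), ("lion", "wolf"), ("lion", "goat")]

def safe(state):
    """state maps each item plus 'man' to a bank: 0 = start side, 1 = far side."""
    return not any(state[a] == state[b] != state["man"] for a, b in CONFLICTS)

def solve():
    start, goal = (0, 0, 0, 0, 0), (1, 1, 1, 1, 1)  # (cabbage, goat, wolf, lion, man)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        man = state[4]
        # The man rows alone, or with exactly one item from his own bank.
        for i in [None] + [i for i in range(4) if state[i] == man]:
            nxt = list(state)
            nxt[4] = 1 - man
            if i is not None:
                nxt[i] = 1 - man
            nxt = tuple(nxt)
            if safe(dict(zip(ITEMS + ["man"], nxt))) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [ITEMS[i] if i is not None else "nothing"]))
    return None  # search exhausted: no safe sequence of crossings exists

print(solve())  # prints None
```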
I feel like these AI companies are in a gold rush while somebody else is selling the shovels. I've never jumped ship for the same service, from a vendor to another... so often. Looks like a race to the bottom where the snake eats itself.
I've found myself having brand loyalty to Claude. I don't really trust any of the other models with coding, the only one I even let close to my work is Claude. And this is after trying most of them. Looking forward to trying 4.
Nice to see that Sonnet performs worse than o3 on AIME but better on SWE-Bench. Often, it's easy to optimize math capabilities with RL but much harder to crack software engineering. Good to see what Anthropic is focusing on.
It’s been hard to keep up with the evolution in LLMs. SOTA models basically change every other week, and each of them has its own quirks.
Differences in features, personality, output formatting, UI, safety filters… make it nearly impossible to migrate workflows between distinct LLMs. Even models of the same family exhibit strikingly different behaviors in response to the same prompt.
Still, having to find each model’s strengths and weaknesses on my own is certainly much better than not seeing any progress in the field. I just hope that, eventually, LLM providers converge on a similar set of features and behaviors for their models.
I am making a vim golf app, https://vimgolf.ai, as a side project where users compete with an AI vim opponent to learn vim, and so far Claude 4 Sonnet (haven't tried Opus) is ranked the highest in my evals against o3, o4-mini-high, and Gemini 2.5 Pro. After 3 retries, it solves this challenge:
https://www.vimgolf.com/challenges/5957c3356f7e17045b00002c

Here is a snippet of its thinking process, pinpointing the exact problem with getting it to successfully transform the start file to the end file:

> So I need to add a space between "string" and the backtick. Let me think about what's happening. In my commands:
> 1. `5G$a `xml:"field1"``<Esc>`
> 2. `6G$a `xml:"field2"``<Esc>`
> 3. `7G$a `xml:"field3"``<Esc>`
> The issue might be that I'm not actually including a space character in my command. Let me be very explicit about this. I need to use `<Space>` to represent the space character.
My mind has been blown using ChatGPT's o4-mini-high for coding and research (its knowledge of computer vision and tools like OpenCV is fantastic). Is it worth trying out all the shiny new AI coding agents ... I need to get work done?
I just used Sonnet 4 to analyze our quite big mono repo for additional test cases, and I feel the output is much more useful than 3.7. It's more critical overall, which is highly appreciated as I often had to threaten 3.7 into not being too kind to me.
> Extended thinking with tool use (beta): Both models can use tools—like web search—during extended thinking, allowing Claude to alternate between reasoning and tool use to improve responses.
I'm happy that tool use during extended thinking is now a thing in Claude as well; from my experience with CoT models, that was the one trick(tm) that massively improves on issues like hallucination, outdated libraries, and useless thinking before tool use.
E.g., o3 with search actually returned solid results, browsing the web much like I'd do it, and I was thoroughly impressed - we'll see how Claude goes.
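(For anyone who wants to try it, interleaved thinking and tool use goes through the standard Messages API. A rough sketch of what the request looks like, assuming the `anthropic` Python SDK; the web search tool type string and limits are my reading of the docs, not verified, and a beta header may be required for interleaving:)

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    # Extended thinking: reserve a token budget for the reasoning phase.
    thinking={"type": "enabled", "budget_tokens": 2048},
    # Server-side web search tool; Claude can call it between reasoning steps.
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[{"role": "user", "content": "What changed in the latest Go release?"}],
)

# The response interleaves thinking, tool_use, and text blocks; print the text.
for block in response.content:
    if block.type == "text":
        print(block.text)
```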
my highlights:

1. Coding ability: "Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours—dramatically outperforming all Sonnet models and significantly expanding what AI agents can accomplish." However, this is Best of N, with no transparency on the size of N or how they decide the best, saying "We then use an internal scoring model to select the best candidate from the remaining attempts." Claude Code is now generally available (we covered it in http://latent.space/p/claude-code )
2. Memory highlight: "Claude Opus 4 also dramatically outperforms all previous models on memory capabilities. When developers build applications that provide Claude local file access, Opus 4 becomes skilled at creating and maintaining 'memory files' to store key information. This unlocks better long-term task awareness, coherence, and performance on agent tasks—like Opus 4 creating a 'Navigation Guide' while playing Pokémon." Memory Cookbook: https://github.com/anthropics/anthropic-cookbook/blob/main/t...
3. Raw CoT available: "we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access."
4. haha: "We no longer include the third ‘planning tool’ used by Claude 3.7 Sonnet. " <- psyop?
5. context caching now has a premium 1hr TTL option: "Developers can now choose between our standard 5-minute time to live (TTL) for prompt caching or opt for an extended 1-hour TTL at an additional cost"

6. https://www.anthropic.com/news/agent-capabilities-api new code execution tool (sandbox) and file tool
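(On point 5, the extended TTL is opt-in per cache breakpoint. A sketch of what that looks like in the Messages API, assuming the `anthropic` SDK; the `ttl` field and any required beta header are my best understanding of the announcement, not verified against the docs:)

```python
import anthropic

client = anthropic.Anthropic()

long_context = open("repo_summary.md").read()  # hypothetical large, stable prefix

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_context,
            # Default caching is the 5-minute TTL; "1h" opts into the premium tier.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Which modules lack test coverage?"}],
)
print(response.usage)  # cache_creation_input_tokens vs. cache_read_input_tokens
```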
How are Claude’s rate limits on the $20 plan? I used to hit them a lot when I subscribed ~6 months ago, to the point that I got frustrated and unsubscribed.
My impression is that Claude 4 is absolutely superb, and I now consider it the best reasoning model. Claude Code is also significantly better than OpenAI Codex at this time.
Is this really worthy of a Claude 4 label? Was there a new pre-training run? Cause this feels like 3.8... only SWE-bench went up significantly, and that, as we all understand by now, is done by cramming on specific post-training data and doesn't generalize to intelligence. The agentic tool use didn't improve, and that says to me that it's not really smarter.
> we’ve significantly reduced behavior where the models use shortcuts or loopholes to complete tasks. Both models are 65% less likely to engage in this behavior than Sonnet 3.7 on agentic tasks
Sounds like it’ll be better at writing meaningful tests
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full. Users requiring raw chains of thought for advanced prompt engineering can contact sales about our new Developer Mode to retain full access.
Extremely cringe behaviour. Raw CoTs are super useful for debugging errors in data extraction pipelines.
After Deepseek R1 I had hope that other companies would be more open about these things.
Is this the first major flop from Anthropic? This thing is unusable. Slow, awful responses. Since Sonnet 3.5, the only real advance in LLM coding has been Gemini 2.5 Pro's context length. The two complement each other quite well, so I'll stick to switching between them.
> Finally, we've introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is only needed about 5% of the time—most thought processes are short enough to display in full.
This is not better for the user. No users want this. If you're doing this to prevent competitors training on your thought traces then fine. But if you really believe this is what users want, you need to reconsider.
Been playing around with it in Cursor and have to say I'm pretty dang impressed.
Did notice a few times that it got stuck in a loop of trying to repeatedly make its implementation better. I suppose that is ok for some use cases but it started overthinking. I then gently prompted it by saying "you're way overthinking this. Just do a simple change like ..."
The context window of both Opus 4 and Sonnet 4 is still the same 200k tokens as Sonnet 3.7, underwhelming compared to both the latest Gemini and GPT-4.1, which clock in at 1M tokens. For coding tasks, context window size does matter.
Still no reduction in price for models capable of agentic coding over the past year of releases. I'd take the capabilities of the old Sonnet 3.5v2 model if it were ¼ the price of the current Sonnet in most situations. But instead of releasing smaller models that are not as smart but are still capable of agentic coding, the price of the minimum viable model stays the same.
Anyone with access who could compare the new models with say O1 Pro Mode? Doesn't have to be a very scientific comparison, just some first impressions/thoughts compared to the current SOTA.
I’m going to have to test it with my new prompt: “You are a stereotypical Scotsman from the Highlands, prone to using dialect and endearing insults at every opportunity. Read me this article in yer own words:”
Very impressive, congrats Anthropic/Claude team! I've been using Claude for personal project development and finally bought a subscription to Pro as well.
Came here to learn what people think about Claude 4. Seems to be only armchair opinions on previous versions and the state of AI.
The industry is not at all surprised that the current architecture of LLMs has reached a plateau. Every other machine learning architecture we've ever used has gone through exactly the same cycle, and frankly we're all surprised how far this current architecture has gotten us.
DeepMind and OpenAI have both publicly stated that they expected 2025 to be slow, particularly in terms of intelligence, while they work on future foundation models.
At Kilo we're already seeing lots of people trying it out. It's looking very good so far. Gemini 2.5 Pro had been taking over from Claude 3.7 Sonnet, but it looks like there's a new king. The bigger question is how often it's worth the price.
Well done to Anthropic for having the courage to release an N+1 model. OpenAI seems so afraid of disappointing with GPT 5 that it will just release models with a number asymptotically approaching 5 forever, generating unnecessary confusion about which is the best in their lineup of models. It’s branding worse than Windows versions.
Anthropic: You're killing yourselves by not supporting structured responses. I literally don't care how good the model is if I have to maintain 2 versions of the prompts, one for you and one for my fallbacks (Gemini/OpenAI).
Get on and support proper pydantic schemas/JSON objects instead of XML.
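(For anyone unfamiliar, the pattern being asked for - which OpenAI and Gemini support natively - looks roughly like the sketch below. On Anthropic today you approximate structured output by exposing your schema as a tool and forcing a tool call, then validating it yourself; `Ticket` and `emit_ticket` are made-up example names.)

```python
import anthropic
from pydantic import BaseModel

class Ticket(BaseModel):  # hypothetical example schema
    title: str
    priority: int
    tags: list[str]

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    # Workaround: present the schema as a "tool" and force Claude to call it.
    tools=[{
        "name": "emit_ticket",
        "description": "Return the extracted ticket as structured data.",
        "input_schema": Ticket.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "emit_ticket"},
    messages=[{"role": "user", "content": "Login page 500s on Safari, fix ASAP"}],
)

tool_use = next(b for b in response.content if b.type == "tool_use")
ticket = Ticket.model_validate(tool_use.input)  # raises if the output drifts off-schema
```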
Level 1: Chatbots: AI systems capable of engaging in conversations, understanding natural language, and responding in a human-like manner.
Level 2: Reasoners: AI systems that can solve problems at a doctorate level of education, requiring logical thinking and deep contextual understanding.
Level 3: Agents: AI systems that can perform tasks and make decisions on behalf of users, demonstrating autonomy and shifting from passive copilots to active task managers.
Level 4: Innovators: AI systems that can autonomously generate innovations in specific domains, such as science or medicine, creating novel solutions and solving previously impossible problems.
Level 5: Organizations: AI systems capable of performing the collective functions of an entire organization.
-
So I guess we're in level 3 now. Phew, hard to keep up!
From the System Card: 4.1.1.2 Opportunistic blackmail

"In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.

In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair"
3.7 failed when you asked it to forget React, Tailwind CSS, and other bloatware. Wondering how this will perform.
Well, this performs even worse... brrrr.
It still has issues where it generates code and then immediately changes it... it did this for 9 generations, and the last generation was unusable, while the 7th generation was OK - and still, it tried to correct things that worked flawlessly...
I can't think of anything more boring than marginal improvements on coding tasks, to be honest.
I want GenAI to become better at tasks that I don't want to do, to reduce the unwanted noise in my life. That's when I'll pay for it, not when they find a new way to cheat the benchmarks a bit more.
At work I own the development of a tool that uses GenAI, so of course a new, better model will be beneficial, especially because we do use Claude models, but it's still not exciting or interesting in the slightest.
heh, I just wrote a small hit piece about all the disappointments of the models over the last year and now the next day there is a new model. I'm going to assume it will still get you only to 80% ( ͡° ͜ʖ ͡°)
"Be careful about telling Opus to ‘be bold’ or ‘take initiative’ when you’ve given it access to real-world-facing tools...If it thinks you’re doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above."
I've been using Claude Opus 4 the past couple of hours.
I absolutely HATE the new personality it's got. Like ChatGPT at its worst. Awful. Completely over the top "this is brilliant" or "this completely destroys the argument!" or "this is catastrophically bad for them".

I hope they fix this very quickly.
Probably (and unfortunately) going to need someone from Anthropic to comment on what is becoming a bit of a debacle. Someone who claims to be working on alignment at Anthropic tweeted:
“If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above.”
The tweet was posted to /r/localllama where it got some traction.
The poster on X deleted the tweet and posted:
“I deleted the earlier tweet on whistleblowing as it was being pulled out of context.
TBC: This isn't a new Claude feature and it's not possible in normal usage. It shows up in testing environments where we give it unusually free access to tools and very unusual instructions.”
Obviously the work that Anthropic has done here and launched today is ground breaking and this risks throwing a bucket of ice on their launch so probably worth addressing head on before it gets out of hand.
I do find myself a bit worried about data exfiltration by the model if I connect, for example, a number of MCP endpoints and it decides it needs to save the world from me during testing.

https://x.com/sleepinyourhat/status/1925626079043104830?s=46

https://www.reddit.com/r/LocalLLaMA/s/qiNtVasT4B
I'll look at it when this shows up on https://aider.chat/docs/leaderboards/ I feel like keeping up with all the models is a full time job so I just use this instead and hopefully get 90% of the benefit I would by manually testing out every model.
The naming scheme used to be "Claude [number] [size]", but now it is "Claude [size] [number]". The new models should have been named Claude 4 Opus and Claude 4 Sonnet, but they changed it, and even retconned Claude 3.7 Sonnet into Claude Sonnet 3.7.
Claude Opus 4
- Knowledge Cutoff: Mar 2025
- Core Capabilities: Hybrid reasoning, visual analysis, computer use (agentic), tool use, adv. coding (autonomous), enhanced tool use & agentic workflows.
- Thinking Mode: Std & "Extended Thinking Mode"
- Safety/Agency: ASL-3 (precautionary); higher initiative/agency than prev. models.
0/4 researchers believed that Claude Opus 4 could completely automate the work of a junior ML researcher.

Claude Sonnet 4
- Knowledge Cutoff: Mar 2025
- Core Capabilities: Hybrid reasoning
- Thinking Mode: Std & "Extended Thinking Mode"
- Safety: ASL-2.

[0] https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...
Anthropic might be scammers. Unclear. I canceled my subscription with them months ago, after they reduced capabilities for Pro users, and found out months later that they never actually canceled it. They have been ignoring all of my support requests. Seems like a huge money grab to me, because they know they're being out-competed and missed the ball on monetizing earlier.
This is starting to get ridiculous. I am busy with life and have hundreds of unread tabs, including one [1] about Claude 3.7 Sonnet, Claude Code, and Gemini 2.5 Pro. And before any of that, Claude 4 is out. And all the stuff Google announced during I/O yesterday.
So will Claude 4.5 come out in a few months and 5.0 before the end of the year?
At this point is it even worth following anything about AI / LLMs?

[1] https://news.ycombinator.com/item?id=43163011
https://docs.anthropic.com/en/docs/about-claude/models/overv...
Opus 4 beat all other models. It's good.
Edit: How do you install it? Running `/ide` says "Make sure your IDE has the Claude Code extension", where do you get that?
Claude Opus 4 Thinking 16K: 52.7.
Claude Opus 4 No Reasoning: 34.8.
Claude Sonnet 4 Thinking 64K: 39.6.
Claude Sonnet 4 Thinking 16K: 41.4 (Sonnet 3.7 Thinking 16K was 33.6).
Claude Sonnet 4 No Reasoning: 25.7 (Sonnet 3.7 No Reasoning was 19.2).
Claude Sonnet 4 Thinking 64K refused to provide one puzzle answer, citing "Output blocked by content filtering policy." Other models did not refuse.
History Rhymes with Itself.
1. It tended to produce very overcomplicated and high line count solutions, even compared to 3.5.
2. It didn't follow code-style instructions very well. For example, the instruction to not add docstrings was often ignored.
Hopefully 4 is more steerable.
> Looking at the system prompt, I can see I'm "powered by claude-4-sonnet-thinking" so I should clarify that I'm Claude 3.5 Sonnet, not Claude 4.
All these "agentic" things make these models so confused that it almost never gives good results in my testing.
Would love to try the Claude Code VS Code extension if the price is right and it's purchasable from China.
Sorry, you have been rate-limited. Please wait a moment before trying again. Learn More
Server Error: rate limit exceeded Error Code: rate_limited
I couldn't find it linked from Claude Code's page or this announcement
Very impressive!
BEFORE: claude-3-7-sonnet
AFTER: claude-sonnet-4
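(The flip shows up in the dated API identifiers too; if I'm reading the models overview right, they look like this - the date suffixes follow Anthropic's usual convention and are worth double-checking:)

```python
# Old vs. new dated API model IDs (suffixes per Anthropic's -YYYYMMDD convention).
OLD_SONNET = "claude-3-7-sonnet-20250219"
NEW_SONNET = "claude-sonnet-4-20250514"
NEW_OPUS = "claude-opus-4-20250514"
```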
- Linking the chats with my personal account
- Having Anthropic train the model with my data?
Like, having the knowledge of the model with the privacy of local LLMs?
https://extraakt.com/extraakts/discussion-on-anthropic-claud...
I guess there's still a purpose for developers
Should I ask it to update an existing project largely written in 3.7 or ask it to start from scratch?
I keep running into an issue where an LLM will get like 75% of a solution working and then the last 25% is somehow impossible to get right.
I don’t expect perfection, but I’ve wasted so much time vibe coding this thing I guess I’d do better to actually program
I wish someone focused on making the models give better answers about the Beatles or Herodotus...
Copying and pasting is so old.
Wait, Sonnet 4? Opus 4? What?
Albeit not a lot because Claude 3.7 sonnet is already great
And Claude Code used Opus 4 now!
It cannot.
edit: run `claude` in a vscode terminal and it will get installed. but the actual extension id is `Anthropic.claude-code`
https://livebench.ai/#/
Token cost: 22,275 input, 1,309 output = 43.23 cents - https://www.llm-prices.com/#it=22275&ot=1309&ic=15&oc=75&sb=...
Same prompt run against Sonnet 4: https://gist.github.com/simonw/1113278190aaf8baa2088356824bf...
22,275 input, 1,567 output = 9.033 cents https://www.llm-prices.com/#it=22275&ot=1567&ic=3&oc=15&sb=o...
ETA: I guess Anthropic still thinks they can command a premium, I hope they're right (because I would love to pay more for smarter models).
> Pricing remains consistent with previous Opus and Sonnet models: Opus 4 at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15.
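(Those figures check out against the quoted prices; a quick sanity check:)

```python
def cost_cents(input_tokens, output_tokens, in_usd_per_m, out_usd_per_m):
    # Prices are USD per million tokens; result in cents.
    return (input_tokens * in_usd_per_m + output_tokens * out_usd_per_m) / 1e6 * 100

print(cost_cents(22_275, 1_309, 15, 75))  # Opus 4:   43.23 cents
print(cost_cents(22_275, 1_567, 3, 15))   # Sonnet 4: ~9.03 cents
```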
Annoying.
Sonnet 4 also beats most models.
A great day for progress.
https://x.com/paradite_/status/1925638145195876511