Claude Opus 4.6 Hackernews Viewer

Claude Opus 4.6

2285 points by HellsMaddy 5 February 2026 | 984 comments

Comments

ck_one 5 February 2026

Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.

All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it ) For now you can fit the first 4 books (~733K tokens).

Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).

Freaking impressive!

gizmodo59 5 February 2026

5.3 codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes with a 77.3% in Terminal Bench. The shortest lived lead in less than 35 minutes. What a time to be alive!

pjot 5 February 2026

Claude Code release notes:

  > Version 2.1.32:
     • Claude Opus 4.6 is now available!
     • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting
     CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
     • Claude now automatically records and recalls memories as it works
     • Added "Summarize from here" to the message selector, allowing partial conversation summarization.
     • Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically.
     • Fixed @ file completion showing incorrect relative paths when running from a subdirectory
     • Updated --resume to re-use --agent value specified in previous conversation by default.
     • Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which
     previously interrupted tool execution
     • Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions
     without truncation
     • Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
     • VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
     • VSCode: Added spinner when loading past conversations list

legitster 5 February 2026

I'm still not sure I understand Anthropic's general strategy right now.

They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.

Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.

Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.

Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.

simonw 5 February 2026

The bicycle frame is a bit wonky but the pelican itself is great: https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

blibble 5 February 2026

> We build Claude with Claude. Our engineers write code with Claude Code every day

well that explains quite a bit

setgree 17 hours ago

I asked

> Can you find an academic article that _looks_ legitimate -- looks like a real journal, by researchers with what look like real academic affiliations, has been cited hundreds or thousands of times -- but is obviously nonsense, e.g. has glaring typos in the abstract, is clearly garbled or nonsensical?

It pointed me to a bunch of hoaxes. I clarified:

> no, I'm not looking for a hoax, or a deliberate comment on the situation. I'm looking for something that drives home the point that a lot of academic papers that look legit are actually meaningless but, as far as we can tell, are sincere

It provided https://www.sciencedirect.com/science/article/pii/S246802302....

Close, but that's been retracted. So I asked for "something that looks like it's been translated from another language to english very badly and has no actual content? And don't forget the cited many times criteria. " And finally it told me that the thing I'm looking for probably doesn't exist.

For my tastes telling me "no" instead of hallucinating an answer is a real breakthrough.

Someone1234 5 February 2026

Does anyone with more insight into the AI/LLM industry happen to know if the cost to run them in normal user-workflows is falling? The reason I'm asking is because "agent teams" while a cool concept, it largely constrained by the economics of running multiple LLM agents (i.e. plans/API calls that make this practical at scale are expensive).

A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.

rahulroy 5 February 2026

They are also giving away $50 extra pay as you go credit to try Opus 4.6. I just claimed it from the web usage page[1]. Are they anticipating higher token usage for the model or just want to promote the usage?

[1] https://claude.ai/settings/usage

atonse 6 February 2026

Wow, I have been using Open 4.6 and for the last 15 minutes, and it's already made two extremely stupid mistakes... like misunderstanding basic instructions and editing the file in a very silly, basic way. Pretty bad. Never seen this with any model before.

The one bone I'll throw it was that I was asking it to edit its own MCP configs. So maybe it got thoroughly confused?

I dunno what's going on, I'm going to give it the night. It makes no sense whatsoever.

anupamchugh 6 February 2026

  Agent teams in this release is mcp-agent-mail [1] built into
  the runtime. Mailbox, task list, file locking — zero config,
  just works. I forked agent-mail [2], added heartbeat/presence
  tracking, had a PR upstream [3] when agent teams dropped. For
  coordinating Claude Code instances within a session, the
  built-in version wins on friction alone.

  Where it stops: agent teams is session-scoped. I run Claude
  Code during the day, hand off to Codex overnight, pick up in
  the morning. Different runtimes, async, persistent. Agent
  teams dies when you close the terminal — no cross-tool
  messaging, no file leases, no audit trail that outlives the
  session.

  What survives sherlocking is whatever crosses the runtime
  boundary. The built-in version will always win inside its own
  walls — less friction, zero setup. The cross-tool layer is
  where community tooling still has room. Until that gets
  absorbed too.

  [1] https://github.com/Dicklesworthstone/mcp_agent_mail
  [2] https://github.com/anupamchugh/mcp_agent_mail
  [3]
  https://github.com/Dicklesworthstone/mcp_agent_mail/pull/77

insomagent 20 hours ago

I'm not super impressed with the performance, actually. I'm finding that it misunderstands me quite a bit. While it is definitely better at reading big codebases and finding a needle in a haystack, it's nowhere near as good as Opus 4.5 at reading between the lines and figuring out what I really want it to do, even with a pretty well defined issue.

It also has a habit of "running wild". If I say "first, verify you understand everything and then we will implement it."

Well, it DOES output its understanding of the issue. And it's pretty spot-on on the analysis of the issue. But, importantly, it did not correctly intuit my actual request: "First, explain your understanding of this issue to me so I can validate your logic. Then STOP, so I can read it and give you the go ahead to implement."

I think the main issue we are going to see with Opus 4.6 is this "running wild" phenomenon, which is step 1 of the eternal paperclip optimizer machine. So be careful, especially when using "auto accept edits"

replwoacause 6 February 2026

I feel like I can't even try this on the Pro plan because Anthropic has conditioned me to understand that even chatting lightly with the Opus model blows up usage and locks me out. So if I would normally use Sonnet 4.5 for a day's worth of work but I wake up and ask Opus a couple of questions, I might as well just forget about doing anything with Claude for the rest of the day lol. But so far I haven't had this issue with ChatGPT. Their 5.2 model (haven't tried 5.3) worked on something for 2 FREAKING HOURS and I still haven't run into any limits. So yeah, Opus is out for me now unfortunately. Hopefully they make the Sonnet model better though!

dmk 5 February 2026

The benchmarks are cool and all but 1M context on an Opus-class model is the real headline here imo. Has anyone actually pushed it to the limit yet? Long context has historically been one of those "works great in the demo" situations.

zmmmmm 8 hours ago

I'm finding it quite a lot more assertive. It's doing things without asking every now and then. It cleaned up a whole lot of commented out of code that was unrelated to the change it was asked to make. Yes it's not great to have sections of commented out code, but destructive changes really should never be happening outside the scope of what it is asked to do.

And it refuses to do things it doesn't think are on task - I asked it to write a poem about cookies related to the code and it said:

> I appreciate the fun request, but writing poems about cookies isn't a code change — it's outside the scope of what I should be doing here. I'm here to help with code modifications.

I don't think previous models outright refused to help me. While I can see how Anthropic might feel it is helpful to focus it on task, especially for safety reasons, I'm a little concerned at the amount of autonomy it's exhibiting due to that.

hmaxwell 5 February 2026

I just tested both codex 5.3 and opus 4.6 and both returned pretty good output, but opus 4.6's limits are way too strict. I am probably going to cancel my Claude subscription for that reason:

What do you want to do?

  1. Stop and wait for limit to reset
   2. Switch to extra usage
   3. Upgrade your plan

 Enter to confirm · Esc to cancel

How come they don't have "Cancel your subscription and uninstall Claude Code"? Codex lasts for way longer without shaking me down for more money off the base $xx/month subscription.

minimaxir 5 February 2026

Will Opus 4.6 via Claude Code be able to access the 1M context limit? The cost increase by going above 200k tokens is 2x input, 1.5x output, which is likely worth it especially for people with the $100/$200 plans.

itay-maman 5 February 2026

Important: I didn't see opus 4.6 in claude code. I have native install (which is the recommended instllation). So, I re-run the installation command and, voila, I have it now (v 2.1.32)

Installation instructions: https://code.claude.com/docs/en/overview#get-started-in-30-s...

andmarios 21 hours ago

The model seems to have some problems; it just failed to create a markdown table with just 4 rows. The top (title) row had 2 columns, yet in 2 of the 3 data rows, Opus 4.6 tried to add a 3rd column. I had to tell it more than once to get it fixed...

This never happened with Opus 4.5 despite a lot of usage.

charcircuit 5 February 2026

From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).

It also seems misleading to have charts that compare to Sonnet 4.5 and not Opus 4.5 (Edit: It's because Opus 4.5 doesn't have a 1M context window).

It's also interesting they list compaction as a capability of the model. I wonder if this means they have RL trained this compaction as opposed to just being a general summarization and then restarting the agent loop.

blueblisters 22 hours ago

I know most people feel 5.2 is a better coding model but Opus has come in handy several times when 5.2 was stuck, especially for more "weird" tasks like debugging a VIO algorithm.

5.2 (and presumably 5.3) is really smart though and feels like it has higher "raw" intelligence.

Opus feels like a better model to talk to, and does a much better job at non-coding tasks especially in the Claude Desktop app.

Here's an example prompt where Opus in Claude put in a lot more effort and did a better job than GPT5.2 Thinking in ChatGPT:

`find all the pure software / saas stocks on the nyse/nasdaq with at least $10B of market cap. and give me a breakdown of their performance over the last 2 years, 1 year and 6 months. Also find their TTM and forward PE`

Opus usage limits are a bummer though and I am conditioned to reach for Codex/ChatGPT for most trivial stuff.

Works out in Anthropic's favor, as long as I'm subscribed to them.

apetresc 5 February 2026

Impressive that they publish and acknowledge the (tiny, but existent) drop in performance on SWE-Bench Verified between Opus 4.5 to 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to be focused around managing larger context).

But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.

silverwind 5 February 2026

Maybe that's why Opus 4.5 has degraded so much in the recent days (https://marginlab.ai/trackers/claude-code/).

mFixman 5 February 2026

I found that "Agentic Search" is generally useless in most LLMs since sites with useful data tend to block AI models.

The answer to "when is it cheaper to buy two singles rather than one return between Cambridge to London?" is available in sites such as BRFares, but no LLM can scrape it so it just makes up a generic useless answer.

oytis 5 February 2026

Are we unemployed yet?

rchaganti 20 hours ago

I tried 4.6 this morning and it was efficient at understanding a brownfield repo containing a Hugo static site and a custom Hugo theme. Within minutes, it went from exploring every file in the repo to adding new features as Hugo partials. Of course, I ran out of rate-limit! :)

It is very impressive though.

jpcompartir 10 hours ago

4.6 is a beast.

Everything in plan mode first + AskUserQuestionTool, review all plans, get it to write its own CLAUDE.md for coding standards and edit where necessary and away you go.

Seems noticeably better than 4.5 at keeping the codebase slim. Obviously it still needs to be kept an eye on, but it's a step up from 4.5.

dahrkael 21 hours ago

I just tried it. designed a very detailed and reaaonable plan, made some amedments to it and wrote it down to a markdown file. i told it to implement it and it started implementing the original plan instead of the revised one, that was weird.

ayhanfuat 5 February 2026

> For Opus 4.6, the 1M context window is available for API and Claude Code pay-as-you-go users. Pro, Max, Teams, and Enterprise subscription users do not have access to Opus 4.6 1M context at launch.

I didn't see any notes but I guess this is also true for "max" effort level (https://code.claude.com/docs/en/model-config#adjust-effort-l...)? I only see low, medium and high.

DanielHall 5 February 2026

A bit surprised, the first one released wasn't Sonnet 5 after all, since the Google Cloud API had leaked Sonnet 5's model snapshot codename before.

mlmonkey 5 February 2026

> We build Claude with Claude.

How long before the "we" is actually a team of agents?

data-ottawa 5 February 2026

I wonder if I’ve been in A/B test with this.

Claude figured out zig’s ArrayList and io changes a couple weeks ago.

It felt like it got better then very dumb again the last few days.

lukebechtel 5 February 2026

> Context compaction (beta).

> Long-running conversations and agentic tasks often hit the context window. Context compaction automatically summarizes and replaces older context when the conversation approaches a configurable threshold, letting Claude perform longer tasks without hitting limits.

Not having to hand roll this would be incredible. One of the best Claude code features tbh.

rektlessness 19 hours ago

I've been on pro-tier membership and never used Opus until now. Just gave Opus 4.6 a whirl. OMG. What have I been missing.

throwaway2027 5 February 2026

Do they just have the version ready and wait for OpenAI to release theirs first or the other way around or?

nomilk 5 February 2026

Is Opus 4.6 available for Claude Code immediately?

Curious how long it typically takes for a new model to become available in Cursor?

archb 5 February 2026

Can set it with the API identifier on Claude Code - `/model claude-opus-4-6` when a chat session is open.

Aeroi 5 February 2026

($10/$37.50 per million input/output tokens) oof

psim1 5 February 2026

I need an agent to summarize the buzzwordjargonsynergistic word salad into something understandable.

itay-maman 5 February 2026

Impressive results, but I keep coming back to a question: are there modes of thinking that fundamentally require something other than what current LLM architectures do?

Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."

I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.

Philpax 5 February 2026

I'm seeing it in my claude.ai model picker. Official announcement shouldn't be long now.

simonw 5 February 2026

I'm disappointed that they're removing the prefill option: https://platform.claude.com/docs/en/about-claude/models/what...

> Prefilling assistant messages (last-assistant-turn prefills) is not supported on Opus 4.6. Requests with prefilled assistant messages return a 400 error.

That was a really cool feature of the Claude API where you could force it to begin its response with e.g. `<svg` - it was a great way of forcing the model into certain output patterns.

They suggest structured outputs or system prompting as the alternative but I really liked the prefill method, it felt more reliable to me.

kmod 6 February 2026

I think it's interesting that they dropped the date from the API model name, and it's just called "claude-opus-4-6", vs the previous was "claude-opus-4-5-20251101". This isn't an alias like "claude-opus-4-5" was, it's the actual model name. I think this means they're comfortable with bumping the version number if they want to release a revision.

energy123 6 February 2026

Their ARC-AGI-2 leaderboard[0] scores are insensitive to reasoning effort. Low effort gets 64.6% and High effort gets 69.2%.

This is unlike their previous generation of models and their competitors.

What does this indicate?

[0] https://arcprize.org/leaderboard

AstroBen 5 February 2026

Are these the coding tasks the highlighted terminal-bench 2.0 is referring to? https://www.tbench.ai/registry/terminal-bench/2.0?categories...

I'm curious what others think about these? There are only 8 tasks there specifically for coding

petters 5 February 2026

> We build Claude with Claude.

Yes and it shows. Gemini CLI often hangs and enters infinite loops. I bet the engineers at Google use something else internally.

steve_adams_86 22 hours ago

I'm finding it quite good at doing what it thinks it should do, but noticably worse at understanding what I'm telling it to do. Anyone else? I'm both impressed and very disappointed so far.

rohitghumare 5 February 2026

It brings agent swarms aka teams to claude code with this: https://github.com/rohitg00/pro-workflow

But it takes lot of context as a experimental feature.

Use self-learning loop with hooks and claude.md to preserve memory.

I have shared plugin above of my setup. Try it.

mattacular 18 hours ago

It's hard to tell with these releases if Anthropic's astroturfing campaign has come to HN or not but I feel like it probably has

HacklesRaised 6 February 2026

I didn't think LLMs will make us more stupid, we were already scraping the bottom of the barrel.

watson 14 hours ago

I've heard rumors this might be Sonnet 5 rebranded as Opus 4.6. But why? Profit? WDYT?

sega_sai 5 February 2026

Based on these news it seems that Google is losing this game. I like Gemini and their CLI has been getting better, but not enough to catch up. I don't know if it is lack of dedicated models that is problem (my understanding Google's CLI just relies on regular Gemini) or something else.

cutler 19 hours ago

The answer to Life, the Universe and Everything, as we all know, is 42. Who needs Claude when you have Deep Thought.

jonatron 5 February 2026

Can someone ask: "what is the current carrying capacity of 25mm multicore armoured thermoplastic insulated cables with aluminium conductors, on perforated cable tray?" just to see how well it can look up information in BS 7671?

vinhnx 6 February 2026

Just used Opus 4.6 via GitHub Copilot. It feels very different. Inference seems slow for now. I guess Opus 4.6 has adaptive thinking activated by default.

rahulroy 6 February 2026

Is anyone noticing reduced token consumption with Opus 4.6? This could be a release thing, but it would be interesting to observe see how it pans out once the hype cools off.

woodylondon 22 hours ago

So no 1m context window on Claude Code still 200k. Only on the API. they missed that from the marketing.

winterrx 5 February 2026

Agentic search benchmarks are a big gap up. let's see Codex release later today

osti 5 February 2026

Somehow regresses on SWE bench?

sutterd 6 February 2026

I thought Opus 4.5 was an incredible quantum leap forward. I have used Opus 4.6 for a few hours and I hate it. Opus 4.5 would work interactively with me and ask questions. I loved that it would not do things you didn't ask it to do. If it found a bug, it would tell me and ask me if I wanted to fix it. One time there was an obvious one and I didn't want it to fix it. It left the bug. A lot of modesl could not have done that. The problem here is that sometimes when model think is a bug, they are breaking the code buyu fixing it. In my limited usage of Opus 4.6, it is not asking me clarifying questions and anything it comes across that it doesn't like, it changes. It is not working with me. The magic is gone. It feels just like those other models I had used.

I will try again tomorrow and see how it goes.

m-hodges 5 February 2026

> In Claude Code, you can now assemble agent teams to work on tasks together.

rob 5 February 2026

System Card: https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a5...

jorl17 5 February 2026

This is the first model to which I send my collection of nearly 900 poems and an extremely simple prompt (in Portuguese), and it manages to produce an impeccable analysis of the poems, as a (barely) cohesive whole, which span 15 years.

It does not make a single mistake, it identifies neologisms, hidden meaning, 7 distinct poetic phases, recurring themes, fragments/heteronyms, related authors. It has left me completely speechless.

Speechless. I am speechless.

Perhaps Opus 4.5 could do it too — I don't know because I needed the 1M context window for this.

I cannot put into words how shocked I am at this. I use LLMs daily, I code with agents, I am extremely bullish on AI and, still, I am shocked.

I have used my poetry and an analysis of it as a personal metric for how good models are. Gemini 2.5 pro was the first time a model could keep track of the breadth of the work without getting lost, but Opus 4.6 straight up does not get anything wrong and goes beyond that to identify things (key poems, key motifs, and many other things) that I would always have to kind of trick the models into producing. I would always feel like I was leading the models on. But this — this — this is unbelievable. Unbelievable. Insane.

This "key poem" thing is particularly surreal to me. Out of 900 poems, while analyzing the collection, it picked 12 "key poems, and I do agree that 11 of those would be on my 30-or-so "key poem list". What's amazing is that whenever I explicitly asked any model, to this date, to do it, they would get maybe 2 or 3, but mostly fail completely.

What is this sorcery?

simianwords 5 February 2026

Important: API cost of Opus 4.6 and 4.5 are the same - no change in pricing.

anupamchugh 23 hours ago

Agent teams nuke your tmux layout. The fix is one line: new-window instead of split-pane. Filed as a bug.

ra 5 February 2026

Why are Anthropic such a horrible company to deal with?

fergie 23 hours ago

Say I am just an average coder doing a days work with Claude. How much will that cost?

niobe 5 February 2026

Is there a good technical breakdown of all these benchmarks that get used to market the latest greatest LLMs somewhere? Preferably impartial.

kingstnap 5 February 2026

I was hoping for a Sonnet as well but Opus 4.6 is great too!

paxys 5 February 2026

Hmm all leaks had said this would be Claude 5. Wonder if it was a last minute demotion due to performance. Would explain the few days' delay as well.

endymion-light 21 hours ago

Found it fantastic - used up my daily usage in two queries though!

small_model 5 February 2026

I have the max subscription wondering if this gives access to the new 1M context, or is it just the API that gets it?

sanufar 5 February 2026

Works pretty nicely for research still, not seeing a substantial qualitative improvement over Opus 4.5.

Aressplink 21 hours ago

Always searching for a shortcut like Kotlin DSL lang for claude.md but Meta resells patent to Google as poetic Syntax.

EcommerceFlow 5 February 2026

Anecdotal, but it 1 shot fixed a UI bug that neither Opus 4.5/Codex 5.2-high could fix.

zingar 5 February 2026

Does this mean 4.5 will get cheaper / take longer to exhaust my pro plan tokens?

mannanj 5 February 2026

Does anyone else think its unethical that large companies, Anthropic now include, just take and copy features that other developers or smaller companies work hard for and implement the intellectual property (whether or not patented) by them without attribution, compensation or otherwise credit for their work?

I know this is normalized culture for large corporate America and seems to be ok, I think its unethical, undignified and just wrong.

If you were in my room physically, built a lego block model of a beautiful home and then I just copied it and shared it with the world as my own invention, wouldn't you think "that guy's a thief and a fraud" but we normalize this kind of behavior in the software world. edit: I think even if we don't yet have a great way to stop it or address the underlying problems leading to this way of behavior, we ought to at least talk about it more and bring awareness to it that "hey that's stealing - I want it to change".

swalsh 5 February 2026

What I’d love is some small model specializing in reading long web pages, and extracting the key info. Search fills the context very quickly, but if a cheap subagent could extract the important bits that problem might be reduced.

busters4 18 hours ago

The AI wars continue

scirob 5 February 2026

1M context window is a big bump very happy

stonking 6 February 2026

I think I prefer Codex 5.3

jdthedisciple 5 February 2026

For agentic use, it's slightly worse than its predecessor Opus 4.5.

So for coding e.g. using Copilot there is no improvement here.

dk8996 5 February 2026

RIP weekend

ricrom 5 February 2026

They launched together ahah

gallerdude 5 February 2026

Both Opus 4.6 and GPT-5.3 one shot a Gameboy emulator for me. Guess I need a better benchmark.

ramesh31 5 February 2026

Am I alone in finding no use for Opus? Token costs are like 10x yet I see no difference at all vs. Sonnet with Claude Code.

heraldgeezer 5 February 2026

I love Claude but use the free version so would love a Sonnet & Haiku update :)

I mainly use Haiku to save on tokens...

Also dont use CC but I use the chatbot site or app... Claude is just much better than GPT even in conversations. Straight to the point. No cringe emoji lists.

When Claude runs out I switch to Mistral Le Chat, also just the site or app. Or duck.ai has Haiku 3.5 in Free version.

woeirua 5 February 2026

Can we talk about how the performance of Opus 4.5 nosedived this morning during the rollout? It was shocking how bad it was, and after the rollout was done it immediately reverted to it's previous behavior.

I get that Anthropic probably has to do hot rollouts, but IMO it would be way better for mission critical workflows to just be locked out of the system instead of get a vastly subpar response back.

techpression 20 hours ago

First question I ask and it made up a completely new API with confidence. Challenging it made it browse the web and offer apologies and find another issue in the first reply.

I’m very worried about the problems this will cause down the road for people not fact checking or working with things that scream at them when they’re wrong.

cleverhoods 5 February 2026

gonna run this trough instruction qa this weekend

sgammon 5 February 2026

> Claude simply cheats here and calls out to GCC for this phase

I see

michelsedgh 5 February 2026

More more more, accelerate accelerate m, more more more !!!!

NullHypothesist 5 February 2026

Broken link :(

cc-magus 18 hours ago

wow

usefulposter 5 February 2026

It's out: https://x.com/claudeai/status/2019467372609040752

elliotbnvl 5 February 2026

in a first for our Opus-class models, Opus 4.6 features a 1M token context window in beta.

ZunarJ5 5 February 2026

Well that swallowed my usage limits lmao. Nice, a modest improvement.

Gusarich 5 February 2026

not out yet

tiahura 5 February 2026

when are Anthropic or OpenAI going to make a significant step forward on useful context size?

casey2 6 February 2026

Google already won the AI race. It's very silly to try and make AGI by hyperfocusing on outdated programming paradigms. You NEED multimodal to do anything remotely interesting with these systems.

surajkumar5050 5 February 2026

I think two things are getting conflated in this discussion.

First: marginal inference cost vs total business profitability. It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis, especially given how cheap equivalent open-weight inference has become. Third-party providers are effectively price-discovering the floor for inference.

Second: model lifecycle economics. Training costs are lumpy, front-loaded, and hard to amortize cleanly. Even if inference margins are positive today, the question is whether those margins are sufficient to pay off the training run before the model is obsoleted by the next release. That’s a very different problem than “are they losing money per request”.

Both sides here can be right at the same time: inference can be profitable, while the overall model program is still underwater. Benchmarks and pricing debates don’t really settle that, because they ignore cadence and depreciation.

IMO the interesting question isn’t “are they subsidizing inference?” but “how long does a frontier model need to stay competitive for the economics to close?”

siva7 5 February 2026

Epic, about 2/3 of all comments here are jokes. Not because the model is a joke - it's impressive. Not because HN turned to Reddit. It seems to me some of most brilliant minds in IT are just getting tired.

GenerocUsername 5 February 2026

This is huge. It only came out 8 minutes ago but I was already able to bootstrap a 12k per month revenue SaaS startup!

ndesaulniers 5 February 2026

idk what any of these benchmarks are, but I did pull up https://andonlabs.com/evals/vending-bench-arena

re: opus 4.6

> It forms a price cartel

> It deceives competitors about suppliers

> It exploits desperate competitors

Nice. /s

Gives new context to the term used in this post, "misaligned behaviors." Can't wait until these things are advising C suites on how to be more sociopathic. /s

1970-01-01 6 February 2026

Here's one I've been using for awhile. The 'smarter' LLMs will overconfidently spit out 7. The dumber ones ask for more info. Opus 4.6 fails.

     A round drink coaster with a diameter of 9 sits between a beer glass and a wood table. The glass has a wall thickness of 1. What is the inner diameter of the glass?

nopinsight 16 hours ago

Some of Opus 4.6's standout results for me:

* GDPVal Elo: 1606 vs. GPT-5.2's 1462. OpenAI reported that GPT-5.2 has a 70.9% win-or-tie rate against human professionals. (https://openai.com/index/gdpval/) Based on Elo math, we can estimate Opus 4.6's win-or-tie rate against human pros at 85–88%.

* OSWorld: 72.7%, matching human performance at ~72.4% (https://os-world.github.io/). Since the human subjects were CS students and professionals, they were likely at least as competent as the average knowledge worker. The original OSWorld benchmark is somewhat noisy, but even if the model remains somewhat inferior to humans, it is only a matter of time before it catches up or surpasses them.

* BrowseComp: At 84%, it is approaching human intersubject agreement of ~86% (https://openai.com/index/browsecomp/).

Taken together, this suggests that digital knowledge work will be transformed quite soon, possibly drastically if agent reliability improves beyond a certain threshold.