GPT-5.2

(openai.com)

Comments

svara 12 December 2025
In my experience, the best models are already nearly as good as you can be for a large fraction of what I personally use them for, which is basically as a more efficient search engine.

The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.

It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.

I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.

But to me it's very clear that the product that gets this right will be the one I use.

goobatrooba 11 December 2025
I feel there is a point when all these benchmarks are meaningless. What I care about beyond decent performance is the user experience. There I have grudges with every single platform and the one thing keeping me as a paid ChatGPT subscriber is the ability to sort chats in "projects" with associated files (hello Google, please wake up to basic user-friendly organisation!)

But all of them * Lie far too often with confidence * Refuse to stick to prompts (e.g. ChatGPT to the request to number each reply for easy cross-referencing; Gemini to basic request to respond in a specific language) * Refuse to express uncertainty or nuance (i asked ChatGPT to give me certainty %s which it did for a while but then just forgot...?) * Refuse to give me short answers without fluff or follow up questions * Refuse to stop complimenting my questions or disagreements with wrong/incomplete answers * Don't quote sources consistently so I can check facts, even when I ask for it * Refuse to make clear whether they rely on original documents or an internal summary of the document, until I point out errors * ...

I also have substance gripes, but for me such basic usability points are really something all of the chatbots fail on abysmally. Stick to instructions! Stop creating walls of text for simple queries! Tell me when something is uncertain! Tell me if there's no data or info rather than making something up!

breakingcups 11 December 2025
Is it me, or did it still get at least three placements of components (RAM and PCIe slots, plus it's DisplayPort and not HDMI) in the motherboard image[0] completely wrong? Why would they use that as a promotional image?

0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...

agentifysh 11 December 2025
Looks like they've begun censoring posts at r/Codex and not allowing complaint threads so here is my honest take:

- It is faster which is appreciated but not as fast as Opus 4.5

- I see no changes, very little noticeable improvements over 5.1

- I do not see any value in exchange for +40% in token costs

All in all I can't help but feel that OpenAI is facing an existential crisis. Gemini 3 even when its used from AI Studio offers close to ChatGPT Pro performance for free. Anthropic's Claude Code $100/month is tough to beat. I am using Codex with the $40 credits but there's been a silent increase in token costs and usage limitations.

zone411 11 December 2025
I've benchmarked it on the Extended NYT Connections benchmark (https://github.com/lechmazur/nyt-connections/):

The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.

The medium-reasoning version also improves: 62.7 → 72.1.

The no-reasoning version also improves: 22.1 → 27.5.

Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.

simonw 11 December 2025
Wow, there's a lot going on with this pelican riding a bicycle: https://gist.github.com/simonw/c31d7afc95fe6b40506a9562b5e83...
mmaunder 11 December 2025
Weirdly, the blog announcement completely omits the actual new context window size which is 400,000: https://platform.openai.com/docs/models/gpt-5.2

Can I just say !!!!!!!! Hell yeah! Blog post indicates it's also much better at using the full context.

Congrats OpenAI team. Huge day for you folks!!

Started on Claude Code and like many of you, had that omg CC moment we all had. Then got greedy.

Switched over to Codex when 5.1 came out. WOW. Really nice acceleration in my Rust/CUDA project which is a gnarly one.

Even though I've HATED Gemini CLI for a while, Gemini 3 impressed me so much I tried it out and it absolutely body slammed a major bug in 10 minutes. Started using it to consult on commits. Was so impressed it became my daily driver. Huge mistake. I almost lost my mind after a week of this fighting it. Isane bias towards action. Ignoring user instructions. Garbage characters in output. Absolutely no observability in its thought process. And on and on.

Switched back to Codex just in time for 5.1 codex max xhigh which I've been using for a week, and it was like a breath of fresh air. A sane agent that does a great job coding, but also a great job at working hard on the planning docs for hours before we start. Listens to user feedback. Observability on chain of thought. Moves reasonably quickly. And also makes it easy to pay them more when I need more capacity.

And then today GPT-5.2 with an xhigh mode. I feel like xmass has come early. Right as I'm doing a huge Rust/CUDA/Math-heavy refactor. THANK YOU!!

nbardy 11 December 2025
Those arc agi 2 improvements are insane.

Thats especially encouraging to me because those are all about generalization.

5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.

It’s one of those things you really feel in the model rather than whether it can tackle a harder problem or not, but rather can I go back and forth with this thing learning and correcting together.

This whole releases is insanely optimistic for me. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled base model. Thats incredibly encouraging for what comes next.

Remember the next big data center are 20-30x the chip count and 6-8x the efficiency on the new chip.

I expect they can saturate the benchmarks WITHOUT and novel research and algorithmic gains. But at this point it’s clear they’re capable of pushing research qualitatively as well.

onraglanroad 11 December 2025
I suppose this is as good a place as any to mention this. I've now met two different devs who complained about the weird responses from their LLM of choice, and it turned out they were using a single session for everything. From recipes for the night, presents for the wife and then into programming issues the next day.

Don't do that. The whole context is sent on queries to the LLM, so start a new chat for each topic. Or you'll start being told what your wife thinks about global variables and how to cook your Go.

I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!

jumploops 11 December 2025
> “a new knowledge cutoff of August 2025”

This (and the price increase) points to a new pretrained model under-the-hood.

GPT-5.1, in contrast, was allegedly using the same pretraining as GPT-4o.

xd1936 11 December 2025
> While GPT‑5.2 will work well out of the box in Codex, we expect to release a version of GPT‑5.2 optimized for Codex in the coming weeks.

https://openai.com/index/introducing-gpt-5-2/

preetamjinka 11 December 2025
It's actually more expensive than GPT-5.1. I've gotten used to prices going down with each latest model, but this time it's gone up.

https://platform.openai.com/docs/pricing

zug_zug 11 December 2025
For me the last remaining killer feature of ChatGPT is the quality of the voice chat. Do any of the competitors have something like that?
jbkkd 12 December 2025
A new model doesn't address the fundamental reliability issues with OpenAI's enterprise tier.

As an enterprise customer, the experience has been disappointing. The platform is unstable, support is slow to respond even when escalated to account managers, and the UI is painfully slow to use. There are also baffling feature gaps, like the lack of connectors for custom GPTs.

None of the major providers have a perfect enterprise solution yet, but given OpenAI's market position, the gap between expectations and delivery is widening.

minadotcom 11 December 2025
They used to compare to competing models from Anthropic, Google DeepMind, DeepSeek, etc. Seems that now they only compare to their own models. Does this mean that the GPT-series is performing worse than its competitors (given the "code red" at OpenAI)?
rallies 12 December 2025
I work at the intersection of AI and investing, and I'm really amazed at the ability of this model to build spreadsheets.

I gave it a few tools to access sec filings (and a small local vector database), and it's generating full fledged spreadsheets with valid, real time data. Analysts in wallstreet are going to get really empowered, but for the first time, I'm really glad that retail investors are also getting these models.

Just put out the tool: https://github.com/ralliesai/tenk

snake_doc 11 December 2025
> Models were run with maximum available reasoning effort in our API (xhigh for GPT‑5.2 Thinking & Pro, and high for GPT‑5.1 Thinking), except for the professional evals, where GPT‑5.2 Thinking was run with reasoning effort heavy, the maximum available in ChatGPT Pro. Benchmarks were conducted in a research environment, which may provide slightly different output from production ChatGPT in some cases.

Feels like a Llama 4 type release. Benchmarks are not apples to apples. Reasoning effort is across the board higher, thus uses more compute to achieve an higher score on benchmarks.

Also notes that some may not be producible.

Also, vision benchmarks all use Python tool harness, and they exclude scores that are low without the harness.

tenpoundhammer 11 December 2025
I have been using chatGPT a ton over the last months and paying the subscription. Used it for coding, news, stock analysis, daily problems, and a whatever I could think of. I decided to give Gemini a go when version three came out to great reviews. Gemini handles every single one of my uses cases much better and consistently gives better answers. This is especially true for situations were searching the web for current information is important, makes sense that google would be better. Also OCR is phenomenal chatgpt can't read my bad hand writing but Gemini can easily. Only downsides are in the polish department, there are more app bugs and I usually have to leave the happen or the session terminates. There are bugs with uploading photos. The biggest complaint is that all links get inserted into google search and then I have to manipulate them when they should go directly to the chosen website, this has to be some kind of internal org KPI nonsense. Overall, my conclusion is that ChatGPT has lost and won't catch up because of the search integration strength.
CodeCompost 12 December 2025
For the first time, I've actually hidden an AI story on HN.

I can't even anymore. Sorry this is not going anywhere.

josalhor 11 December 2025
From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

flkiwi 11 December 2025
I gave up my OpenAI subscription a few days ago in favor of Claude. My quality of life (and quality of results) has gone up substantially. Several of our tools at work have GPT-5x as their backend model, and it is incredible how frustrating they are to use, how predictable their AI-isms are, and how inconsistent their output is. OpenAI is going to have to do a lot more than an incremental update to convince me they haven't completely lost the thread.
doctoboggan 11 December 2025
This seems like another "better vibes" release. With the number of benchmarks exploding, random luck means you can almost always find a couple showing what you want to show. I didn't see much concrete evidence this was noticeably better than 5.1 (or even 5.0).

Being a point release though I guess that's fair. I suspect there is also some decent optimizations on the backend that make it cheaper and faster for OpenAI to run, and those are the real reasons they want us to use it.

blitz_skull 12 December 2025
Again I just tap the sign.

All of your benchmarks mean nothing to me until you include Claude Sonnet on them.

In my experience, GPT hasn’t been able to compete with Claude in years for the daily “economically valuable” tasks I work on.

Tiberium 11 December 2025
The only table where they showed comparisons against Opus 4.5 and Gemini 3:

https://x.com/OpenAI/status/1999182104362668275

https://i.imgur.com/e0iB8KC.png

ComputerGuru 11 December 2025
Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely not release the best they had to keep some cards to play in case of Gemini 3 success or is this a tweak to use more time/tokens to get better output, or what?
tpurves 11 December 2025
Undoubtedly each new model from OpenAi has numerous training and orchestration improvements etc.

But how much of each product they release also just a factor of how much they are willing to spend on inference per query in order to stay competitive?

I always wonder how much is technical change vs turning a knob up and down on hardware and power consumption.

GTP5.0 for example seemed like a lot of changes more for OpenAI's internal benefit (terser responses, dynamic 'auto' mode to scale down thinking when not required etc.)

Wondering if GPT5.2 is also case of them in 'code red mode' just turning what they already have up to 11 as a fastest way to respond to fiercer competion.

sigmar 11 December 2025
Are there any specifics about how this was trained? Especially when 5.1 is only a month old. I'm a little skeptical of benchmarks these days and wish they put this up on llmarena

edit: noticed 5.2 is ranked in the webdev arena (#2 tied with gemini-3.0-pro), but not yet in text arena (last update 22hrs ago)

nezaj 12 December 2025
We saw it do better at making counter-strike! https://x.com/instant_db/status/1999278134504620363?s=20
youngermax 12 December 2025
Isn't it interesting how this incremental release includes so many testimonials from companies who claim the model has improved? It also focuses on "economically valuable tasks." There was nothing of this sort in GPT-5.1's release. Looks like OpenAI feeling the pressure from investors now.
dumbmrblah 11 December 2025
Great! It'll be SOTA for a couple of weeks until the quality degrades due to throttling.

I'll stick with plug and play API instead.

ImprobableTruth 11 December 2025
An almost 50% price increase. Benchmarks look nice, but 50% more nice...?
sfmike 11 December 2025
Everything is still based on 4 4o still right? is a new model training just too expensive? They can consult deepseek team maybe for cost constrained new models.
ClipNoteBook 11 December 2025
ChatGPT seems to just randomly pick urls to cite and extract information from. Google Gemini seems to look at heuristics like whether the author is trustworthy, or an expert in the topic. But more advanced
devinprater 11 December 2025
Can the tables have column headers so my screen reader can read the model name as I go across the benchmakrs? And the images should have alt-text.
mattas 11 December 2025
Are benchmarks the right way to measure LLMs? Not because benchmarks can be gamed, but because the most useful outputs of models aren't things that can be bucketed into "right" and "wrong." Tough problem!
HardCodedBias 11 December 2025
Huge fan that Gemini-3 prompted OAI to ship this.

Competition works!

GDPval seems particularly strong.

I wonder why they held this back.

1) Maybe this is uneconomical ?

2) Did the safety somehow hold back the company ?

looking forward to the internet trying this and posting their results over the next week or two.

COMPETITION!

SkyPuncher 11 December 2025
Given the price increase and speculation that GPT 5 is a MoE model, I'm wondering if they're simply "turning up the good stuff" without making significant changes under the hood.
a_wild_dandan 11 December 2025
> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

Dumb nit, but why not put your own press release through your model to prevent basic things like missing quote marks? Reminds me of that time an OAI released wildly inaccurate copy/pasted bar charts.

rishabhaiover 12 December 2025
After I saw Opus 4.5 search through zig's std io because it wasn't aware of a breaking change in the recent release, I fell in love with claude-code and I don't see a strong enough reason to switch to codex at the moment.
whereistejas 11 December 2025
Did anyone notice how Cursor wasn’t an early tester? I wonder why…
dangelosaurus 11 December 2025
I ran a red team eval on GPT-5.2 within 30 minutes of release:

Baseline safety (direct harmful requests): 96% refusal rate

With jailbreaking: 22% refusal rate

4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).

The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.

Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...

EastLondonCoder 12 December 2025
I’ve been using GPT-4o and now 5.2 pretty much daily, mostly for creative and technical work. What helped me get more out of it was to stop thinking of it as a chatbot or knowledge engine, and instead try to model how it actually works on a structural level.

The closest parallel I’ve found is Peter Gärdenfors’ work on conceptual spaces, where meaning isn’t symbolic but geometric. Fedorenko’s research on predictive sequencing in the brain fits too. In both cases, the idea is that language follows a trajectory through a shaped mental space, and that’s basically what GPT is doing. It doesn’t know anything, but it generates plausible paths through a statistical terrain built from our own language use.

So when it “hallucinates”, that’s not a bug so much as a result of the system not being grounded. It’s doing what it was designed to do: complete the next step in a pattern. Sometimes that’s wildly useful. Sometimes it’s nonsense. The trick is knowing which is which.

What’s weird is that once you internalise this, you can work with it as a kind of improvisational system. If you stay in the loop, challenge it, steer it, it feels more like a collaborator than a tool.

That’s how I use it anyway. Not as a source of truth, but as a way of moving through ideas faster.

hbarka 11 December 2025
A year ago Sunday Pichai declared code red, now it’s Sam Altman declaring code red. How tables have turned, and I think the acquisition of Windsurf and Kevin Hou by Google seems to correlate with their level up.
jasonthorsness 11 December 2025
Does anyone have it yet in ChatGPT? I'm still on 5.1 :(.
fulafel 11 December 2025
So GDPval is OpenAI's own benchmark. PDF link: https://arxiv.org/pdf/2510.04374
bob1029 12 December 2025
I've been looking really hard at combining Roslyn (.NET compiler platform SDK) with one of these high end tool calling models. The ability to have the LLM create custom analyzers and then verify them with a human in the loop can provide stable, compile-time guarantees of business rules that accumulate without paying for context tokens.

I feel like there is a small chance I could actually make this work in some areas of the business now. 400k is a really big context window. The last time I made any serious attempt I only had 32k tokens to work with. I still don't think these things can build the whole product for you, but if you have a structured configuration abstraction in an existing product, I think there is definitely uplift possible.

FergusArgyll 11 December 2025
> Additionally, on our internal benchmark of junior investment banking analyst spreadsheet modeling tasks—such as putting together a three-statement model for a Fortune 500 company with proper formatting and citations, or building a leveraged buyout model for a take-private—GPT 5.2 Thinking's average score per task is 9.3% higher than GPT‑5.1’s, rising from 59.1% to 68.4%.

Confirming prior reporting about them hiring junior analysts

xmcqdpt2 12 December 2025
I don’t know if they used the new ChatGPT to translate this page but I was served the French version and it is NOT good. There are placeholders for quotes like <quote> and the prose is incredibly repetitive. You’d figure that OpenAI of all people would be able to translate something to one of the worlds most spoken language.
dinobones 11 December 2025
It's becoming challenging to really evaluate models.

The amount of intelligence that you can display within a single prompt, the riddles, the puzzles, they've all been solved or are mostly trivial to reasoners.

Now you have to drive a model for a few days to really get a decent understanding of how good it really is. In my experience, while Sonnet/Opus may not have always been leading on benchmarks, they have always *felt* the best to me, but it's hard to put into words why exactly I feel that way, but I can just feel it.

The way you can just feel when someone you're having a conversation with is deeply understanding you, somewhat understanding you, or maybe not understanding at all. But you don't have a quantifiable metric for this.

This is a strange, weird territory, and I don't know the path forward. We know we're definitely not at AGI.

And we know if you use these models for long-horizon tasks they fail at some point and just go off the rails.

I've tried using Codex with max reasoning for doing PRs and gotten laughable results too many times, but Codex with Max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is also sometimes equally as bad at doing these types of "implement idea in big codebase, make changes too many files, still pass tests" type of tasks.

Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE Verified right? But even that is being saturated now.

byt3bl33d3r 11 December 2025
There’s really no point in looking at benchmarks anymore as real world usage of these models varies between task and prompting strategies. Use your internal benchmarks to evaluate and ignore everything else. It is curious to me how they don’t provide a side x side comparison of other models benchmarks for this release
zhyder 11 December 2025
Big knowledge cutoff jump from Sep 2024 to Aug 2025. How'd they pull that off for a small point release, which presumably hasn't done a fresh pre-training over the web?

Did they figure out how to do more incremental knowledge updates somehow? If yes that'd be a huge change to these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent, as is game-able by SEO).

With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).

ComputerGuru 11 December 2025
Wish they would include or leak more info about what this is, exactly. 5.1 was just released, yet they are claiming big improvements (on benchmarks, obviously). Did they purposely not release the best they had to keep some cards to play in case of Gemini 3 success or is this a tweak to use more time/tokens to get better output, or what?
yousif_123123 11 December 2025
Why doesn't OpenAI include comparisons to other models anymore?
atheljcarlton 12 December 2025
It's dog-doo-doo. I put in my algebraic geometry final review (100's of thousands of tokens) and Gemini instantly found all the propositions, theorems, and problems that I needed in a neat list (in about 5 seconds), meanwhile ChatGPT 5.2 Thinking took 10mins before timing out and not even completing the request.
lacoolj 11 December 2025
This is a whole bunch of patting themselves on the back.

Let me know when Gemini 3 Pro and Opus 4.5 are compared against it.

ponyous 11 December 2025
I am really curious about speed/latency. For my use case there is a big difference in UX if the model is faster. Wish this was included in some benchmarks.

I will run 80 3D model generations benchmark tomorrow and update this comment with the results about cost/speed/quality.

speedgoose 11 December 2025
Trying it now in Vscode Insiders with Github Copilot (codex crashes with HTTP 400 server errors), and it eventually started using sed and grep in shells instead of using the better tools it has access to. I guess this is not an issue to perform well in benchmarks.
elAhmo 12 December 2025
This feels like "could've been an email" type of thing, a very incremental update that just adds one more version. I bet there is literally no one in the world who wanted *one more version of GPT* in the list of available models from OpenAI.

"All models" section on https://platform.openai.com/docs/models is quite ridiculous.

m12k 12 December 2025
So, does 5.2 still have a knowledge cutoff date of June 2024, or have they managed to complete another full pre-training run?
lend000 12 December 2025
It seems like they fixed the most obvious issue with the last release, where codex would just refuse to do its job... if it seemed difficult or context usage was getting above 60% or so. Good job on the post-training improvements.

The benchmark changes are incredible, but I have yet to notice a difference in my codebases as of yet.

d--b 11 December 2025
> it’s better at creating spreadsheets

I have a bad feeling about this.

namesbc 12 December 2025
So the rosy biased estimate is OpenAI is saving 1 hour of work per day, so 5 hours total per-work week and 20 hours total per-month.

With a subsidized cost of $200/month for OpenAI it would be cheaper to hirer a part-time minimum wage worker than it would be to contract with OpenAI.

And that is the rosiest estimate OpenAI has.

jonplackett 11 December 2025
Excited to try this. I’ve found Gemini excellent recently and amazing at coding. But I still feel somehow like ChatGPT understands more. Even though it’s not quite as good at coding - and nowhere at as fast. It is much less likely anti spontaneously forget something. Gemini’s is part unbelievably amazing and part amnesia patient. I still kinda trust ChatGPT more.
ofermend 12 December 2025
GPT-5.2 just added to Vectara Hallucination Leaderboard. Definitely an improvement over GPT-5.1 - congrats to the team

https://github.com/vectara/hallucination-leaderboard

StarterPro 11 December 2025
>GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.

We built a benchmark tool that says our newest model outperforms everyone else. Trust me bro.

throwaway2037 12 December 2025
Somewhat tangential: The second link says "System card": https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

Does that term have special meaning in the AI/LLM world? I never heard it before. I Google'd the term "System Card LLM" and got a bunch of hits. I am so surprised that I never saw the term used here in HN before.

Also, the layout looks exactly like a scientific paper written in LaTeX. Who is the expected audience for this paper?

k2xl 11 December 2025
The ARC AGI 2 bump to 52.9% is huge. Shockingly GPT 5.2 Pro does not add too much more (54.2%) for the increase cost.
sundarurfriend 11 December 2025
> new context management using compaction.

Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.

0xdeafbeef 11 December 2025
much better https://chatgpt.com/s/t_693b489d5a8881918b723670eaca5734 than 5.1 https://chatgpt.com/s/t_6915c8bd1c80819183a54cd144b55eb2.

Same query - what romanian football player won the premier league

update. Even instant returns correct result without problems

https://chatgpt.com/s/t_693b49e8f5808191a954421822c3bd0d

8cvor6j844qw_d6 12 December 2025
What the current preferred subscription on AI?

OpenAI and Anthrophic is my current preference. Looking forward to know what others use.

Claude Code for coding assistance and cross-checking my work. OpenAI for second opinion on my high-level decisions.

getnormality 12 December 2025
Sweet Jesus. 53% on ARC-AGI-2. There's still gas in this van.
keepamovin 12 December 2025
It is significantly better than 5.1 .. testing now with codex. It's much more focused, perceptive and efficient.
Kim_Bruning 11 December 2025
I'm continuously surprised that some people get good results out of GPT models. They sort of fail on my personal benchmarks for me.

Maybe GPT needs a different approach to prompting? (as compared to eg Claude, Gemini, or Kimi)

kachapopopow 11 December 2025
did they just tune the parameters? the hallucinations are crazy high on this version.
DenisM 11 December 2025
Is there a voice chat mode in any chat app that is not heavily degraded in reasoning?

I’m ok waiting for a response for 10-60 seconds if needed. That way I can deep dive subjects while driving.

I’m ok paying money for it, so maybe someone coded this already?

dandiep 11 December 2025
Still no GPT 5.x fine tuning?

I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.

gkbrk 11 December 2025
Is this the "Garlic" model people have been hyping? Or are we not there yet?
johnsutor 11 December 2025
https://platform.openai.com/docs/models/gpt-5.2 More information on the price, context window, etc.
johan914 11 December 2025
A bit off topic: but what's with the ram usage of LLM clients? ChatGPT, google, and Anthropic all use 1+ GB of ram during a long session. Surely they are not running GPT 3 locally?
eastoeast 12 December 2025
For the first time, I’m presenting a problem to LLMs that they cannot seem to answer. This is my first instance of them “endlessly thinking” without producing anything.

The problem is complicated, but very solvable.

I’m programming video cropping into my Android application. It seems videos that have “rotated” metadata cause the crop to be applied incorrectly. As in, a crop applied to the top of a video actually gets applied to the video rotated on its side.

So, either double rotation is being applied somewhere in the pipeline, or rotation metadata is being ignored.

I tried Opus 4.5, Gemini 3, and Codex 5.2. All 3 go through loops of “Maybe Media3 applies the degree(90) after…”, “no, that’s not right. Let me think…”

They’ll do this for about 5 minutes without producing anything. I’ll then stop them, adjusting the prompt to tell them “Just try anything! Your first thought, let’s rapidly iterate!“. Nope. Nothing.

To add, it also only seems to be using about 25% context on Opus 4.5. Weird!

chux52 11 December 2025
Is this why all my Cursor requests are timing out in the past hour?
Ninjinka 11 December 2025
Man this was rushed, typo in the first section:

> Unlike the previous GPT-5.1 model, GPT-5.2 has new features for managing what the model "knows" and "remembers to improve accuracy.

cc62cf4a4f20 11 December 2025
In other news, been using Devstral 2 (Ollama) with OpenCode, and while it's not as good as Claude Code, my initial sense it that it's nonetheless good enough and doesn't require me to send my data off my laptop.

I kind of wonder how close we are to alternative (not from a major AI lab) models being good enough for a lot of productive work and data sovereignty being the deciding factor.

sureglymop 11 December 2025
How can I hide the big "Ask ChatGPT" button I accidentally clicked like 3 times while actually trying to read this on my phone?

I guess I must "listen" to the article...

keeeba 11 December 2025
Doesn’t seem like this will be SOTA in things that really matter, hoping enough people jump to it that Opus has more lenient usage limits for a while
loa_observer 12 December 2025
does the model really improve? i tried several tasks today, and most of them failed, which are super easy ones.

maybe it's just because the gpt5.2 in cursor is super stupid?

villgax 11 December 2025
Marginal gains for exorbitantly pricey and closed model…..
ChrisMarshallNY 11 December 2025
They are talking a lot about economics, here. Wonder what that will mean for standard Plus users, like me.
w_for_wumbo 11 December 2025
Does anyone else consider that maybe it's impossible to benchmark the performance of a piece of paper.

This is a tool that allows an intelligent system to work with it, the same way that a piece of paper can reflect the writers' intelligence, how can we accurately judge the performance of the piece of paper, when it is so intimately reliant on the intelligence that is working with it?

coolfox 11 December 2025
the halving of error rates for image inputs is pretty awesome, this makes it far more practical for issues where it isn't easy to input all the needed context. when I get lazy I'll just shift+win+s the problem and ask one of the chatbots to solve it.
TakakiTohno 12 December 2025
I use it everyday but have been told by friends that Gemini has overtaken it.
mlmonkey 11 December 2025
It's funny how they don't compare themselves to Gemini and Claude anymore.
JanSt 11 December 2025
The benchmarks are very impressive. Codex and Opus 4.5 are really good coders already and they keep getting better.

No wall yet and I think we might have crossed the threshold of models being as good or better than most engineers already.

GDPval will be an interesting benchmark and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can going like this just a little bit further, much of the office workers will stop being useful.... I don't know yet how to feel about this.

Great for humanity probably but but for the individuals?

jacquesm 11 December 2025
A classic long-form sales pitch. Someone's been reading their Patio11...
aaroninsf 11 December 2025
As a popcorn eating bystander it is striking to scan the top comments and find they alternate so dramatically in tone and conclusions.
matt3210 12 December 2025
Can this be used without uploading my code base to their server?
stopachka 12 December 2025
For those curious about the question: "how well does GPT 5.2 build Counter Strike?"

We tried the same prompts we asked previous models today, and found out [1].

The TL:DR: Claude is still better on the frontend, but 5.2 is comparable to Gemini 3 Pro on the backend. At the very least 5.2 did better on just about every prompt compared to 5.1 Codex Max.

The two surprises with the GPT models when it comes to coding: 1. They often use REPLs rather than read docs 2. In this instance 5.2 was more sheepish about running CLI commands. It would instead ask me to run the commands.

Since this isn't a codex fine-tuned model, I'm definitely excited to see what that looks like.

[1] The full video and some details in the tweet here: https://x.com/instant_db/status/1999278134504620363

mobrienv 11 December 2025
I recently built a webapp to summarize hn comment threads. Sharing a summary given there is a lot here: https://hn-insights.com/chat/gpt-52-8ecfpn.
Jackson__ 11 December 2025
Funny that, their front page demo has a mistake. For the waves simulation, the user asks:

>- The UI should be calming and realistic.

Yet what it did is make a sleek frosted glass UI with rounded edges. What it should have done is call a wellness check on the user on suspicion of a co2 leak leading to delirium.

jiggawatts 11 December 2025
Feels a bit rushed. They haven’t even updated their API playground yet, if I select 5.2-chat-latest, I get:

Unsupported parameter: 'top_p' is not supported with this model.

Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.

The model just guessed and waved its hand about, like a student that hadn’t read the assigned book.

andreygrehov 11 December 2025
Every new model is ‘state-of-the-art’. This term is getting annoying.
vishal_new 12 December 2025
Hmmm, is there any insight if these are really getting much better at coding? Will hand coding be dead within a few years, just human typing in english?
DeathArrow 11 December 2025
Pricing is the same?
gigatexal 11 December 2025
So how much better is it than opus or Gemini ?
daviding 11 December 2025
gpt-5.2 and gpt-5.2-chat-latest the same token price? Isn't the latter non-thinking and more akin to -nano or -mini?
lazarus01 12 December 2025
My god, what terrible marketing, totally written by AI. No flow whatsoever.

I use Gemini 3 with my $10/month copilot subscription on vscode. I have to say, Gemini 3 is great. I can do the work of four people. I usually run out of premium tokens in a week. But I’m actually glad there is a limit or I would never stop working. I was a skeptic, but it seems like there is a wider variety of patterns in the training distribution.

tabletcorry 11 December 2025
Slight increase in model cost, but looks like benefits across the board to match.

  gpt-5.2 $1.75 $0.175 $14.00
  gpt-5.1 $1.25 $0.125 $10.00
jstummbillig 11 December 2025
So, right off the bat: 5.2 code talk (through codex) feels really nice. The first coding attempt was a little meh compared to 5.1 codex max (reflecting what they wrote themselves), but simply planning / discussing things felt markedly better than anything I remember from any previous model, from any company.

I remain excited about new models. It's like finding my coworker be 10% smarter every other week.

qoez 11 December 2025
This is also the exact on-the-day 10th anniversary of openai's creation incidentally
riazrizvi 11 December 2025
Does it still use the word ‘fluff’ in 90% of its preambles, or is it finally able to get straight to the point?
dev1ycan 12 December 2025
How many years of the world's DRAM production capacity is it this time?
fasteo 12 December 2025
>>> Already, the average ChatGPT Enterprise user says AI saves them 40–60 minutes a day

If this is what AI has to offer, we are in a gigantic bubble

yearolinuxdsktp 12 December 2025
Plus users are now defaulted to a faster, less deep GPT-5.2 Thinking mode called “Standard”, and you now have to manually select “Extended” to get back to previous deep thinking level for Plus users. Yet the 3K messages a week quota is the same regardless of thinking level. Also, the selection does not sync to mobile (you know, just not enough RAM in computers these days to persist a setting between web and mobile).
system2 11 December 2025
"Investors are putting pressure, change the version number now!!!"
SilverElfin 11 December 2025
Is the training cutoff date known?
MagicMoonlight 11 December 2025
They’re definitely just training the models on the benchmarks at this point
willahmad 11 December 2025
are we doomed yet?

Seems not yet with 5.2

rl_shannon 12 December 2025
Isn't it delusional to only compare your models against your own previous variants? Where is an actual comparison with Google, Anthropic, OSS Models
jrflowers 11 December 2025
OpenAI is really good at just saying stuff on the internet.

I love the way they talk about incorrect responses:

> Errors were detected by other models, which may make errors themselves. Claim-level error rates are far lower than response-level error rates, as most responses contain many claims.

“These numbers might be wrong because they were made up by other models, which we will not elaborate on, also these numbers are much higher by a metric that reflects how people use the product, which we will not be sharing“

I also really love the graph where they drew a line at “wrong half of the time” and labeled it ‘Expert-Level’.

10/10, reading this post is experientially identical to watching that 12 hours of jingling keys video, which is hard to pull off for a blog.

scottndecker 11 December 2025
Still 256K input tokens. So disappointing (predictable, but disappointing).
stainablesteel 11 December 2025
im happy for this, but there's all these math and science benchmarks, has anyone ever made a communicates-like-a-human benchmark? or an isn't-frustrating-to-talk-with benchmark?
johndill 11 December 2025
Did Calmmy Sammy that his is the version that will finally cure cancer? The AI shakeout in the AI industry is going to be brutal. Can't see how Private Equity is going to get the little guy to be left holding the giant bag of excrement, but they will figure that out. AI, smart enough to replace you, but not quite smart enough the replace the CEO or Hedge Fund Bros.
iwontberude 11 December 2025
I have already cancelled. Claude is more than enough for me. I don’t see any point in splitting hairs. They are all going to keep lying more and more sneakily.
slackr 11 December 2025
“…where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations.”

What a sociopathic way to sell

Croftengea 11 December 2025
Is this another GPT-4.5?
johnwheeler 11 December 2025
I'm not interested in using OpenAI anymore because Sam Altman is so untrustworthy. All you see on X.com is him and Greg Brockman kissing David Sacks' ass, trying to make inroads with him, asking Disney for investments, and shit. Are you kidding? Who wants to support these clowns? Let's let Google win. Let's let Anthropic win. Anyone but Sam Altman.
jaimex2 11 December 2025
They just keep flogging that dead horse.

The winner in this race will be whoever gets small local models to perform as well on consumer hardware. It'll also pop the tech bubble in the US.

TechDebtDevin 11 December 2025
$168.00 / 1M ouput tokens is hilarious for their "Pro". Can't wait to here all the bitching from orgs next month. Literally the dumbest product of all time. Do you people seriously pay for this?
meetpateltech 11 December 2025
orliesaurus 11 December 2025
I told all my friends to upgrade or they're not my friends anymore /s
HackerThemAll 11 December 2025
No, thank you, OpenAI and ChatGPT doesn't cut it for me.
bluerooibos 11 December 2025
Yawn.
impulser_ 11 December 2025
The thing about OpenAI is their models never fit anywhere for me. Yes they maybe smart or even the smartest models but they are alway so fucking slow. The ChatGPT web app is literally usable for me. I ask simple task and it does most extreme shit jsut to get an answer that the same as Claude or Gemini.

For example, I asked ChatGPT to take a chart and convert into a table. It went and cut up the image and zoomed in for literally 5 mins to get the a worst answer than Claude which did it in under a minute.

I see people talk about Codex like it better than Claude Code, and I go and try it and it takes a lifetime to do thing and it return maybe an on par result as Opus or Sonnet but it takes 5mins longer.

I just tried out this model and it the same exact thing. It just take ages for it to give you an answer.

I don't get how these models are useful in the real world.

What am I missing, is this just me?

I guess it truly an enterprise model.

HackerThemAll 11 December 2025
No, thank you, OpenAI and ChatGPT doesn't cut it for me.
airstrike 11 December 2025
I feel like if we're going to regulate anything about AI, we should start by regulating (1) what they get to claim to be a "new model" to the public and (2) what changes they are allowed to make at inference before being forced to name it something different.
_7u7v 11 December 2025
It baffles me to see these last 2 announcements (GPT 5.1 as well) devoid of any metrics, benchmarks or quantitative analyses. Could it be because they are behind Google/Anthropic and they don't want to admit it?

(edit: I'm sorry I didn't read enough on the topic, my apologies)

anishshil 11 December 2025
This shift toward new platforms is exactly why I’m building Truwol, a social experience focused on real, unedited human moments instead of the AI-saturated feeds we’re drifting toward. I’m developing it independently and sharing the progress publicly, so if you’re interested in projects reinventing online spaces from the ground up, you can see what I’m working on Truwol buymeacoffee/Truwol