What's important about this new type of image generation, done with tokens rather than with diffusion, is that it's effectively reasoning in pixel space.
Example: Ask it to draw a notepad with an empty tic-tac-toe grid, then tell it to make the first move, then you make a move, and so on.
You can also do very impressive information-conserving translations, such as changing the drawing style, but also stuff like "change day to night", or "put a hat on him", and so forth.
I get the feeling these models are quite restricted in resolution, and that more work in this space will let us do really wild things, such as asking a model to create an app step by step, first completely in images (essentially designing the whole app, text and all), then writing the code to reproduce it. It also means that a model can take over from a really good diffusion model, so even if its own original generations are not good, it can continue "reasoning" on an external image.
Finally, once these models become faster, you can imagine a truly generative UI, where the model produces the next frame of the app you are using based on events sent to the LLM (which can still do all the normal things like using tools, thinking, etc.). However, I also believe that diffusion models can do some of this in a much faster way.
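A minimal sketch of what that generative-UI event loop might look like, assuming a purely hypothetical FrameModel interface (nothing like this exists in today's APIs); the point is only that every UI event round-trips through the model, which returns the next frame as an image:

```python
import io
from typing import Callable, Optional, Protocol

from PIL import Image


class FrameModel(Protocol):
    """Hypothetical interface: given the previous frame and a UI event,
    the model returns the next frame of the app as encoded image bytes."""
    def generate_frame(self, previous_frame: Optional[bytes], event: dict) -> bytes: ...


def generative_ui_loop(model: FrameModel,
                       next_event: Callable[[], dict],
                       show: Callable[[Image.Image], None]) -> None:
    """Every user event is sent to the model; the model 'renders' the next frame.
    Tool use, thinking, etc. would all happen inside generate_frame."""
    frame: Optional[bytes] = None
    while True:
        event = next_event()                    # e.g. {"type": "click", "x": 120, "y": 40}
        frame = model.generate_frame(frame, event)
        show(Image.open(io.BytesIO(frame)))
```

Whether this is practical obviously depends on per-frame latency, which is exactly the commenter's caveat about speed.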
I ran some of my relatively complex prompts through it, using pure text prompts as the de facto means of making adjustments to the images (in contrast to using something like img2img / inpainting / etc.).
I've just tried it and oh wow, it's really good. I managed to create a birthday invitation card for my daughter in basically one shot; it nailed exactly the elements and style I wanted. Then I asked it to retain everything but tweak the text to add more details about the date, venue, etc. And it did. I'm in shock. Previous models would not have been even halfway there.
OpenAI's livestream of GPT-4o Image Generation shows that it is slowwwwwwwwww (maybe 30 seconds per image, which Sam Altman had to spin as "it's slow, but the generated images are worth it"). Instead of using a diffusion approach, it appears to be generating the image tokens and decoding them akin to the original DALL-E (https://openai.com/index/dall-e/), which allows for streaming partial generations from top to bottom. In contrast, Google's Gemini can generate images and make edits in seconds.
No API yet, and given the slowness I imagine it will cost much more than the $0.03+/image of competitors.
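A toy illustration of the difference the comment above is pointing at: a diffusion model refines the whole canvas at every step, while an autoregressive model emits image tokens in raster order, so finished rows can be decoded and streamed while the rest is still being generated. The "tokenizer" here is a fake stand-in (one gray value per patch), not OpenAI's actual decoder:

```python
import numpy as np

# Toy sketch of autoregressive image generation: "tokens" arrive in raster
# order (top-left to bottom-right), so completed rows can be decoded and
# streamed to the user before the bottom of the image exists. A real model
# uses a learned tokenizer (e.g. a VQ-VAE); here a token is just a gray value.

H, W, PATCH = 32, 32, 8                       # 32x32 grid of 8x8-pixel patches

def sample_next_token(context: list[int]) -> int:
    """Stand-in for the model's next-token prediction."""
    rng = np.random.default_rng(len(context))
    return int(rng.integers(0, 256))

def decode_partial(tokens: list[int]) -> np.ndarray:
    """Decode however many tokens exist so far; the top rows fill in first."""
    img = np.zeros((H * PATCH, W * PATCH), dtype=np.uint8)
    for i, tok in enumerate(tokens):
        r, c = divmod(i, W)
        img[r * PATCH:(r + 1) * PATCH, c * PATCH:(c + 1) * PATCH] = tok
    return img

tokens: list[int] = []
for step in range(H * W):
    tokens.append(sample_next_token(tokens))
    if step % W == W - 1:                     # a full row is done: stream a partial decode
        partial_image = decode_partial(tokens)
```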
It's incredible that this was released 316 days after it was initially announced. I do appreciate the emphasis in the presentation on how this can be useful beyond just being a cool/fun toy, which is how most image generation tools seem to have functioned.
Was anyone else surprised how slow the images were to generate in the livestream? This seems notably slower than DALLE.
Is there any way to see whether a given prompt was serviced by 4o or Dall-E?
Currently, my prompts seem to be going to the latter still, based on e.g. my source image being very obviously looped through a verbal image description and back to an image, compared to gemini-2.0-flash-exp-image-generation. A friend with a Plus plan has been getting responses from either.
The long-term plan seems to be to move to 4o completely and move Dall-E to its own tab, though, so maybe that problem will resolve itself before too long.
The new model in the drop down says something like "4o Create Image (Updated)". It is truly incredible. Far better than any other image generator as far as understanding and following complex prompts.
I was blown away when they showed this many months ago, and found it strange that more people weren't talking about it.
This is much more precise than the Gemini one that just came out recently.
First AI image generator to pass the uncanny valley test? Seems like it. This is the biggest leap in image generation quality I've ever seen.
How much longer until an AI that can generate 30 frames with this quality and make a movie?
About 1.5 years ago, I thought AI would eventually allow anyone with an idea to make a Hollywood quality movie. Seems like we're not too far off. Maybe 2-3 more years?
This is really impressive, but the "Best of 8" tag on a lot of them really makes me want to see how cherry-picked they are. My three free images had two impressive outputs and one failure.
The whiteboard image is insane. Even if it took more than 8 to find it, it's really impressive.
To think that a few years ago we had dreamy pictures with eyes everywhere. And not long ago we were always identifying the AI images by the six-fingered people.
I wonder how well the physics is modeled internally. E.g. if you prompt it to model some difficult ray tracing scenario (a box with a separating wall and a light in one of the chambers which leaks through to the other chamber etc)?
Or if you have a reflective chrome ball in your scene, how well does it understand that the image reflected must be an exact projection of the visible environment?
Am I dumb, or is it that every time they release something I can never find out how to actually use it, and then I forget about it? Take this for instance: I wanted to try out their Newton example ("an infographic explaining Newton's prism experiment in great detail"), but it generated a very bad result. Maybe it's because I'm not using the right model? Every release of theirs is not really a release, it's like a trailer. Right?
It's very impressive. It feels like the text is a bit of a hack, where they're somehow rendering the text separately and interpolating it into the image. Not always: I got it to render calligraphy with flourishes, but only for a handful of words.
For example, I asked it to render a few lines of text on a medieval scroll, and it basically looked like a picture of a gothic font written onto a background image of a scroll.
I asked it to draw a map of the Balkans in Tolkien style, and it's actually really impressive: the geography is more or less completely correct. The borders and country locations are wrong, but it feels like something I could get it to fix.
ChatGPT Pro tip: In addition to video generation, you can use this new image gen functionality in Sora and apply all of your custom templates to it! I generated this template (using my Sora Preset Generator, which I think is public) to test reasoning and coherency within the image:
Theme: Educational Scientific Visualization – Ultra Realistic Cutaways
Color: Naturalistic palettes that reflect real-world materials (e.g., rocky grays, soil browns, fiery reds, translucent biological tones) with high contrast between layers for clarity
Camera: High-resolution macro and sectional views using a tilt-shift camera for extreme detail; fixed side angles or dynamic isometric perspective to maximize spatial understanding
Film Stock: Hyper-realistic digital rendering with photogrammetry textures and 8K fidelity, simulating studio-grade scientific documentation
Lighting: Studio-quality three-point lighting with soft shadows and controlled specular highlights to reveal texture and depth without visual noise
Vibe: Immersive and precise, evoking awe and fascination with the inner workings of complex systems; blends realism with didactic clarity
Content Transformation: The input is transformed into a hyper-detailed, realistically textured cutaway model of a physical or biological structure—faithful to material properties and scale—enhanced for educational use with visual emphasis on internal mechanics, fluid systems, and spatial orientation
Examples:
1. A photorealistic geological cutaway of Earth showing crust, tectonic plates, mantle convection currents, and the liquid iron core with temperature gradients and seismic wave paths.
2. An ultra-detailed anatomical cross-section of the human torso revealing realistic organs, vasculature, muscular layers, and tissue textures in lifelike coloration.
3. A high-resolution cutaway of a jet engine mid-operation, displaying fuel flow, turbine rotation, air compression zones, and combustion chamber intricacies.
4. A hyper-realistic underground slice of a city showing subway lines, sewage systems, electrical conduits, geological strata, and building foundations.
5. A realistic cutaway of a honeybee hive with detailed comb structures, developing larvae, worker bee behavior zones, and active pollen storage processes.
I think the biggest problem I still see is the model's awareness of the images it has generated itself.
The glaring issue with the older image generators was how they would proudly proclaim to have presented an image matching a description that had almost no relation to the image actually provided.
I'm not sure if this update improves on this aspect. It may create the illusion of awareness of the picture by having better prompt adherence.
I enjoy trying to break these models. I come up with prompts that are uncommon but valid. I want to see how well they handle data not in their training set. For image generation I like to use “Generate an image of a woman on vacation in the Caribbean, lying down on the beach without sunglasses, her eyes open.”
The real test for image generators is the image -> text -> image conversion. In other words, it should be able to describe an image with words and then use those words to recreate the original image with high accuracy. The text representation of the image doesn't have to be English; it can be a program, e.g. a shader, that draws the image. I believe in 5-10 years it will be possible to give this tool a picture of a rainforest, tell it to write a shader that draws the forest, and tell it to add Avatar-style flying rocks. Instead of these silly benchmarks, we'll read headlines like "GenAI 5.1 creates a 3D animation of a photograph of Niagara Falls in 3 seconds, in less than 4 KB of code that runs at 60fps".
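A rough sketch of that round-trip test, with `describe` (image to text) and `render` (text to image) left as placeholders for whatever captioner and generator are under test, and a crude pixel MSE standing in for a real perceptual similarity metric:

```python
from typing import Callable

import numpy as np
from PIL import Image


def round_trip_score(original: Image.Image,
                     describe: Callable[[Image.Image], str],
                     render: Callable[[str], Image.Image]) -> float:
    """Image -> text -> image: how much of the original survives the round trip?
    Higher is better; 1.0 would mean a pixel-perfect reconstruction."""
    caption = describe(original)                  # could be English, a shader, a program...
    reconstruction = render(caption).resize(original.size)
    a = np.asarray(original.convert("RGB"), dtype=np.float32)
    b = np.asarray(reconstruction.convert("RGB"), dtype=np.float32)
    mse = float(np.mean((a - b) ** 2))
    return 1.0 / (1.0 + mse)
```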
> ChatGPT’s new image generation in GPT‑4o rolls out starting today to Plus, Pro, Team, and Free users as the default image generator in ChatGPT, with access coming soon to Enterprise and Edu. For those who hold a special place in their hearts for DALL·E, it can still be accessed through a dedicated DALL·E GPT.
> Developers will soon be able to generate images with GPT‑4o via the API, with access rolling out in the next few weeks.
That's it, folks. Tens of thousands of so-called "AI" image generator startups have been obliterated, taking digital artists with them, all reduced to near zero.
Now you have a widely accessible meme generator with the name "ChatGPT".
The only thing left is for an open-weight model to come along that competes with this, is faster, and is completely free.
Is anyone else getting wild rejections on content policy since this morning? I spent about 20 minutes trying to get it to turn my zoo photos into cartoons and could not get a single animal picture past the content moderation....
Even when I told it to transform it into a text description, then draw that text description, my earlier attempt at a cat picture meant that the description was too close to a banned image...
I can't help but feel like OpenAI and Grok are at unhelpful polar opposites when it comes to moderation.
Really liked the fact that the team shared all the shortcomings of the model in the post. Sometimes products just highlight the best results and aren't forthcoming about areas that need improvement. Kudos to the OpenAI team on that.
I wanted to use this to generate funny images of myself. Recently I was playing around with Gemini Image Generation to dress myself up as different things. Gemini Image Generation is surprisingly good, although the image quality quickly degrades as you add more changes. Nothing harmful, just silly things like dressing me up as a wizard or other typical RPG roles.
Trying out 4o image generation... It doesn't seem to support this use case at all? I gave it an image of myself and asked it to turn me into a wizard, and it generated something that doesn't look like me in the slightest. On a second attempt, I asked it to add a wizard hat, and it just used Python to add a triangle in the middle of my image. I looked at the examples and saw they had a direct image modification where they say "Give this cat a detective hat and a monocle", so I tried that with my own image ("Give this human a detective hat and a monocle") and it just gave me this error:
> I wasn't able to generate the modified image because the request didn't follow our content policy. However, I can try another approach—either by applying a filter to stylize the image or guiding you on how to edit it using software like Photoshop or GIMP. Let me know what you'd like to do!
Overall, a very disappointing experience. As another point of comparison, Grok also added image generation capabilities and while the ability to edit existing images is a bit limited and janky, it still manages to overlay the requested transformation on top of the existing image.
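For what it's worth, the "triangle in the middle of my image" fallback described in the comment above is presumably something along these lines, i.e. a programmatic overlay rather than an actual generated edit. This is a guess at the kind of code the model wrote, not a reconstruction of it, and the file names are placeholders:

```python
from PIL import Image, ImageDraw

# Crude "wizard hat" overlay: draw a triangle above the center of the photo.
# This is the kind of programmatic edit the comment describes, as opposed to
# the model regenerating the image with a hat actually on the subject.
img = Image.open("selfie.jpg").convert("RGB")
w, h = img.size
draw = ImageDraw.Draw(img)
hat = [(w * 0.5, h * 0.1), (w * 0.35, h * 0.35), (w * 0.65, h * 0.35)]
draw.polygon(hat, fill=(60, 40, 120))
img.save("selfie_with_hat.png")
```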
Iterations are the missing link.
With ChatGPT, you can iteratively improve text (e.g., "make it shorter," "mention xyz"). However, for pictures (and video), this functionality is not yet available. If you could prompt iteratively (e.g., "generate a red car in the sunset," "make it a muscle car," "place it on a hill," "show it from the side so the sun shines through the windshield"), the tools would become exponentially more useful.
I'm looking forward to trying this out and seeing if I was right. Unfortunately, it's not yet available for me.
Am I the only one immediately looking past the amazing text generation, the excellent direction following, the wonderful reflection, and screaming inside my head, "That's not how reflection works!"
I know it's super nitpicky when it's so obviously a leap forward on multiple other metrics, but still, that reflection just ain't right.
Edit: Please ignore. They hadn't rolled the new model out to my account yet. The announcement blog post is a bit misleading saying you can try it today.
For the first time ever, it feels like it listens and actually tries to follow what I say. I managed to get a good photo of a dog on the beach with shoes, from a side angle, by consistently prompting it and making small changes from one image to another till I got my intended effect.
I created an app to generate image prompts specifically for 4o. Geared towards business and marketing. Any feedback is welcome. https://imageprompts.app/
It does extremely well at creating images of copyrighted characters. DALL-E couldn't generate images of Miffy; this one can. Same for "Kikker en vriendjes" ("Frog and Friends"), a Dutch children's book. There seems to be no copyright protection at all?
Just curious if it works for creating a comic strip? I.e., will it maintain the consistency of the characters? I watched a video somewhere where they demoed it creating comic panels, but I want to create the panels one by one.
> I wasn’t able to generate the image because the combination of abstract elements and stylistic blending [...] may have triggered content filters related to ambiguous or intense visuals.
So what's the lore with why this took over a _year_ to launch from the first announcement? It's fairly clear that their hand was forced by Google quietly releasing this exact feature a few weeks back, though.
It's pretty good; the interesting thing is that when it fails, it often seems to be able to reason about what went wrong. So when we get CoT scaffolding for this, it'll be incredibly competent.
I would love to see advancement in the pixel art space: specifying 64x64 pixels and attempting to make game-ready pixel art and even animations, or taking a reference image and creating a 64x64 version.
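As a trivial baseline for that last part (reference image to 64x64), naive downscaling plus palette quantization with PIL gets you something pixel-art-shaped, though nothing like a real game-ready sprite; the file names are placeholders:

```python
from PIL import Image

# Naive "pixel art" baseline: shrink to a 64x64 grid with nearest-neighbor
# sampling, then quantize to a small palette. A model that actually understands
# pixel art would redraw the subject, not just downsample it.
src = Image.open("reference.png").convert("RGB")
sprite = src.resize((64, 64), resample=Image.NEAREST).quantize(colors=16)
sprite.save("sprite_64x64.png")

# Scale back up (still nearest-neighbor) so the individual pixels stay crisp.
sprite.convert("RGB").resize((512, 512), resample=Image.NEAREST).save("preview.png")
```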
So did they deprecate the ability to use DALL-E 3 to generate images? I asked the legacy ChatGPT 4 model to generate an image and it used the new 4o style image generator.
EDIT: Seems not, "The smallest image size I can generate is 1024x1024. Would you like me to proceed with that, or would you like a different approach?"
I tried a few of the prompts and the results I see are far worse than the examples provided. Seems like there will be some room for artists yet in this brave new world.
It bothers me to see links to content that requires a login. I don't expect OpenAI or anyone else to give their services away for free. But I feel like "news" posts that require one to set up an account with a vendor are bad faith.
If the subject matter is paywalled, I feel that the post should include some explanation of what is newsworthy behind the link.
Not a criticism, but it stands out how all the researchers and employees in these videos are non-native English speakers (i.e., not American).
Nothing wrong with that, on the contrary, it just seems odd that the only American is Altman.
Same thing with the last videos from Zuck, if I recall correctly.
Especially in this Trump era of MAGA.
The periodic table poster under "High binding problems" is billed as evidence of model limitations, but I wonder if it just suggests that 4o is a fan of "Look Around You".
I wish AI companies would release new things once a year, like at CES or how Apple does it. This constant stream of releases and announcements feels like it's just for attention.
...Once the wait time is up, I can generate the corrected version with exactly eight characters: five mice, one elephant, one polar bear, and one giraffe in a green turtleneck. Let me know if you'd like me to try again later!
Similar to regular LLM plagiarism, it's pretty obvious that visual artifacts like the loadout screen for the RPG cat (under the video game heading), which is clearly inspired by Diablo, aren't unique at all and are just the result of other people's efforts and livelihoods.
Garbage compared to Midjourney. I don't even know why you'd market this. It takes a minute or more, and the results look like what Midjourney produced 1.5 years ago.
It's definitely impressive, though once again it fell flat on the ability to render a 9-pointed star: https://mordenstar.com/blog/chatgpt-4o-images
Then Google:
> Gemini 2.5: Our most intelligent AI model
> Introducing Gemini 2.0 | Our most capable AI model yet
I could go on forever. I hope this trend dies and Apple starts using something effective so all the other companies can start copying a new lexicon.
I'm not saying it's not true; it's just "wait and see" before you take their word as gold.
I think Microsoft's claim about their quantum computing breakthrough is the latest form of this.
"Generate a photo of a lake taken by a mobile phone camera. No hands or phones in the photo, just the lake."
The hand holding a phone is always there :D
One area where it does not work well at all is modifying photographs of people's faces.* Completely fumbles if you take a selfie and ask it to modify your shirt, for example.
* = unless the people are in the training set
https://news.ycombinator.com/item?id=42628742
The new one can.
https://chatgpt.com/share/67e36dee-6694-8010-b337-04f37eeb5c...
It's much better than prior models, but still generates hands with too many fingers, bodies with too many arms, etc.
In the coming days, people will turn all sorts of images into anime, for example historical images: https://x.com/keysmashbandit/status/1904764224636592188
Otherwise impressive.
I think it is too biased toward using heuristics discovered in the first response and applying the same level of compute to subsequent requests.
It makes me kind of want to rewrite an interface that builds appropriate context and starts new chats for every request issued.
Comparison with Leonardo.Ai.
ChatGPT: https://chatgpt.com/share/67e2fb21-a06c-8008-b297-07681dddee...
ChatGPT again (direct one shot): https://chatgpt.com/share/67e2fc44-ecc8-8008-a40f-e1368d306e...
ChatGPT again (using word "photorealistic instead of "photo"): https://chatgpt.com/share/67e2fce4-369c-8008-b69e-c2cbe0dd61...
Leonardo.Ai Phoenix 1.0 model: https://cdn.leonardo.ai/users/1f263899-3b36-4336-b2a5-d8bc25...
nah. i pass and stick with midjourney.
How easy is this to remove? Is it just like EXIF data that can be easily stripped out, or is it baked in more permanently somehow?
I couldn't find anything on the pricing page.
Sorry, but how are these useful? None of the examples demonstrate any use beyond being cool to look at.
The article vaguely mentions 'providing inspiration' as a possible definition of 'useful'. I suppose.
And I hope that people who worked on this know this. They are pure evil.
https://imgur.com/a/aS8e0UY
so much fun.
Of course 4.5 is best, but it's slow and I'm afraid I'm going to hit limits.
Was it public information when Google was going to launch their new models? Interesting timing.