I have no idea how it actually works (at Google), but I wouldn't be surprised if it was just post-training, because the RWKV people recently did something similar: they replaced the whole attention mechanism with WKV (forward-only linear attention) and created such a Frankenstein just by post-training: https://substack.recursal.ai/p/qwerky-72b-and-32b-training-l...
The big wow moment about that is that it sort of implies that most of the useful knowledge is in the FFN, and attention itself is not that unique/important.
BTW: it could also be interesting to reuse already-trained attention and see how long the FFN itself takes in the GPT-2 speedrun (it would be against the rules, but still very interesting IMHO; definitely something I'd like to read a paper about): https://github.com/KellerJordan/modded-nanogpt
Also, I read yesterday that at some point the embeddings across all of the models become (very) comparable/similar, and that a simple converter can be trained between them. If both of these statements are true, maybe we could train everything much faster just by sharing fixed embeddings and attention.
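If embedding spaces really are roughly comparable, the "simple converter" could be as small as a single linear map. A toy sketch of fitting one with least squares on random stand-in embedding matrices (everything below is synthetic; it only shows the shape of the idea):

    import numpy as np

    # Hypothetical stand-ins: 1,000 shared tokens embedded by two different models.
    rng = np.random.default_rng(0)
    emb_a = rng.normal(size=(1000, 256))           # "model A" embeddings, 256-dim
    true_map = rng.normal(size=(256, 512)) * 0.1   # pretend the spaces are linearly related
    emb_b = emb_a @ true_map + rng.normal(scale=0.01, size=(1000, 512))  # "model B", 512-dim

    # Fit the converter: a single linear layer, solved in closed form.
    W, *_ = np.linalg.lstsq(emb_a, emb_b, rcond=None)

    # Converted embeddings land close to model B's own embeddings.
    err = np.linalg.norm(emb_a @ W - emb_b) / np.linalg.norm(emb_b)
    print(f"relative error: {err:.3f}")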
I still feel like the best use of models we've seen to date is for brand-new code and quick prototyping. I'm less convinced of the strength of their capabilities for improving on large preexisting content over which someone has repeatedly iterated.
Part of that is because, by definition, models cannot know what is not in a codebase and there is meaningful signal in that negative space. Encoding what isn't there seems like a hard problem, so even as models get smarter, they will continue to be handicapped by that lack of institutional knowledge, so to speak.
Imagine giving a large codebase to an incredibly talented developer and asking them to zero-shot a particular problem in one go, with only moments to read it and no opportunity to ask questions. More often than not, a less talented developer who is very familiar with that codebase will be able to add more value with the same amount of effort when tackling that same problem.
Is anyone else totally blown away by this? I feel like it's easily the biggest announcement out of I/O, but it's been overshadowed by Veo 3 etc.
Diffusion models for code generation are a big deal. If they are using transformers this would likely fall into the DiT bucket (diffusion transformers). I had previously worked on use cases that leveraged U-Net diffusion several years ago and there was quite a bit of interest in hybrid models. I expect to see further leaps in the diffusion space in the near future.
I think the lede is being buried. This is a great and fast InstructGPT. This is absolutely going to be used in spell checks, codemods, and code editors.
The instant edits feature can surgically perform fast text edits without all the extra fluff or unsolicited enhancements.
I copied some shadertoys, asked it to rename all the variables to be more descriptive, and pasted the result back to see it still working. I'm impressed.
This is because it can edit and doesn't suffer from early token bias.
I have been wondering about the use of diffusion techniques for text generation, it is nice to see Google release a model that, seemingly, validates some thoughts I had.
Most folks I have seen experimenting with AI are either using a paid service or running high-grade hardware (even if consumer-level). The best I have in my current repertoire is a 5700 XT, and I am not able to upgrade from that yet. The limitation, though, has at least given me some significant insights into the shortcomings of current models.
Model sizes have gotten quite large and coherence seems to mostly have scaled with the density of a model, leaving the smaller models useful for only smaller tasks. Context size is also extremely important from my experiments with long-running dialogues and agent sessions, but a smaller GPU simply cannot fit a decent model and enough context at the same time. I do wonder if diffusion techniques will allow for a rebalancing of this density-to-coherence connection, letting smaller models produce chunks of coherent text even if limited by context. From my viewpoint it seems it will. Mixed tool call + response outputs also have the potential to be better.
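For a rough sense of why context competes with the model itself for VRAM: the KV cache alone grows linearly with context length. A back-of-the-envelope sketch with assumed dimensions for a generic ~7B-class model (all numbers are made up, not from any specific model):

    # Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes per element.
    layers, kv_heads, head_dim = 32, 8, 128
    bytes_per_elem = 2  # fp16

    def kv_cache_gib(seq_len: int) -> float:
        return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

    for ctx in (4_096, 32_768, 131_072):
        print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):5.2f} GiB of KV cache on top of the weights")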
Speed is another problem I, and everyone else, have had with modern LLMs. The nature of cycling over the input with one new additional output token each time is time consuming, and on an older GPU with no AI-specific hardware it is an eternity! Being able to at least track a 0-100% progress state would be an improvement over the current situation, where one must simply wait for the LLM to decide to stop (or hit the maximum number of inference tokens). I am hopeful that, even on lower-end GPUs, a diffusion model will perform slightly better.
This raises several questions, though. If we are processing noise, where does the noise come from? Is there a good source of noise for LLMs/text specifically? Is the entire block sized beforehand, or can responses have variable length?
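On the noise question: in discrete text diffusion the "noise" is usually not Gaussian at all but token corruption, most commonly replacing tokens with a [MASK] symbol at a rate set by the timestep, and the output block length is typically fixed up front. A toy sketch of that forward corruption process (made-up sentence, no real model involved):

    import random

    random.seed(0)
    MASK = "[MASK]"
    tokens = "the quick brown fox jumps over the lazy dog".split()

    def corrupt(tokens, noise_level):
        """Mask each token independently with probability `noise_level`."""
        return [MASK if random.random() < noise_level else t for t in tokens]

    for level in (0.25, 0.5, 0.9, 1.0):  # later timestep -> more noise
        print(f"noise={level:.2f}: {' '.join(corrupt(tokens, level))}")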
I am so excited about diffusion language models. They may be the piece we need to make our voice-to-code game mechanic be as smooth as we envision it.
Cerebras and Groq are amazing, but the fact that they use custom hardware really limits the ability to finetune or scale. The other route would be an MoE that has barely 0.5b parameters active, but that would be a major undertaking that we can't prioritize at the moment.
---
If anyone at Google/Deepmind reads this, please give us API access.
We are building generative sandbox games. First title is a monster trainer where you get to actually command your creature in realtime, here is an early prototype: https://youtu.be/BOwpLyj2Yqw
> Google's first LLM to use diffusion in place of transformers.
But this is a wrong statement, and Google never made it. You can have Transformer diffusion models; actually, Transformers are very standard for all of the discrete diffusion language models, so I would expect Gemini Diffusion also uses Transformers.
Edit: Ah sorry, I missed that this was already addressed, also linked in the post: https://news.ycombinator.com/item?id=44057939 Maybe my remaining post is still useful to some.
The difference is, it's an encoder-only Transformer, and not a decoder-only Transformer. I.e. it gets fed in a full sequence (but noisy/corrupted), and it predicts the full correct sequence. And then you can iterate on that. All frames in the sequence can be calculated in parallel, and if you need only a few iterations, this is faster than the sequential decoding in decoder-only models (although speculative decoding also gets you some speedup for similar reasons). Those discrete diffusion models / encoder-only Transformers are usually trained with BERT-like masking, but that's actually an active field of research. It's really a pity that they don't provide any details here (on training and modeling).
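A minimal illustration of that difference (toy code, not anything from Gemini): a decoder-only model lets position i attend only to positions <= i and produces one new token per pass, while the encoder-only denoiser attends over the whole corrupted sequence and predicts every position in the same forward pass:

    import numpy as np

    seq_len = 6

    # Decoder-only: causal mask, position i may attend only to positions <= i.
    causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

    # Encoder-only denoiser: full bidirectional attention over the noisy sequence.
    full_mask = np.ones((seq_len, seq_len), dtype=bool)

    print("causal (autoregressive decoding):")
    print(causal_mask.astype(int))
    print("bidirectional (all positions predicted in parallel):")
    print(full_mask.astype(int))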
I wonder how this relates to Gemini. Does it use the same modeling? Was the model checkpoint even imported from Gemini, and then further finetuned for discrete diffusion? Or knowledge distillation? Or is it just branding?
I have access to it and, my god, it is fast. One bad thing about this model is that it is easily susceptible to prompt injection: I asked for a recipe for a drug, it refused, then I asked it to roleplay as a child and it gave real results.
Other than that, I can see myself using this model. With that speed plus an agentic approach, this model can really shine.
This is insanely fast. My guess is that the tradeoff here is that the GPUs will always be working at max capacity and there will be minimal compute savings from batching, which I realize now is not really a tradeoff.
My only worry is that the diffusion objective will be worse than AR in terms of model capabilities; if that's the case, hopefully multi-token AR models will perform as well as diffusion, or we can use this as a draft model for speculative decoding.
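On the draft-model fallback: a toy sketch of the simplest (greedy) form of speculative decoding, where a fast draft proposes a few tokens and the slower target verifies them and keeps the longest agreeing prefix. Both "models" here are stand-in functions, not real LLM calls:

    VOCAB = list("abcde")

    def draft_next(ctx):   # stand-in for the fast draft model (e.g. a diffusion drafter)
        return VOCAB[sum(map(len, ctx)) % len(VOCAB)]

    def target_next(ctx):  # stand-in for the slow, high-quality target model
        return VOCAB[(sum(map(len, ctx)) + len(ctx) // 2) % len(VOCAB)]

    def speculative_step(ctx, k=4):
        # 1) The draft proposes k tokens autoregressively (cheap passes).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(ctx + proposal))
        # 2) The target checks every position (a single parallel pass in a real system).
        accepted = []
        for i, tok in enumerate(proposal):
            target_tok = target_next(ctx + proposal[:i])
            accepted.append(target_tok)
            if target_tok != tok:   # first mismatch: keep the target's token and stop
                break
        return ctx + accepted

    ctx = list("ab")
    for _ in range(6):
        ctx = speculative_step(ctx)
    print("".join(ctx))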
The speed at which this can build makes me think software will soon become a lot more fluid than our traditional iterative approach allows. Apps could ship minimal and build whatever else they need at the user's behest.
Just the idea of generating text by removing noise is so profound. Maybe each step is a level of hierarchy. Linguists must be astonished at the things happening these past years. I have to read more about it.
> Now, diffusion LMs take this idea further. BERT can recover 15% of masked tokens ("noise"), but why stop here. Let's train a model to recover texts with 30%, 50%, 90%, 100% of masked tokens.
> Once you've trained that, in order to generate something from scratch, you start by feeding the model all [MASK]s. It will generate you mostly gibberish, but you can take some tokens (let's say, 10%) at random positions and assume that these tokens are generated ("final")
This is clearly wrong. If you actually froze 10% of gibberish tokens, your output would be terrible!
What you actually do in discrete statespace diffusion (see e.g. [1]) is to allow every token to change at every time step.
You combine this with a "schedule" that allows the model to know how close it is to being done. E.g. at t=0/20 the changes will be large, and at t=19/20 only small refinements are made.
Update: there is actually a kind of model that "greedily" freezes the top-p most confident tokens, similar to what the blog post describes (though not at random!); this is called MaskGIT [2], but it is not a diffusion model and doesn't work as well.
Btw, you can also just use "continuous diffusion" with a transformer/BERT model where you've removed the top softmax layer. Then everything works as normal with Gaussian noise, and you just apply the softmax at the final time step.
[1] https://arxiv.org/abs/2107.03006
[2] https://arxiv.org/abs/2202.04200
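A toy sketch of that predict-then-renoise loop with a stand-in "denoiser", just to show the schedule's role: every position gets re-predicted at every step, and the amount that gets re-corrupted shrinks as the schedule approaches the end. (This is an illustration of the idea, not the actual sampler from [1].)

    import random

    random.seed(0)
    MASK = "[MASK]"
    TARGET = "the cat sat on the mat".split()   # pretend this is what the model "wants" to say
    STEPS = 20

    def denoise(noisy):
        """Stand-in for a trained denoiser: proposes a full clean sequence.
        It usually keeps existing tokens but may still revise any position."""
        out = []
        for pos, tok in enumerate(noisy):
            if tok != MASK and random.random() < 0.9:
                out.append(tok)
            else:
                out.append(TARGET[pos] if random.random() < 0.7 else random.choice(TARGET))
        return out

    def noise_level(step):
        """Linear schedule: heavy corruption early, none at the final step."""
        return 1.0 - step / (STEPS - 1)

    x = [MASK] * len(TARGET)                     # start from pure "noise"
    for step in range(STEPS):
        prediction = denoise(x)                  # every token can change at every step
        level = noise_level(step + 1) if step + 1 < STEPS else 0.0
        x = [MASK if random.random() < level else tok for tok in prediction]

    print(" ".join(x))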
I feel like diffusion would be much more useful for code if it could only mark tokens as "valid" when they pass code checks. It could then be thought of as adding "semantic chunks" instead of just words. I'm not sure how to validate this, as some additions will always result in invalid code. You could argue that running tests and linters is the same thing, but I think generations could be validated much more often with diffusion models.
Example: you remove some function, you also remove all uses of it; you can't use a variable that doesn't exist, etc. This could be trained on well-committed git repos, or by stalking/stealing the work of developers via their editors.
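A toy sketch of the validation idea, using Python's own compile() as a stand-in checker and hand-written "denoising steps" (a real setup might run a linter, a type checker, or the test suite between steps, and the candidates would come from the model):

    def passes_checks(source: str) -> bool:
        """Stand-in validator: here just 'does it parse'."""
        try:
            compile(source, "<candidate>", "exec")
            return True
        except SyntaxError:
            return False

    # Pretend these are successive denoising steps of the same code block.
    candidates = [
        "def area(r):\n    return 3.14159 * r *",    # still noisy: doesn't parse
        "def area(r):\n    return 3.14159 * r * r",  # parses: safe to accept
    ]

    accepted = None
    for step, cand in enumerate(candidates):
        ok = passes_checks(cand)
        print(f"step {step}: {'valid' if ok else 'invalid'}")
        if ok:
            accepted = cand  # only freeze a chunk once it passes the checks

    print("accepted:")
    print(accepted or "<nothing>")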
It makes one wonder what intelligence really is. The more I think about it the more I feel that speed is a fundamental unit of intelligence, with the other being some simple computation unit. As in, intelligence = speed * simple computation.
If you look around, it is the ability to iterate that drives innovation (and thereby provides evidence of "intelligence"). LLMs in industry are more useful, and more used, the faster they are.
Can it finally work with a large codebase? I have a Google Play / App Store app coded in Xcode, in C#, ported to Java, with a Python server. The codebase is large and expansive: it includes web support, server, client, etc. Will this "Gemini Diffusion" finally allow me to use an AI agent to code instead of hiring a programmer? Is there a tool that could help me as of today?
Fast, you gotta go fast. Let me draw the roadmap of this line of thinking:
- Let's start by the traditional autoregressive LLM, where one token at a time is generated. It's a fundamentally sequential process which maps well to the sequential nature of writing as it goes.
- Then, to make generation go faster, you try to generate multiple tokens in one pass, parallelizing the sequential process with things like "lookahead decoding".
- (<-- We are here) Then you realize that if your model isn't writing as it goes but rather forming an idea and pushing it out all at once, you can instead use a diffusion model to generate the whole response, allowing it a number of diffusion-step edits to clean up the errors that occurred during generation. Conceptually, if the number of diffusion steps equals the length of the token sequence to generate, the diffusion process could generate tokens one at a time just like an autoregressive LLM does; usually around 100 diffusion steps is a good starting point. (See the sketch after this list.)
- Now the goal is to reduce the number of diffusion steps to cut computation cost. The diffusion literature is already well developed, and in the image/video domain it was shown that you can reduce the number of steps to one or two (albeit with some quality reduction) with techniques like "consistency models".
- Now that you only have a single diffusion step, you realize you need to get your speed-ups elsewhere. You explore the literature and realize you can apply a trick you have already applied once, one more time: compressing a few tokens into one, like you compressed multiple characters into one token. This reduces the length of the token sequence you need to generate by a factor of 4, at the price of an additional decoding step, which can be either some form of "latent" encoding or some form of "hierarchical" encoding. So now you are consistency-diffusing sentence vectors, which are then decoded into token sequences. With each step being smaller and the transformer being quadratic, the total speed-up is roughly a factor of 10. Applying this trick multiple times gives diminishing returns, which you can partially compensate for by increasing memory use (a bigger "vocabulary" dictionary size).
- To go faster still, you now have to dig into the internals of the transformer itself. You suddenly realize it is just a residual network applied "number of layers" times; being a residual network, the goal of this sequence of internal steps is to progressively refine the input into the output. But that is the same property that let you go from many diffusion steps to a single one, so you can compress your stack of layers into a single (bigger, to keep capacity) layer and let the diffusion correct the mistakes.
- Now that you have a single-layer transformer consistency model generating sentence vectors, you realize that transformers use multiple heads to explore the space more efficiently, but that once training is done you can get by with a single head, gaining another 10x reduction in computation along the way.
- Taking a step back, you realize that your transformer is now just doing a nearest-neighbor search and mixing the outputs, but in a brute-force fashion. So you replace it with an approximate nearest-neighbor search, like an HNSW vector database, decoupling computation from capacity and allowing you to scale up by trading space for time.
- But because Hierarchical Navigable Small World indexes are just graphs under the hood, you realize that you have reinvented the Good Old-Fashioned AI graph-database ontology, only in an emergent fashion, with the graph implicitly defined by a vector distance in a semantic space constructed so that it is easy to generate text once decoded appropriately.
- So now you only need to make your database explainable by mapping it into human-understandable labels, and you reach the grail: SQL.
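The sketch referenced above: the structural difference between the two decoding loops, with stand-in "models" (random token pickers, purely illustrative). The autoregressive loop pays one forward pass per token; the diffusion loop pays one pass per refinement step, regardless of output length:

    import random

    random.seed(0)
    VOCAB = "the cat sat on a mat and slept".split()
    LENGTH, DIFFUSION_STEPS = 16, 8

    def ar_model(prefix):                 # stand-in: predicts one next token per pass
        return random.choice(VOCAB)

    def denoiser(sequence):               # stand-in: refines every position at once
        return [random.choice(VOCAB) if tok is None or random.random() < 0.3 else tok
                for tok in sequence]

    # Autoregressive decoding: LENGTH sequential forward passes.
    ar_out = []
    for _ in range(LENGTH):
        ar_out.append(ar_model(ar_out))

    # Diffusion-style decoding: DIFFUSION_STEPS passes, each over the whole block.
    diff_out = [None] * LENGTH
    for _ in range(DIFFUSION_STEPS):
        diff_out = denoiser(diff_out)

    print(f"autoregressive: {LENGTH} forward passes for {LENGTH} tokens")
    print(f"diffusion:      {DIFFUSION_STEPS} forward passes for {LENGTH} tokens")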
I guess autoregressive LLMs can be fine-tuned (or continually pretrained) to do inference using diffusion. We've seen a recent paper (which I don't remember) training from scratch, but that seems like overkill. Does Google say how they did it?
Also, does diffusion have the potential to increase the speed of CPU-only inference?
So thinking 5x faster, but lower quality (for now).
Anyone have experience or data on how lower model quality during thinking affects the performance of a higher quality model output? Like, is it worthwhile having lots of lower quality thinking that is then used by a higher quality model?
So what's it going to be in the end, diffusion or autoregression? After OpenAI (probably) released an autoregressive model for their image generator, I thought things might sway the other way.
This is super interesting, and obviously someone would have tried diffusion for text. But I will ask the obvious question: how does it know how many words, or even tokens, to fill in before it knows what the words will be? It would hamstring itself a lot of the time; can it edit the words later and create more space, or is it stuck with the token positioning the way it would be with parts of an image? It seems very strange. Usually words are composed in order, as AR models do it, because they follow a recursive grammar, and this is especially true of computer languages. This is a bit like Mad Libs, but madder. My question is: how could this possibly give better results than AR? It would need to perfectly converge on something with the right grammatical context and semantic meaning while predicting, early on, the number of tokens that will appear between words. It seems like there is a major impedance mismatch.
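On the length question, one common approach in the open literature (Gemini Diffusion's details aren't published, so this is an assumption about the general technique rather than a description of this model) is to denoise a fixed-size block and let the model place an end-of-text token, discarding everything after it; longer outputs are then produced block by block. A tiny sketch of the trimming step, with made-up tokens:

    EOS, PAD = "<eos>", "<pad>"

    # Pretend this fixed 12-token block came out of the denoiser.
    block = ["write", "a", "haiku", "about", "rain", EOS,
             PAD, "noise", PAD, PAD, "rain", PAD]

    def trim(block):
        """Keep everything up to the first end-of-text token; the rest is discarded,
        so the model never has to predict the exact length in advance."""
        out = []
        for tok in block:
            if tok == EOS:
                break
            out.append(tok)
        return out

    print(" ".join(trim(block)))   # -> "write a haiku about rain"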
This is something I have been thinking about integrating into a sampler for standard autoregressive LLMs. The idea is to buffer N context tokens from the ongoing autoregressive generation. Then, every K tokens, a section of this buffer (or perhaps the whole buffer) could be processed by a diffusion model, guided by one or more specific commands to operate on that buffered text.
One application I envision for this kind of sampler, leveraging the diffusion model's capabilities, would be to detect and potentially correct instances of post-hoc reasoning within the buffer. The diffusion model could then help ensure that proper causal reasoning chains are established in that segment before the autoregressive model continues generating. You could also allow for slight, controlled backtracking or revision within that buffer window if the generation starts to go off-track, again using the diffusion model to smooth or adjust the text before committing it and moving forward.
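A control-flow sketch of that sampler, with both models as stand-ins (the refiner here is a no-op placeholder, and names like diffusion_refine are hypothetical): generate autoregressively, and every K tokens hand the trailing N-token window to the refiner before committing it and continuing:

    import random

    random.seed(0)
    WORDS = "so the result follows because we first assumed the premise".split()
    N_BUFFER, K_EVERY, TOTAL = 8, 4, 20

    def ar_next(ctx):                        # stand-in autoregressive generator
        return random.choice(WORDS)

    def diffusion_refine(window, instruction):
        """Stand-in refiner: a real one would rewrite the window conditioned on the
        instruction (e.g. 'fix post-hoc reasoning'); here it returns it unchanged."""
        return window

    ctx = []
    for i in range(TOTAL):
        ctx.append(ar_next(ctx))
        if (i + 1) % K_EVERY == 0:
            window = ctx[-N_BUFFER:]
            refined = diffusion_refine(window, "ensure causal reasoning order")
            ctx[-len(window):] = refined     # commit the (possibly revised) window

    print(" ".join(ctx))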
Anyone able to summarize the current 'hold up' with diffusion models? I know exactly how Transformers work, but I'm not a diffusion expert. Diffusion is so much more powerful tho (from what I know) it seems like diffusion would already be beating Transformers. Why isn't it?
Nit: Diffusion isn't in place of transformers, it's in place of autoregression. Prior diffusion LLMs like Mercury [1] still use a transformer, but there's no causal masking, so the entire input is processed all at once and the output generation is obviously different. I very strongly suspect this is also using a transformer.
[1] https://www.inceptionlabs.ai/introducing-mercury
Serious question: Why does it appear that pages from this URL very often end up on top of HN? I don't find the content particularly special compared to the average HN post. Does the algorithm prefer certain URLs?
What is this blog spam doing here? It has literally no new information compared to the official release page. It would make a lot more sense to change the link to https://deepmind.google/models/gemini-diffusion/ to discuss the topic.
It's just that for N tokens, an autoregressive model has to make N sequential steps, whereas diffusion does K x N work, with the N done in parallel and K << N.
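A quick back-of-the-envelope with made-up latencies, just to show where the win comes from:

    # Hypothetical numbers: per-pass latency and output length.
    n_tokens = 1024          # N: tokens to generate
    ar_pass_ms = 10          # one autoregressive step produces one new token
    diff_steps = 32          # K: denoising steps, each covering all N positions
    diff_pass_ms = 30        # one denoising pass is pricier, but there are far fewer

    print(f"AR:        {n_tokens * ar_pass_ms} ms  ({n_tokens} sequential passes)")
    print(f"Diffusion: {diff_steps * diff_pass_ms} ms  ({diff_steps} parallel passes)")
    # Total work is roughly K * N token predictions for diffusion vs N for AR,
    # so the win is latency and hardware utilization, not fewer FLOPs.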
This makes me wonder how well they will scale to many users, since batching requests would presumably saturate the accelerators much faster?
Although I guess it depends on the exact usage patterns.
Anyway, very cool demo nonetheless.
- 1st message (empty context): 857 tok/s
- 2nd message (2244 tokens in context): 727 tok/s
- 3rd message (2244+1398 tokens in context): 693 tok/s
I'm no expert in diffusion models but this looks like a drastic drop in speed, especially in longer chats (this was just 3 messages).
Has anyone tried making text the way we do image diffusion? What happens?
We're still a ways out from matching frontier performance.