Multi-Token Attention

(arxiv.org)

Comments

kouteiheika 17 hours ago
This is another potential improvement to the transformer architecture from Facebook (the other one that comes to mind is this one from the same authors: https://arxiv.org/abs/2405.18719), but note that it comes with a major problem that might not be obvious at first glance: it's just not usable in practice without a ton of work. It modifies the innards of the attention mechanism, so it is incompatible with Flash Attention (or any other optimized attention library), and you do not want to train anything beyond toy models without Flash Attention (the performance hit is just way too big).

There's PyTorch's FlexAttention, which could maybe make this practical, but currently it's just way too buggy.
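
For concreteness, here's a rough sketch of the problem; the shapes are made up and an average pool stands in for the learned convolution, so this is illustrative rather than the paper's code:

    import torch
    import torch.nn.functional as F

    q = k = v = torch.randn(1, 8, 4096, 64)  # (batch, heads, seq, head_dim), made-up sizes

    # Fused path: FlashAttention-style kernels never materialize the (seq x seq) scores.
    out_fused = F.scaled_dot_product_attention(q, k, v)

    # MTA-style path: the scores must be written out so nearby query/key positions can
    # be mixed before the softmax -- exactly the intermediate that fused kernels avoid.
    scores = q @ k.transpose(-2, -1) / 64 ** 0.5            # (1, 8, 4096, 4096), ~512 MiB in fp32
    scores = F.avg_pool2d(scores, 3, stride=1, padding=1)   # stand-in for the learned convolution
    out_naive = scores.softmax(-1) @ v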

bionhoward 22 hours ago
How does this compare with the Byte Latent Transformer [1]? This applies convolution after embedding, while BLT applies attention at embedding time?

1. https://ai.meta.com/research/publications/byte-latent-transf...

bigdict 22 hours ago
Sure, you can get better model performance by throwing more compute at the problem in different places. Does it improve perf on an isoflop basis?
bob1029 22 hours ago
So, we're proposing a multiplicative increase of something that already scales quadratically with the context size?

I think we've already got a bit of a bottleneck in terms of memory bandwidth utilization.
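
For a rough sense of scale (illustrative numbers, not from the paper), the score tensor that such a convolution has to read and write gets big fast:

    # Back-of-the-envelope: per-layer attention-score tensor in fp16 for a
    # hypothetical 8k-context, 32-head model (purely illustrative figures).
    seq_len, n_heads, bytes_per_elem = 8192, 32, 2
    score_bytes = seq_len ** 2 * n_heads * bytes_per_elem
    print(f"{score_bytes / 2**30:.1f} GiB per layer per sequence")  # -> 4.0 GiB

and that figure grows quadratically with the context length.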

rakejake 16 hours ago
Interesting. So they convolve the k, v, q vectors? I have been trying the opposite.

I have been working on a classification problem on audio data (with a context size somewhere between 1000 and 3000, and the potential to expand later), and have been experimenting with adding attention on top of a CNN for that task.

I tried training a vanilla transformer, but at the sizes I am aiming for (5-30M parameters) the training is incredibly unstable and doesn't achieve the performance of an LSTM.

So I went back to CNNs, which are fast to train but don't achieve the losses of LSTMs (which are much slower to train, and for higher context sizes you run into the vanishing gradient problem). A CNN-GRU hybrid worked much better, giving me my best result.

The GRU layer I used had a size of 512. For increasing context sizes, I'd have to make the convolutional layers deeper so as not to let the GRU grow too large. Instead, I decided to swap out the GRU for a MultiHeadAttention layer. The results are great - better than the CNN-GRU (my previous best). Plus, for equivalent sizes the model is faster to train, though it hogs a lot of memory.
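
A minimal sketch of that kind of hybrid, assuming spectrogram-like input; the channel counts, kernel sizes and pooling below are hypothetical rather than the actual settings described above:

    import torch
    import torch.nn as nn

    class CNNAttentionClassifier(nn.Module):
        """CNN front-end with MultiheadAttention in place of the GRU (illustrative sizes)."""
        def __init__(self, n_features=64, n_classes=10, d_model=512, n_heads=8):
            super().__init__()
            # The conv stack downsamples the ~1000-3000-step sequence before attention.
            self.conv = nn.Sequential(
                nn.Conv1d(n_features, 256, kernel_size=5, stride=2, padding=2), nn.GELU(),
                nn.Conv1d(256, d_model, kernel_size=5, stride=2, padding=2), nn.GELU(),
            )
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)
            self.head = nn.Linear(d_model, n_classes)

        def forward(self, x):                                 # x: (batch, time, n_features)
            h = self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time/4, d_model)
            a, _ = self.attn(h, h, h)                         # self-attention over CNN features
            h = self.norm(h + a)
            return self.head(h.mean(dim=1))                   # mean-pool over time, then classify

    logits = CNNAttentionClassifier()(torch.randn(4, 2000, 64))  # -> (4, 10)

Downsampling in the conv stack first means the attention only sees a few hundred steps rather than a few thousand, which keeps its quadratic cost (and memory appetite) in check.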

fabmilo 20 hours ago
We have to move past tokenization for the next leap in capabilities. All this work done on tokens, especially in the RL optimization context, is just local optimization alchemy.
cgearhart 20 hours ago
Why is there an expectation that “nearby” tokens are relevant to increase the information in the similarities? That seems like it would hold true within individual words, but the whole point of attention was to solve long range dependencies. Reintroducing local windows seems like a step backwards to me.
curiousfiddler 19 hours ago
So, why would this extract more semantic meaning than multi-head attention? Isn't the whole point of multiple heads similar to how CNNs use multiple types of filters to extract different semantic relationships?
jwilber 22 hours ago
Achieved by “applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other's attention weights for more precise attention”
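
One way to read that description, as a sketch only - the kernel sizes, the post-softmax placement of the head mixing, and the omission of masking and renormalization are my assumptions, not the paper's exact formulation:

    import torch
    import torch.nn as nn

    class ConvMixedAttention(nn.Module):
        """Sketch: convolve attention scores over (query, key) and mix them across heads."""
        def __init__(self, n_heads=8, d_head=64, kq=5, kk=5):
            super().__init__()
            # Depthwise 2D conv over each head's (query, key) score plane.
            self.qk_conv = nn.Conv2d(n_heads, n_heads, (kq, kk),
                                     padding=(kq // 2, kk // 2), groups=n_heads)
            # 1x1 conv mixing attention weights across heads.
            self.head_mix = nn.Conv2d(n_heads, n_heads, kernel_size=1)
            self.scale = d_head ** -0.5

        def forward(self, q, k, v):                          # each (batch, heads, seq, d_head)
            scores = q @ k.transpose(-2, -1) * self.scale    # (B, H, S, S)
            scores = self.qk_conv(scores)                    # nearby queries/keys influence each other
            attn = scores.softmax(dim=-1)
            attn = self.head_mix(attn)                       # heads share attention patterns
            return attn @ v

Causal masking and renormalization are left out here; as the top comment notes, modifying the attention innards like this is what makes it hard to use the existing fused kernels.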

Cool to see convolutions making such a comeback lately in the LLM world. See also the recent StripedHyena 2 architecture, which uses the conv-based Hyena operator to great success:

https://arxiv.org/abs/2503.01868

antonkar 20 hours ago
There is a planet-wide, eternal, 100% safe AI solution that can be a billion-dollar startup, too:

Put all the GPUs in a cloud (or clouds) controlled by international scientists (now you can use your GPU from any device, you can earn money by renting it out when you don't need it, and nothing changes except that you need to be online to use it - but we'll have 5G and better worldwide). You can develop, sell or release free math-proven safe AI models in this cloud "AI App Store", etc.

Because the main risk is an AI agent botnet - current GPUs are like nukes that are 100% unprotected - any hacker can make a virus with an AI agent component just to steal money. This AI will not be aligned at all, and will become a perpetual and eventually autonomous botnet.