BERTs Are Generative In-Context Learners
(arxiv.org) | 104 points by fzliu 18 hours ago | 26 comments

Comments

We found that Google's T5 models, released in 2019 (pre-GPT-3), were "secretly" capable of in-context learning with a simple inference technique.
Given that they use a bidirectional MLM (masked language modeling) objective, it wasn't obvious how to do this, but MLM objectives are known to produce better language representations than causal (next-token prediction) objectives. We were able to outperform much larger GPT-3 models, or come very close to their performance, with far smaller T5 models.
The "embarrassingly simple inference technique" is to put a bunch of [MASK] tokens at the end of the prompt.
I'm having trouble understanding whether this paper is saying anything new. The original BERT paper already compared it favourably to causal models including GPT. Was there any doubt that BERT-style models could be in-context learners?
From what I gather as a non-expert, the problem with BERT is scaling/training efficiency: GPT gets C-1 training examples out of a training input of length C (one next-token prediction per position), but BERT only gets about 0.15*C, since only the ~15% of tokens that are masked contribute to the loss. Indeed, the author points out that DeBERTa required 3x more compute than GPT-3 to achieve the level of performance reported, which makes sense.
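To make that gap concrete, here is the back-of-the-envelope arithmetic the comment is gesturing at; the 15% masking rate is the standard BERT setting and the sequence length is an arbitrary illustrative choice.

    C = 1024                     # tokens per training sequence (arbitrary choice)
    causal_targets = C - 1       # a causal LM predicts every next token
    mlm_targets = int(0.15 * C)  # an MLM only gets a loss on the ~15% masked positions
    print(causal_targets / mlm_targets)  # roughly 6.7x more supervision per sequence for the causal LM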
As someone who has very limited understanding but has tried to use BERT for classification: is BERT still relevant when compared to LLMs? Asking because I hardly see any mention of BERT anymore.