BERTs Are Generative In-Context Learners

(arxiv.org)

Comments

patelajay285 13 hours ago
We found the same result a few years ago in our ICLR paper: https://arxiv.org/pdf/2209.14500

We found that Google's T5 models, which were released in 2019 (pre-GPT-3), were "secretly" capable of in-context learning with a simple inference technique.

Given that they use a bidirectional MLM (Masked Language Modeling) objective, it wasn't obvious how to do this, but MLM objectives are known to produce better language representations than causal (next-token prediction) objectives. With far smaller T5 models, we were able to outperform much larger GPT-3 models, or come very close to their performance.
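
For anyone curious what this looks like mechanically: T5 is pretrained to fill in spans marked by sentinel tokens, so you can coax a completion out of it by ending a few-shot prompt with a sentinel and letting it generate the missing span. Below is a minimal sketch of that idea using Hugging Face transformers; the checkpoint and prompt are placeholders, and the paper describes the actual decoding procedure.

    # Sketch: few-shot prompting of T5 by ending the prompt with its
    # <extra_id_0> sentinel token and generating the "masked" span.
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-large")
    model = T5ForConditionalGeneration.from_pretrained("t5-large")

    prompt = (
        "Review: a touching, beautifully acted film. Sentiment: positive\n"
        "Review: a tedious, overlong mess. Sentiment: negative\n"
        "Review: sharp writing and a great cast. Sentiment: <extra_id_0>"
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=5)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))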

jsenn 13 hours ago
The "embarrassingly simple inference technique" is to put a bunch of [MASK] tokens at the end of the prompt.

I'm having trouble understanding whether this paper is saying anything new. The original BERT paper already compared it favourably to causal models including GPT. Was there any doubt that BERT-style models could be in-context learners?

From what I gather as a non-expert, the problem with BERT is scaling/training efficiency: GPT gets C-1 prediction targets out of a training input of length C, but BERT only gets about 0.15*C (the masked positions). Indeed, the author points out that DeBERTa required 3x more compute than GPT-3 to reach the reported level of performance, which makes sense.
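
To put numbers on that, assuming BERT's standard 15% masking rate and a 1024-token training sequence:

    # Prediction targets per training sequence (back-of-the-envelope).
    C = 1024                      # tokens in one training sequence
    gpt_targets = C - 1           # causal LM: every next token is a target
    bert_targets = int(0.15 * C)  # MLM: only the ~15% masked positions
    print(gpt_targets, bert_targets)  # 1023 vs. 153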

Ldorigo 13 hours ago
Do we know whether the current SOTA foundation models (Gemini, GPT-4o, Claude, etc.) are actually all GPT-style (as in, causal) models?

srameshc 14 hours ago
As someone who has a very limited understanding but has tried to use BERT for classification: is BERT still relevant compared to LLMs? Asking because I hardly see any mention of BERT anymore.