Jun 23, 2019
BERT is not really an auto-encoder, in the sense that predictions for non-masked tokens are ignored in the training loss. That’s a very important distinction. It also uses a (non-generative) next-sentence prediction task, which is another form of self-supervision.
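To make the point concrete, here is a minimal sketch (not BERT’s actual code) of how a masked-LM loss can be restricted to masked positions. It assumes PyTorch and the common convention of labeling non-masked positions with -100, which `cross_entropy` ignores by default; the tensor names and sizes are purely illustrative.

```python
import torch
import torch.nn.functional as F

vocab_size = 8
seq_len = 5

# Hypothetical model outputs and inputs for one short sequence.
logits = torch.randn(seq_len, vocab_size)              # predicted token scores per position
token_ids = torch.randint(0, vocab_size, (seq_len,))   # original (unmasked) tokens
masked = torch.tensor([0, 1, 0, 0, 1], dtype=torch.bool)  # which positions were masked out

labels = token_ids.clone()
labels[~masked] = -100  # ignore_index: non-masked positions contribute nothing to the loss

loss = F.cross_entropy(logits, labels)  # averaged over masked positions only
print(loss)
```

Only the two masked positions enter the loss here; the model’s (possibly good or bad) predictions for the other three tokens never affect the gradient, which is the sense in which BERT is not reconstructing its full input like a classic auto-encoder.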