Hacker News

radarsat1 2 months ago

This reminds me a lot of the tricks used to turn BERT into a generative model. I guess the causal masking, which keeps it essentially autoregressive, is an important difference though. Kind of the best of both worlds.

krackers a month ago

Masked language modeling has been compared loosely to text diffusion [1], so the paper's title claim may be loosely true in some sense even if it's misleading.

[1] https://nathan.rs/posts/roberta-diffusion/
