Kind of. Autoencoders don’t need an embedding that’s smaller than the input. Their only requirement is that they compress information, which is what creates reconstruction loss. Typically, however, they aren’t trained this way because they don’t converge. Transformers do the same thing, but they can squeeze many more bits of information through in one pass because of the way they’re designed. This holds true even for decoder-only networks, because they’re still doing the same thing.

If the embedding isn’t smaller than the input, how is it compressing information? It might lose information in its mapping to the embedding space, but in my understanding, compression by definition means using fewer bits than the original to hold the same information. As such, the embedding space must be smaller.
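
To make the disagreement concrete, here is a rough sketch of how "smaller in dimensions" and "smaller in bits" can come apart. Everything in it is an assumption for illustration: the helper `bits_per_vector`, the 4-dimensional 8-bit input, and the 16-dimensional binary embedding are made up, and real autoencoders restrict capacity in softer ways (noise, sparsity, regularization) rather than by hard quantization.

```python
import math

def bits_per_vector(num_dims: int, levels_per_dim: int) -> float:
    """Upper bound on the bits a vector can carry when each dimension
    takes one of `levels_per_dim` distinct values (hypothetical helper)."""
    return num_dims * math.log2(levels_per_dim)

# Assumed input: 4 dimensions, each an 8-bit value (256 levels).
input_bits = bits_per_vector(num_dims=4, levels_per_dim=256)

# Assumed embedding: 16 dimensions, but each one is binary (2 levels).
embedding_bits = bits_per_vector(num_dims=16, levels_per_dim=2)

print(input_bits, embedding_bits)  # 32.0 16.0
```

Under these assumptions the embedding has four times as many dimensions as the input but can hold at most half as many bits, so both statements can be true at once: the embedding is not smaller in dimension, yet it is smaller in the bit-count sense that compression requires.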