They trained it to be used like any other decoder-only model, so essentially text generation. But you could use the encoder part for things like classification without much effort. Then again, you can also slap a classifier head on any decoder model. The main reason they seem to be doing this is to have swappable encoder/decoder parts in an otherwise standard LLM, but I'm not sure that's really something we needed.
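For what it's worth, the "classifier on the encoder" route is roughly this; a minimal sketch assuming Hugging Face transformers, with the checkpoint name and label count as placeholders:

```python
# Mean-pool the T5 encoder's hidden states and feed a small linear
# classification head. "t5-small" and the 2 labels are placeholders.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")
head = nn.Linear(encoder.config.d_model, 2)  # untrained head, purely illustrative

inputs = tokenizer("this movie was great", return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, d_model)
mask = inputs["attention_mask"].unsqueeze(-1)      # ignore padding positions
pooled = (hidden * mask).sum(1) / mask.sum(1)      # mean pooling -> (1, d_model)
logits = head(pooled)                              # (1, num_labels)
```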
Encoder/decoder models are much, much more efficient for finetuning and inference than decoder-only models.
Historically, T5 models have been good when you finetune them into task-specific models (translation, summarization, etc.).
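That task-specific finetuning looks roughly like this; a sketch assuming Hugging Face transformers, with "t5-small", the learning rate, and the example pair purely illustrative:

```python
# One finetuning step on an input/output pair (summarization-style).
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

src = tokenizer("summarize: The quick brown fox jumped over the lazy dog.",
                return_tensors="pt")
tgt = tokenizer("A fox jumps over a dog.", return_tensors="pt")

out = model(input_ids=src.input_ids,
            attention_mask=src.attention_mask,
            labels=tgt.input_ids)  # teacher forcing; cross-entropy on target tokens
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```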
I have actually worked on encoder-decoder models. The issue is that finetuning itself is becoming a thing of the past, at least for text processing. If you spend a ton of effort today to finetune on a particular task, chances are you would have reached the same performance using a frontier LLM with the right context in the prompt. And if a big model can do it today, in 12 months there will be a super cheap and efficient model that can do it as well. For vision you can still beat them, but only with huge effort, and the gap is shrinking constantly. And T5 is not even multimodal. I don't think these will change the landscape in any meaningful way.
This T5 is multimodal.
Also a hint: these days you can pretty easily create a finetuning dataset from a frontier LLM to finetune those T5 models, effectively distilling the big model into them pretty fast.
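Something like this; `query_frontier_llm` is a hypothetical stand-in for whatever API you actually call, and the JSONL file is just one convenient format:

```python
# Build a distillation dataset: run your raw inputs through a frontier model
# and save (input, target) pairs to finetune a T5-style model on later.
import json

def query_frontier_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to your frontier model of choice.
    return "placeholder teacher output"

raw_inputs = ["Summarize: ...", "Classify the sentiment: ..."]  # your unlabeled data

with open("distill_train.jsonl", "w") as f:
    for text in raw_inputs:
        target = query_frontier_llm(text)  # the teacher's answer becomes the label
        f.write(json.dumps({"input": text, "target": target}) + "\n")
```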
The only thing it buys you is a more “natural” embedding, i.e. the encoder can get you a bag o’ floats representing a text, but that doesn’t mean it’s naturally a good embedding engine - I strongly suspect you’d need further training.
The decoder gets you the autoregressive generation you’d use for an LLM.
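For the embedding part, a minimal sketch of pulling that bag o’ floats out of the encoder (assuming Hugging Face transformers; the checkpoint is a placeholder, and as said above an untuned encoder isn’t necessarily a good embedding model):

```python
# Mean-pool T5 encoder states into one fixed-size vector per text and compare
# two texts with cosine similarity.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, d_model)
    mask = inputs["attention_mask"].unsqueeze(-1)      # ignore padding positions
    return (hidden * mask).sum(1) / mask.sum(1)        # (1, d_model)

sim = F.cosine_similarity(embed("a cat sat on the mat"),
                          embed("a kitten is on the rug"))
print(sim.item())
```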
Beyond that, there’s the supposed advantage of small LLMs training better, but they kinda hit a wall a year or two ago IMHO. E.g. the original Gemma 3 small models were short-context and text-only.
As far as I understand, you have to pay for that with a 2x inference cost at runtime.
(Would be happy to be corrected on any of the above. I maintain a multi-platform app that has llama.cpp inference in addition to standard LLMs, and I do embeddings locally, so I’m operating from practical understanding more than an ML PhD.)
In general, encoder+decoder models are much more efficient at inference than decoder-only models because they run over the entire input all at once (which leverages parallel compute more effectively).
The issue is that they're generally harder to train (they need input/output pairs as a training dataset) and don't naturally generalize as well.
> In general, encoder+decoder models are much more efficient at inference than decoder-only models because they run over the entire input all at once (which leverages parallel compute more effectively).
Decoder-only models also do this; the only difference is that they use masked (causal) attention.
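To make that concrete, a tiny PyTorch illustration of the only real difference, the attention mask (sequence length is arbitrary):

```python
# An encoder attends with a full mask (every token sees every token), while a
# decoder-only model uses a causal, lower-triangular mask during training/prefill.
import torch

seq_len = 4
full_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)  # encoder-style attention
causal_mask = torch.tril(full_mask)                         # decoder-only: token i sees tokens <= i
print(causal_mask.int())
```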