So is this basically a task-specific MoA transformer arch with a DNN that helps make routing decisions? Trying to understand this.
The other way round: task-specific DNNs adapted to share the same vector space as omni-transformers with generalized vision and audio encoders.
E.g. for an OCR task, the first pass is handled by the CNN. Its output is converted to shared tokens that the transformer can consume and correct if needed, and a decoder then handles both the DNN and the transformer output.
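A rough sketch of that OCR flow in plain Python, just to make the data path concrete. Everything here is a hypothetical stand-in and not the actual implementation: the dimensions `D_CNN` and `D_SHARED`, the random `adapter` matrix (in place of a learned projection), the toy `cnn_first_pass` and `transformer_refine` functions, and the choice to have the decoder see both streams by concatenation are all assumptions.

```python
import random

D_CNN = 4      # assumed feature width of the task-specific CNN
D_SHARED = 6   # assumed width of the shared token space

def make_projection(d_in, d_out, seed=0):
    """Random linear map standing in for a learned adapter."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]

def project(features, weights):
    """Map each CNN feature vector into the shared token space."""
    d_out = len(weights[0])
    return [
        [sum(f[i] * weights[i][j] for i in range(len(f))) for j in range(d_out)]
        for f in features
    ]

def cnn_first_pass(image_patches):
    """Toy stand-in for the OCR CNN: one feature vector per patch."""
    return [[sum(p) / len(p)] * D_CNN for p in image_patches]

def transformer_refine(tokens):
    """Toy stand-in for the transformer: mix each token with the sequence mean."""
    n = len(tokens)
    mean = [sum(t[j] for t in tokens) / n for j in range(D_SHARED)]
    return [[0.5 * t[j] + 0.5 * mean[j] for j in range(D_SHARED)] for t in tokens]

def decode(dnn_tokens, refined_tokens):
    """Toy decoder input: the DNN and transformer streams side by side."""
    return [d + r for d, r in zip(dnn_tokens, refined_tokens)]

patches = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
adapter = make_projection(D_CNN, D_SHARED)
shared_tokens = project(cnn_first_pass(patches), adapter)   # CNN -> shared tokens
refined = transformer_refine(shared_tokens)                 # transformer pass
decoder_input = decode(shared_tokens, refined)              # decoder sees both
```

The point of the sketch is only the ordering: the CNN runs first, the adapter puts its output into the shared space, and the transformer works on those tokens rather than routing between experts.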