Hacker News

The other way round, task specific DNNs adapted to share the same vector space as omni-transformers with generalized vision, audio encoders.

E.g. For an OCR task, the first pass will be handled by the CNN, converted to shared tokens which the transformer can consume, correct any issues if needed and a decoder that can handle both the DNN and transformer output.