Hacker News

> We initialized it from a scaled-down version of GPT‑4o to take advantage of GPT models’ existing knowledge, then further trained it on a dataset composed mostly of protein sequences, along with biological text and tokenized 3D structure data, elements most protein language models omit.

> A large portion of the data was enriched to contain additional contextual information about the proteins in the form of textual descriptions, co-evolutionary homologous sequences, and groups of proteins that are known to interact.

These bits made me wonder what would have happened if they had only used the supplementary biological data with an untrained LLM model.