I remember reading an article on evolving FPGA designs (link below). The evolved optimum, however, only worked on the specific FPGA it was evolved on, because the algorithm had come to rely on out-of-spec "features" of that particular chip. Obviously that can be fixed with proper constraints - e.g. by scoring candidates across several devices, as in the sketch below - but it seems like a trap that could be stepped into again, i.e. the LLM is now really fast, but only on GPUs that come from the same batch of wafers.
https://www.researchgate.net/publication/2737441_An_Evolved_...
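As a toy illustration of that "proper constraints" fix, here's a minimal sketch (hypothetical, Python, with a simulated per-chip quirk standing in for real hardware measurement): if fitness is the worst score across a batch of devices, designs that exploit one chip's quirks stop winning.

    import random

    random.seed(0)

    # Hypothetical stand-in for "load design onto chip and measure it".
    # Each simulated chip has a device-specific quirk; a design scores its
    # portable quality, plus a bonus only if it matches that chip's quirk.
    def measure(design, chip_quirk):
        portable, exploit = design
        return portable + (1.0 if exploit == chip_quirk else 0.0)

    # Tiny (mu+lambda)-style loop: keep the better half, mutate to refill.
    def evolve(fitness, generations=200, pop_size=32):
        pop = [(random.random(), random.randrange(8)) for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[: pop_size // 2]
            pop = parents + [
                (min(1.0, p + random.gauss(0, 0.05)), random.randrange(8))
                for (p, _e) in random.choices(parents, k=pop_size - len(parents))
            ]
        return max(pop, key=fitness)

    chips = [random.randrange(8) for _ in range(5)]  # a "batch" of devices

    # Overfit: fitness from one chip only -> winner may rely on its quirk.
    single = evolve(lambda d: measure(d, chips[0]))

    # Constrained: fitness is the WORST score across the batch.
    robust = evolve(lambda d: min(measure(d, c) for c in chips))

    print("single-chip winner scores:", [round(measure(single, c), 2) for c in chips])
    print("multi-chip winner scores: ", [round(measure(robust, c), 2) for c in chips])

The single-chip winner typically scores high on chips[0] and drops elsewhere; the min-across-devices winner gives up the quirk bonus and scores uniformly. Same idea for the GPU case: benchmark across hardware samples, not one.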