What is there to improve? The transformer architecture is extremely simple. Are you going to add another KV layer? Tweak the nonlinearities? Add 1 to one of the dimensions? Inject some weird extra layer (which the existing weights could have represented anyway, per the Kolmogorov–Arnold representation theorem / universal-approximation style arguments)?
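For concreteness, here's roughly what one block looks like in PyTorch (a sketch; the dimensions and layer choices are illustrative, not tied to any particular model). Basically every knob people talk about turning is visible in a dozen lines:

```python
# A minimal pre-norm transformer block: the "knobs" are essentially the
# projection shapes, the number of heads, the nonlinearity, and where the
# normalization sits. Dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),               # <- "tweak the nonlinearities" happens here
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        # attention sublayer with residual connection
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        # feed-forward sublayer with residual connection
        x = x + self.mlp(self.ln2(x))
        return x
```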

Realistically, the best you could do is evolve the prompt. Maybe you could also change the input-data preprocessing?
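"Evolve the prompt" really is about as simple as it sounds; something like this toy loop (score_prompt and mutate_prompt are hypothetical stand-ins of my own; in practice the score would come from running the model on an eval set and the mutations from another model rewriting the prompt):

```python
# Toy "evolve the prompt" loop: greedy hill-climbing over prompt strings.
import random

MUTATIONS = [
    "Think step by step. ",
    "Answer concisely. ",
    "Double-check your arithmetic. ",
]

def mutate_prompt(prompt: str) -> str:
    # hypothetical: prepend a random instruction fragment
    return random.choice(MUTATIONS) + prompt

def score_prompt(prompt: str) -> float:
    # hypothetical stand-in: in practice, run the model with this prompt on a
    # held-out eval set and return accuracy. Here, a meaningless toy score so
    # the loop actually runs.
    return ("step by step" in prompt.lower()) - 0.01 * len(prompt)

def evolve(seed_prompt: str, generations: int = 50) -> str:
    best, best_score = seed_prompt, score_prompt(seed_prompt)
    for _ in range(generations):
        candidate = mutate_prompt(best)
        s = score_prompt(candidate)
        if s > best_score:        # greedy; a population-based variant works too
            best, best_score = candidate, s
    return best

print(evolve("Solve the problem.", generations=20))
```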

Anyway, the idea of current LLM architectures self-improving via their own code seems silly: there are surprisingly few knobs to turn, and training is extremely expensive.

As a side note, it's impressive how resistant the current architecture is to incremental RL away from specific results: if even one "undesired" output spans multiple tokens, the coupling between those tokens is difficult to disentangle. (How do you suppress "Jinping" without also touching "Jin-Gitaxias", for example, when they may share tokens?)
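You can see the coupling concretely just by looking at how the two strings tokenize (a sketch using tiktoken's cl100k_base encoding; I'm not asserting the exact splits, that's whatever the tokenizer decides):

```python
# Two unrelated strings can share token pieces, so a gradient pushing down
# one sequence can touch logits the other depends on.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for s in ["Xi Jinping", "Jin-Gitaxias"]:
    ids = enc.encode(s)
    print(s, ids, [enc.decode([i]) for i in ids])

# If both strings happen to contain a shared token (something like "Jin"),
# then RL pressure against one completion also nudges the other --
# per-token credit assignment is entangled.
```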

I'd like to see what happens if you change the K/V projection matrices into 3-dimensional tensors.
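One way to read that (purely a sketch under my own assumptions: single head, K only, no claim this is the right formulation): keep queries and values linear, but make the key projection a rank-3 tensor so keys are bilinear, i.e. quadratic, in the input.

```python
# Replace the linear key projection k = x @ W_K (W_K: [d_model, d_k]) with a
# bilinear one, k_j = x^T T_j x, where T has shape [d_k, d_model, d_model].
# Shapes and the single-head setup are illustrative assumptions.
import torch

batch, seq, d_model, d_k = 2, 16, 64, 32
x = torch.randn(batch, seq, d_model)

W_q = torch.randn(d_model, d_k) / d_model**0.5       # ordinary query projection
T_k = torch.randn(d_k, d_model, d_model) / d_model   # rank-3 "K tensor"
W_v = torch.randn(d_model, d_k) / d_model**0.5       # ordinary value projection

q = x @ W_q                                          # [batch, seq, d_k]
k = torch.einsum("bti,kij,btj->btk", x, T_k, x)      # keys quadratic in x
v = x @ W_v

attn = torch.softmax(q @ k.transpose(-2, -1) / d_k**0.5, dim=-1)
out = attn @ v                                       # [batch, seq, d_k]
```

The obvious cost: that key tensor is d_k * d_model^2 parameters per layer instead of d_model * d_k, so you'd pay a lot for whatever extra expressivity you get.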