There isn't enough input electrodes to encode a doom frame into the multi electrode array without compression.
That's all the artificial neural networks are doing.
If we could have gotten an MEA with 320x200 electrodes we wouldn't have used any encoding and just let the neurons figure it out. Instead it is an 8x8 grid.
We've got LLMs that seem to be smarter than anyone I'm talking to day-to-day and one useful model of them is just "compression". Compression is turning out to be a pretty key operation in intelligence and understanding (in fact, it seems to be intelligence and understanding in key ways). If we compress Doom into "shoot" and "the press buttons in the most favourable way to the player" then good compression could let a fair coin play Doom well if someone flips it fast enough.
I mean maybe ANN just means sampling the screen in which case I'm not sure why we're talking about it as a "network". But the type of compression seems critical.
Have I watched any of the videos or read the code? No I have not.