I was responding to their claim by granting, for the sake of argument, that it might be correct. I don't have the cost data myself. So, I'll assume what you say is true.
That leaves the computation and memory cost of the two passes (forward and backward) plus interlayer communication.
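To put rough numbers on the two passes, here's the standard back-of-the-envelope estimate (a rule of thumb I'm supplying, not cost data from this thread): the forward pass costs about 2 FLOPs per parameter per token, and the backward pass roughly twice that.

    # Rough FLOP estimate for dense training (the common ~6*N*D approximation;
    # my own illustration, not figures from the discussion above).
    def training_flops(n_params: float, n_tokens: float) -> float:
        forward = 2 * n_params * n_tokens   # forward pass
        backward = 4 * n_params * n_tokens  # backward pass, roughly 2x forward
        return forward + backward           # ~6 * N * D total

    # Example: a 7B-parameter model trained on 1T tokens
    print(f"{training_flops(7e9, 1e12):.2e} FLOPs")  # ~4.2e22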
I don't think backpropagation occurs in the brain, since the brain appears to use local learning, though global optimization probably happens during sleep/dreaming. I have a lot of papers on removing backpropagation, Hebbian learning, and "local learning rules."
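For anyone unfamiliar with what "local" means here, a minimal sketch (Oja's variant of the Hebbian rule, chosen by me as an example rather than taken from any specific paper): each synapse updates from its own pre- and post-synaptic activity, with no global error signal propagated backward.

    import numpy as np

    def hebbian_update(w, pre, post, lr=0.01):
        # Oja's rule: Hebbian outer-product term plus a decay that keeps
        # the weights from growing without bound. Purely local information.
        return w + lr * (np.outer(post, pre) - (post ** 2)[:, None] * w)

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(4, 8))  # 8 inputs -> 4 linear units
    x = rng.normal(size=8)                  # pre-synaptic activity
    y = w @ x                               # post-synaptic activity
    w = hebbian_update(w, x, y)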
From there, many groups are publishing methods for training at 8-bit precision and below. A recent paper combined low-bit training with sub-1-bit storage for the weights. The NoLayer architecture might handle interlayer communication better.
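As a hedged illustration of the basic building block (symmetric 8-bit "fake quantization," which many low-bit training schemes use in some form; the sub-1-bit storage work goes well beyond this):

    import numpy as np

    def quantize_int8(w):
        # Per-tensor symmetric quantization: map floats onto [-127, 127].
        scale = np.abs(w).max() / 127.0 + 1e-12
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(1).normal(size=(256, 256)).astype(np.float32)
    q, s = quantize_int8(w)
    w_hat = dequantize(q, s)
    print("max abs error:", np.abs(w - w_hat).max())  # quantization error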
People keep trying to build analog accelerators. There have been mismatches between what the networks need and what the analog hardware actually provides. Recent work has come up with analog NNs that map well onto analog hardware.
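One common way that gap gets handled, sketched here as my own illustration rather than any particular paper's method, is hardware-aware training: inject the kind of perturbation analog devices introduce (read noise, device variation) into the weights during the forward pass so the trained network tolerates it.

    import numpy as np

    def noisy_forward(w, x, rel_noise=0.05, rng=None):
        # Perturb the weights multiplicatively to mimic analog device noise,
        # then run an ordinary ReLU layer with the perturbed weights.
        if rng is None:
            rng = np.random.default_rng(0)
        w_analog = w * (1.0 + rel_noise * rng.standard_normal(w.shape))
        return np.maximum(w_analog @ x, 0.0)

    w = np.random.default_rng(3).normal(scale=0.1, size=(16, 32))
    x = np.random.default_rng(4).normal(size=32)
    y = noisy_forward(w, x)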
A combination of those would likely bring cost down dramatically for both inference and training, with lower energy use as well.