The L2 already keeps track of which lines are held in the L1s in order to manage coherency.

Divide the cache into "meta-caches" indexed by the virtual bits and treat them as separate caches from the L2's point of view. Duplicate the data, and if somebody writes back, invalidate all the other copies. The hardware for doing this already exists on any multicore system. Sure, you will end up duplicating data sometimes, and it will be slower if you're actually writing to aliased locations. But does that happen often enough to be a problem compared to generally having a bigger cache?
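To make the idea concrete, here is a rough toy model of what I mean. It is only a sketch: the class name `ToyL1`, the 4 KiB page size, the 64-byte line size, the single alias bit, and the write-through behaviour are all assumptions I picked for illustration, not how any real core does it.

```python
PAGE_BITS = 12   # assumption: 4 KiB pages
LINE_BITS = 6    # assumption: 64-byte cache lines
ALIAS_BITS = 1   # assumption: one virtual index bit above the page offset


class ToyL1:
    """Hypothetical L1 split into one "meta-cache" per value of the alias bits."""

    def __init__(self):
        # One dict per meta-cache, mapping physical line number -> data.
        self.banks = [dict() for _ in range(1 << ALIAS_BITS)]

    @staticmethod
    def bank_of(vaddr):
        # Select the meta-cache from the virtual bits just above the page offset.
        return (vaddr >> PAGE_BITS) & ((1 << ALIAS_BITS) - 1)

    def read(self, vaddr, paddr, memory):
        bank = self.banks[self.bank_of(vaddr)]
        line = paddr >> LINE_BITS
        if line not in bank:
            # Fill on miss. The same physical line may already sit in another
            # meta-cache, so this is where the duplication happens.
            bank[line] = memory.get(line, 0)
        return bank[line]

    def write(self, vaddr, paddr, value, memory):
        target = self.bank_of(vaddr)
        line = paddr >> LINE_BITS
        # Invalidate every other copy, the same way a coherence protocol
        # would invalidate copies held in other cores' L1s.
        for i, other in enumerate(self.banks):
            if i != target:
                other.pop(line, None)
        self.banks[target][line] = value
        memory[line] = value  # write-through, purely to keep the sketch short
```

Two virtual addresses that differ in the alias bit but map to the same physical line end up with one copy each after reading. A write through either one drops the other copy, which then has to be refilled on the next access. That refill is the slowdown for aliased writes mentioned above.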

It sounds to me like an engineering tradeoff that might or might not make sense, not a hard limit, which is what I think was being asserted. But as I also said, L1 sizes haven't increased in a while and smart people are working on it, so there is probably something I don't know.

This "divide" thing will add latency, which you really do not want on L1 hits.