I'll admit to unfamiliarity with the .NET version, but for Python even `threading.local` is a useless implementation if you care at all about performance.

Performant thread-local variables require ahead-of-time mapping to a 1-or-2-level integer sequence with a register to quickly the base array, and some kind of trap to handle the "not allocated" case. Task-local variables are worse than thread-locals since they are swapped out much more frequently.

This requires special compiler support, not being a mere library.

I would argue that if you're using Python, you already don't care about performance (unless it's just a little glue between other things).

In .NET they do virtual dispatch via a very basic map-like interface that has a bunch of micro-optimized implementations that are swapped in and out as needed if new items are added. For N up to 4 variables, they use a dedicated implementation that stores them as fields and does simple branching to access the right one, for each N. Beyond that it becomes an array, and at some point, a proper Dictionary. I don't know the exact perf characteristics, but FWIW I don't recall that ever being a source of an actual, non-hypothetical perf problem. Usually you'll have one local that is an object with a bunch of fields, so you only need one lookup to fetch that, and from there it's as fast as field access.