Part 1 was interesting; it isn't clear why he split that into a Part 2 since it adds little to the story and is a paragraph long.
I assume the fact it is a third party application means debugging gets harder, and the business case for doing so is weaker/none.
But I would hope that some kind of reverse debugger triggered on one of these crashes would make it pretty simple to say "who wrote this 01".
You usually hope that TTD points to the culprit in such situations. But I once encountered a single-byte corruption that made no sense in the TTD trace: the write stored a good value and the next read returned garbage. I never discovered whether it was a CPU bug, corruption by GPU shaders, stray kernel writes, or something else. (I think it's unlikely that a CPU bug would manifest in both native and TTD-instrumented runs. The corrupted byte was inside heap-allocated memory, so it shouldn't have been in the GPU page tables at all. Kernel writes wouldn't appear in a TTD trace, so I think that was the most likely cause, but how do you debug that...)
You could also look at modules loaded into all of those processes that crashed this way.
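A rough sketch of that idea, assuming you have already pulled a list of loaded module names out of each crash dump. The dump names, module lists, and the `system` baseline below are made up for illustration; nothing here reads a real dump.

```python
from collections import Counter


def modules_common_to_all(crashes):
    """Return the module names (lowercased) present in every crash."""
    module_sets = [{m.lower() for m in mods} for mods in crashes.values()]
    if not module_sets:
        return set()
    return set.intersection(*module_sets)


def module_frequency(crashes):
    """Count in how many crashes each module appears (case-insensitive)."""
    counts = Counter()
    for mods in crashes.values():
        counts.update({m.lower() for m in mods})
    return counts


# Hypothetical data: one module list per crashing process.
crashes = {
    "dump_a": ["ntdll.dll", "kernel32.dll", "thirdparty_hook.dll"],
    "dump_b": ["ntdll.dll", "KERNEL32.DLL", "thirdparty_hook.dll", "d3d11.dll"],
    "dump_c": ["ntdll.dll", "kernel32.dll", "thirdparty_hook.dll", "user32.dll"],
}

# Modules you expect in every process anyway, so they are not interesting.
system = {"ntdll.dll", "kernel32.dll"}

suspects = modules_common_to_all(crashes) - system
print(sorted(suspects))
```

A module that shows up in every crashing process but not in healthy ones is only a lead, not proof, since it may simply be popular.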
Might have been an “I need to look into this” segueing into “never mind”?
Part-2 is more than a paragraph and is logically distinct from Part-1. In it, Raymond gets the crucial clue from a colleague's debugging efforts, which leads him to identify that the bottom byte of the DLL's HMODULE gets overwritten by <something>, and that this is the root cause of the bug; viz.
The “DLL unmapped from memory” crash is just an alternate manifestation of the “somebody is writing 01 bytes to places they shouldn’t” bug. The original bug had a larger bucket spray than we initially thought.
Part-2 is the essence of the solution while Part-1 is a series of investigations and inferences.