Totally tangential, but I love to read post-mortems of people fixing bugs. What were the initial symptoms? What was your first theory? How did you test it? What was the resolution? Raymond Chen does this a lot and I've always enjoyed it.

I learn more from these concrete case studies than from general principles (though I agree those are important too).

One of my most recent bugs was a crash bug in a thread-pool that used garbage-collected objects (this is in C++) to manage network connections. Sometimes, during marking, one of the objects I was trying to mark would be already freed, and we crashed.

My first thought was that this was a concurrency problem. We're supposed to stop all threads (stop the world) during marking, but what if a thread was not stopped? Or what if we got an event on an IO completion port somehow during marking?

I was sure that it had to be a concurrency problem because (a) it was intermittent and (b) it frequently happened after a connection was closed. Maybe an object was getting deleted during marking?

The only thing that was weird was that the bug didn't happen under stress (I tried stress testing the system, but couldn't reproduce it). In fact, it seemed to happen most often at startup, when there weren't too many connections or threads running.

Eventually I proved to myself that all threads were paused properly during marking. And with sufficient logging, I proved that an object was not being deleted during marking, but the marking thread crashed anyway.

[Quick aside: I tried to get ChatGPT to help--o3 pro--and it kept on suggesting a concurrency problem. I could never really convince it that all threads were stopped. It always insisted on adding a lock around marking to protect it against other threads.]

The one thing I didn't consider was that maybe an object was not getting marked properly and was getting deleted even though it was still in use. I didn't consider it because the crash was in the marking code! Clearly we're marking objects!

But of course, that was the bug. Looking at the code I saw that an object was marked by a connection but not by the listener it belonged to. That meant that, as long as there was a connection active, everything worked fine. But if ever there were no connections active, and if we waited a sufficient amount of time, the object would get deleted by GC because the listener had forgotten to mark it.

Then a connection would use this stale object and on the next marking pass, it would crash.
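
For anyone who wants to see the shape of that bug, here is a minimal sketch of a mark-and-sweep pass with a missing edge. It is not the original code: `Obj`, `Heap`, and `shared_survives_idle_gc` are hypothetical names, and the real collector was surely more involved. The flag `listener_marks_shared` stands in for the forgotten marking call in the listener.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical types for illustration only; not the original code.
struct Obj {
    int id = 0;
    bool marked = false;
    std::vector<Obj*> refs;  // objects this one is responsible for marking
};

struct Heap {
    std::vector<std::unique_ptr<Obj>> objects;
    int next_id = 0;

    Obj* alloc() {
        objects.push_back(std::make_unique<Obj>());
        objects.back()->id = next_id++;
        return objects.back().get();
    }

    static void mark(Obj* o) {
        if (o == nullptr || o->marked) return;
        o->marked = true;
        for (Obj* r : o->refs) mark(r);
    }

    // Mark everything reachable from the roots, then free whatever is unmarked.
    void collect(const std::vector<Obj*>& roots) {
        for (auto& o : objects) o->marked = false;
        for (Obj* r : roots) mark(r);
        objects.erase(std::remove_if(objects.begin(), objects.end(),
                          [](const std::unique_ptr<Obj>& o) { return !o->marked; }),
                      objects.end());
    }

    bool alive(int id) const {
        for (const auto& o : objects) {
            if (o->id == id) return true;
        }
        return false;
    }
};

// Returns true if the shared object survives a GC pass that runs
// while no connection is open.
bool shared_survives_idle_gc(bool listener_marks_shared) {
    Heap heap;
    Obj* listener = heap.alloc();
    Obj* shared = heap.alloc();
    Obj* conn = heap.alloc();
    const int shared_id = shared->id;

    conn->refs.push_back(shared);  // the connection marks the object
    if (listener_marks_shared) listener->refs.push_back(shared);

    heap.collect({listener, conn});  // a connection is open: everything is reachable
    heap.collect({listener});        // no connections: only the listener is a root
    return heap.alive(shared_id);
}
```

With `listener_marks_shared` set to `false`, the object survives only as long as some connection keeps it reachable, which would match the "works under stress, fails at startup" pattern.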

> Totally tangential, but I love to read post-mortems of people fixing bugs.

I know I already posted this moons ago, but... Around 1991 I made a little game similar to Cannon Ball on the MSX (which Pang / Buster Bros later copied).

I had one weird case where sometimes the game would crash. Plain crash. But sometimes after playing for 15 minutes and already passing several levels. I just couldn't find it. I couldn't reason about it. I was stuck.

So I decided to rewrite not the entire game but the part dealing with the inputs / game logic to make it 100% deterministic. It took me a long time to do that. Then eventually I could record myself playing: I recorded only the player inputs and the moment at which each one happened, which made for tiny savefiles btw.

And eventually, while I was playing and recording, the crash occurred. I tried my replay: it worked... It replayed the savefile flawlessly and the game crashed again.

At that point I knew the bug was as good as gone: being able to reproduce a bug in a deterministic way meant I was going to fix it.
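
A rough sketch of the record-and-replay idea, in C++ rather than the original C and with entirely made-up names (`InputEvent`, `Game`, `replay`): the simulation depends only on a seed and a frame-stamped input log, so replaying the same log reproduces the same state. I am assuming here that each log entry records a change in the held buttons and that the log is sorted by frame.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical log entry: at `frame`, the held buttons change to `buttons`.
struct InputEvent {
    std::uint32_t frame;
    std::uint8_t buttons;  // bit 0 = left, bit 1 = right (an assumption)
};

struct Game {
    std::uint32_t rng;
    std::int32_t player_x = 0;
    std::uint32_t checksum = 0;

    explicit Game(std::uint32_t seed) : rng(seed) {}

    // Small linear congruential generator: all randomness comes from the seed.
    std::uint32_t next_random() {
        rng = rng * 1664525u + 1013904223u;
        return rng;
    }

    // One fixed step per frame; no clocks, no uninitialized state.
    void step(std::uint8_t buttons) {
        if (buttons & 1u) --player_x;
        if (buttons & 2u) ++player_x;
        checksum = checksum * 31u + static_cast<std::uint32_t>(player_x)
                 + (next_random() >> 16);
    }
};

// Re-runs the game from frame 0, feeding the recorded inputs at their frames.
Game replay(std::uint32_t seed, const std::vector<InputEvent>& log,
            std::uint32_t total_frames) {
    Game game(seed);
    std::size_t next = 0;
    std::uint8_t held = 0;
    for (std::uint32_t frame = 0; frame < total_frames; ++frame) {
        while (next < log.size() && log[next].frame == frame) {
            held = log[next].buttons;
            ++next;
        }
        game.step(held);
    }
    return game;
}
```

The checksum is only there so two runs can be compared cheaply; the savefile itself is just the seed plus the input log.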

Turns out it was a dangling pointer (ah, C...): when the hero caught an extra allowing him to have two shots on screen at once (usually he'd only be allowed one), and the first shot killed the last ball on screen, then on the next level the second shot would somehow (due to an error on my part) continue to live its life, eventually corrupting memory.
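
Here is one way that kind of leftover state can arise, sketched with hypothetical names (`Shot`, `World`, `start_next_level`). It is not the original code and deliberately contains no actual dangling pointer, so it is safe to run: it only shows a shot staying active across a level change unless the transition clears it.

```cpp
#include <array>
#include <cstddef>

// Hypothetical game state for illustration only.
struct Shot {
    bool active = false;
    int y = 0;
};

struct World {
    std::array<Shot, 2> shots{};  // two slots once the extra is picked up
    int balls_left = 0;

    // Activates the first free shot slot, if any.
    void fire() {
        for (Shot& s : shots) {
            if (!s.active) {
                s.active = true;
                s.y = 0;
                return;
            }
        }
    }

    // A shot hits a ball: the shot is consumed and the ball is removed.
    void hit(std::size_t shot_index) {
        shots[shot_index].active = false;
        --balls_left;
    }

    // The level transition. With clear_shots == false the per-level shot
    // state leaks into the next level (the buggy variant in this sketch).
    void start_next_level(int balls, bool clear_shots) {
        balls_left = balls;
        if (clear_shots) {
            for (Shot& s : shots) s = Shot{};
        }
    }

    std::size_t active_shots() const {
        std::size_t n = 0;
        for (const Shot& s : shots) {
            if (s.active) ++n;
        }
        return n;
    }
};

// Two shots in flight, the first kills the last ball, the level changes.
std::size_t shots_carried_into_next_level(bool clear_shots) {
    World w;
    w.balls_left = 1;
    w.fire();
    w.fire();
    w.hit(0);
    w.start_next_level(3, clear_shots);
    return w.active_shots();
}
```

In the real game the surviving shot presumably still pointed at something from the previous level, which is where the memory corruption would come from; that part is a guess on the sketch's side.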

Fun stuff.

FWIW having deterministic game engines wasn't a thing back then. It became common later on, with games like Age of Empires and Warcraft III etc. using them: the obvious benefit was that the savefiles used to replay/exchange games were tinier than tiny, since they'd only save the frame at which each player input happened and replay the entire game from there [1]

I still have the code to that game, it was surprisingly fun. Someone here already offered help in the past to get it back on its feet. I've also got an executable that runs. I just don't remember how the tooling worked. Still have the BAT files etc. to build it, but not the tools. I really should look into that one of these days but I'm kinda busy with other projects / life.

[1] which also raised another issue: when game engines were upgraded, you could end up with savefiles that only played on an older version of the game, so players would exchange games and note the version of the game they were meant for
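
A minimal sketch of how a replay file could carry that version itself, assuming a hypothetical layout (`Replay`, `InputRecord`, and `check_replay` are illustrative names, not any real game's format):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical replay layout for illustration only.
struct InputRecord {
    std::uint32_t frame;
    std::uint8_t buttons;
};

struct Replay {
    std::uint32_t engine_version;  // version of the simulation that recorded it
    std::uint32_t seed;
    std::vector<InputRecord> inputs;
};

enum class LoadResult { Ok, VersionMismatch };

// A deterministic replay is only meaningful on the exact simulation that
// produced it, so anything other than an exact version match is refused.
LoadResult check_replay(const Replay& replay, std::uint32_t running_version) {
    return replay.engine_version == running_version ? LoadResult::Ok
                                                    : LoadResult::VersionMismatch;
}
```

Refusing a mismatched file up front is a design choice in this sketch; the alternative is a replay that silently desyncs partway through.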