Hacker News

I worked a bit on the extraction process so I can chime in here a bit. The first part is to just mark the x,y locations of where all the bits are, generally by the intersection of the rows and columns of the microcode array.
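To make the first step concrete, here is a minimal sketch of deriving bit locations from row and column positions. It is an illustration only, not the tooling that was actually used; `bit_centers` and the pixel coordinates are hypothetical.

```python
from itertools import product

def bit_centers(row_ys, col_xs):
    """Return one (x, y) center per row/column intersection, in row-major order."""
    return [(x, y) for y, x in product(row_ys, col_xs)]

# Hypothetical pixel coordinates for 2 rows and 3 columns of the array.
centers = bit_centers([10, 22], [5, 17, 29])
print(centers)
```

In practice the row and column lines would have to be fitted to the mosaic first, since a stitched die photo is rarely perfectly straight.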

Then you have to classify them as 0's or 1's. Each is visually distinct, a 1 being encoded by the presence of a transistor and a gap in the polysilicon. We didn't have to guess which is which: by the nature of Intel microcode we could assume 0's were much more frequent, so a transistor meant a 1.

There are some automatic tools designed to perform this work via color thresholding, but they didn't work very well here because some of the mosaic was blurry, and a lot of dust had crept in which created false 1 bits.
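A toy sketch of why plain thresholding struggles with dust. It assumes grayscale patches where a transistor looks darker than the background; that polarity, the threshold of 128, and the function names are all assumptions made for illustration, not a description of any real tool.

```python
def mean_brightness(patch):
    """Average grayscale value (0-255) over a 2D patch given as a list of rows."""
    total = sum(sum(row) for row in patch)
    count = sum(len(row) for row in patch)
    return total / count

def classify_by_threshold(patch, threshold=128):
    """Return 1 if the patch is darker than the threshold, else 0.

    Assumption: a transistor (a 1 bit) shows up as a dark region.
    """
    return 1 if mean_brightness(patch) < threshold else 0

clean_zero = [[200, 200], [200, 200]]   # bright, empty cell
real_one = [[60, 70], [65, 60]]         # dark cell with a transistor
dusty_zero = [[200, 20], [30, 200]]     # empty cell with dark dust specks on it

results = [classify_by_threshold(p) for p in (clean_zero, real_one, dusty_zero)]
print(results)
```

The dusty patch falls below the threshold and is read as a 1 even though the underlying bit is a 0, which is the kind of false positive described above.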

Instead, we trained a convolutional neural network to classify the extracted bit regions into 0's and 1's. This was overlaid back onto the original mosaic as white or black squares at 50% opacity.
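The overlay step can be sketched as a simple alpha blend. This is a hypothetical illustration on a grayscale image stored as nested lists; which color marked which bit value is not stated above, so white for 1 and black for 0 is an assumption.

```python
def blend(base, overlay, alpha=0.5):
    """Blend one grayscale value over another at the given opacity."""
    return round((1 - alpha) * base + alpha * overlay)

def overlay_bit(image, x0, y0, size, bit, alpha=0.5):
    """Draw a semi-transparent square over the bit region, modifying image in place.

    Assumption: white (255) marks a 1 and black (0) marks a 0.
    """
    color = 255 if bit else 0
    for y in range(y0, y0 + size):
        for x in range(x0, x0 + size):
            image[y][x] = blend(image[y][x], color, alpha)

# A 4x4 mid-gray test image with one predicted 1 and one predicted 0.
image = [[100] * 4 for _ in range(4)]
overlay_bit(image, 0, 0, 2, bit=1)
overlay_bit(image, 2, 2, 2, bit=0)
```

Because the squares are only half opaque, the die photo stays visible underneath, which is what makes checking the predictions by eye possible.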

Then we spent several long, tedious days just checking the results for errors. Finally we had the raw 2d array of bits - the next step is to extract the microcode words from the bit array.

Intel had given us some clues - they had written somewhere that the 386 had 2560 microcode words. The microcode array has 37 banks - each bank resolves one bit from the 37 bits that comprise a microcode word. But which way to decode them? From top down? Bottom up? Were they interleaved in weird ways?
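With 37 banks and 2560 words, the array holds 37 × 2560 = 94,720 bits. A minimal sketch of trying different read orders follows. It assumes each bank has already been flattened into a list with one bit per word, which hides the physical layout entirely; `assemble_words` and its flags are hypothetical names, and the real ordering is exactly the unknown being searched for.

```python
WORD_BITS = 37      # one bank per bit of a microcode word
WORD_COUNT = 2560   # microcode words in the 386, per Intel

def assemble_words(banks, reverse_banks=False, reverse_within_bank=False):
    """Build integer words by taking bit i of every bank as word i.

    banks: list of equal-length bit lists, one list per bank.
    reverse_banks: treat the last bank as the most significant bit.
    reverse_within_bank: read each bank from the opposite end.
    """
    if reverse_banks:
        banks = banks[::-1]
    if reverse_within_bank:
        banks = [bank[::-1] for bank in banks]
    words = []
    for i in range(len(banks[0])):
        word = 0
        for bank in banks:
            word = (word << 1) | bank[i]
        words.append(word)
    return words

# Toy example: 3 banks of 4 bits stand in for 37 banks of 2560 bits.
toy_banks = [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]]
print(assemble_words(toy_banks))
print(assemble_words(toy_banks, reverse_banks=True))
print(assemble_words(toy_banks, reverse_within_bank=True))
```

Each combination of flags yields a different candidate word list, and interleaving schemes would add further candidates on top of these.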

Documents from the NEC vs Intel lawsuit ended up describing the microcode word format for both the 8088 and NEC V20 CPUs, but unfortunately we were on our own for the 386. Still, we could take educated guesses: working off the 8088 field format, what additional microcode fields would a 386 add? What fields would expand and how many bits would they need?

We used a lot of Python scripts to decode the microcode array into 37-pixel wide, very long bitmaps, in different permutations, to see if any vertical patterns emerged that would hint at the boundaries of microcode word fields. And some did emerge!
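A sketch of what such a script might look like, not the actual scripts: it renders candidate words one per row as a plain-text PBM bitmap and computes the fraction of 1's per column. The function names and the 4-bit toy words are hypothetical; real words would use `width=37`.

```python
def words_to_rows(words, width=37):
    """Expand each integer word into a list of bits, most significant bit first."""
    return [[(w >> (width - 1 - b)) & 1 for b in range(width)] for w in words]

def to_pbm(rows):
    """Serialize rows of bits as a plain PBM (P1) image; in PBM a 1 is a black pixel."""
    header = f"P1\n{len(rows[0])} {len(rows)}\n"
    body = "\n".join(" ".join(str(bit) for bit in row) for row in rows)
    return header + body + "\n"

def column_density(rows):
    """Fraction of 1 bits in each column; sharp changes may hint at field boundaries."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

# Toy example with 4-bit words.
toy_rows = words_to_rows([0b1000, 0b1001, 0b1010], width=4)
pbm_text = to_pbm(toy_rows)
densities = column_density(toy_rows)
print(pbm_text)
```

Writing `pbm_text` to a `.pbm` file gives an image most viewers can open, so each permutation can be inspected side by side.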

GloriousCow 3 hours ago

We had also decoded the 386's match-decoder PLA, so we knew roughly where different opcodes were located in the microcode itself, which was very helpful. Some opcodes have very specific operands, so they would have unique field references. Some forms only operate on EAX/AX, for example, so if you find those instructions you have a hint of how the AX register is encoded as an operand.

Other instructions like PUSHA and POPA are implemented as loops that iterate by incrementing the fields corresponding to registers - and we know in what order they operate.

Bit by bit, relation by relation, you can puzzle out the format of the microcode. Of course, this is glossing over the enormous added complexity of protected-mode operations. This was a herculean effort by reenigne, and I don't think it is hyperbole to call it one of the more impressive human achievements I have witnessed in my lifetime.

GloriousCow 2 hours ago

The actual output of microcode disassembly is just a text file - a line of code for each microcode word, in essentially a new dialect - a type of static assembly language. reenigne had to invent names for a lot of things, which will now become the official names of these things, unless Intel ever decides to speak up and make corrections.

That language can then be translated into Verilog, and has been.