In a CPU, there are 2 kinds of branch predictors, predictors for conditional branches, which must predict only whether the branch is taken or not taken, and predictors for indirect branch instructions, which must predict the address of the next instruction that will be executed after the indirect branch instruction.

The indirect branch predictor stores information about indirect branches in a structure that is similar to a cache memory.

When the CPU loads some data from the main memory, the loaded data together with the address from where it was loaded is stored in the cache memory.

When another load instruction is executed later, the load address is used to query the cache memory and if the query is successful the data value is returned by the query and it is used instead of accessing again the main memory.

Similarly, (simplifying a lot) when an indirect branch instruction is executed the address where the execution continues after the indirect branch is stored together with the address of the branch instruction in an associative memory similar to a cache memory.

When an indirect branch instruction is executed later, the address of the branch instruction is used to query the associative memory and if the query is successful it will return the address of the next instruction to be executed and execution will continue there speculatively, until it is confirmed by reading from the main memory that this is the correct jump address. If the jump address is wrong then all the work done is discarded and execution continues at the right address.

This description is simplified, because modern branch descriptors may store for a given branch instruction address not only the last jump address, but a sequence of past jump addresses, together with a pattern about how they have been used (e.g. alternating between jumping twice to the first address and jumping once to the second address). Thus successive queries for the branch instruction address will retrieve different jump addresses, based on their activation pattern.

The point of TFA is that if the dispatch loop contains a single indexed branch instruction, then the associative memory of the indirect branch predictor contains a single record, keyed by the address of that instruction. Depending on the CPU, the record will contain one or more past addresses, but it can predict correctly only a short periodic pattern of jump addresses, i.e. of opcodes used in dispatching.

This could work to predict correctly a short loop in the interpreted program, but typically most of the predictions will be wrong and each branch misprediction takes much more time than the execution of any instruction. In smaller and cheaper CPUs, which might store a single jump address per record, the correct predictions would be extremely rare, as they would happen only if the interpreted program contains repeated instructions.

In the optimized program, the indexed branch instruction is replicated into each "case" so now the associative memory will store separate records, one for each "case", containing the address or addresses of the next cases where execution has continued in the past.

Because after each interpreted instruction the probabilities of the next instructions are not equal, but some instructions are more likely to follow, that greatly increases the chances that the indirect branch predictor will make correct predictions from time to time.

In practice, branch target predictors are also used for conditional branches too, and even unconditional, direct jumps. Yes, the target can (sort of by definition) be computed from the instruction and the pc, but the by using the branch target buffer, you avoid pipeline bubbles in the correctly predicted case by only having a data dependency on the pc, and predicting the target while the instruction is still being fetched. That way on the next cycle you're already fetching from the predicted target while still decoding the branch instruction itself in the next stage(s).

I wonder are there any compilers that recognizes switch-in-loop pattern of interprets and optimizes it to be branch-predictor friendly using multiple indirect jumps?

I believe that Rust does this which implies that it might be common to the clang back end.

[flagged]

You really explain things in a way that's very easy to understand. I think I finally get it now. I understood the conditional branch part, but I think I was having trouble with the indirect branch predictor. So basically, when a branch instruction is executed, the CPU learns the actual target address. In the case of a switch statement, because the key is a single value like s, the history records get mixed up, whereas with a computed goto, since there are multiple keys, they don't get mixed up. Thank you for the clear explanation. It's a bit embarrassing that I didn't know this, but thanks to people like you, I feel like I'm learning more.