Saturday, July 28, 2007

The Return of Hyper-Threading (Technology)

When one sees comparisons between the upcoming AMD processors and Nehalem, many people seem to forget that Nehalem will mark the return of Hyper-Threading Technology (HTT for short). I know that the first instinct of some readers will be to laugh and say that HTT failed before and will fail again, but bear with me as I try to show you why this may be more important than you think.

HTT is just the commercial name Intel gives to their implementation of SMT (Simultaneous Multithreading). SMT was first researched at IBM in 1968, and was first designed into a commercial microprocessor in the Alpha EV8 (which was cancelled before release). Other processors with SMT include IBM’s POWER5 and Sun’s Niagara. So now you see HTT is not a dumb concept invented by Intel.

When Intel released the first implementations of HTT, depending on the application you could get anywhere from a 30% speedup to a 30% slowdown (yes, a slowdown). Later designs improved the situation, yielding gains from 0% to 30% depending on the workload (in both cases compared with a P4 with HTT disabled). Intel indicated that, in the P4, the increase in silicon area to include HTT was around 5%.

So, why did Intel incorporate HTT into its processors? A few of the reasons are:
* The P4 had really long pipelines (way too long in my opinion, but that is easy to say now in 2007). By putting HTT there, the processor could do useful work while some execution units were idle, masking the long-pipeline problem a little. Oh, and memory latency is always a problem for any microprocessor manufacturer.
* Intel is interested in selling more microprocessors, so even if they did not believe in multicore at the time, they would have loved to sell something like AMD’s 4x4. But without the software makers getting serious about threading, that was not possible. HTT was a cheap way to get “multiprocessing” to the masses, to let people see the benefits, and to give developers a test bed of sorts for multithreaded code (a minimal example of such code follows this list). In fact, HTT is credited with finally forcing Creative Labs to fix the SoundBlaster drivers to work on machines with more than one processor (real or virtual).
* To have another feature to list in the brochures in the fight against AMD (nyah nyah, we have SMT and you don’t!).
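To make the “test bed” point concrete, here is a minimal sketch (in C with POSIX threads) of the kind of multithreaded code HTT let developers experiment with. The array size and the half-and-half work split are purely illustrative.

```c
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int data[N];

/* Each thread sums one half of the array; on an HTT machine the OS
   can run the two threads on the two logical processors. */
static void *partial_sum(void *arg) {
    long lo = (long)arg, hi = lo + N / 2, s = 0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    return (void *)s;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1;

    pthread_t t1, t2;
    void *s1, *s2;
    pthread_create(&t1, NULL, partial_sum, (void *)0L);
    pthread_create(&t2, NULL, partial_sum, (void *)(long)(N / 2));
    pthread_join(t1, &s1);
    pthread_join(t2, &s2);
    printf("sum = %ld\n", (long)s1 + (long)s2);   /* prints 1000000 */
    return 0;
}
```

Build with something like `gcc -O2 -pthread sum.c`; whether the two threads actually overlap is entirely up to the scheduler.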

So, if SMT is a good idea, used in many other processors aside from Intel’s, and if there were good reasons to launch it, why did it become a joke?

Let’s explore what can fail in an HTT implementation.

There are factors which are not under Intel’s control, like applications not being designed to profit from HTT, or the OS scheduler doing a very dumb job of assigning threads to the real and virtual processors.
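Software can work around a dumb scheduler by pinning threads itself. Here is a sketch assuming Linux and the GNU pthread_setaffinity_np extension; the caveat is that which logical CPU numbers share a physical core is system-dependent, so the numbering is an assumption you would have to verify (on early HTT boxes, CPUs 0 and 1 were often the two logical halves of the same core).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one logical CPU, so that two busy threads
   do not end up sharing one physical core's execution units.  Returns
   0 on success.  NOTE: which cpu numbers map to which physical core
   is an assumption; check it for the machine at hand. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```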

There are others that are, more or less, under Intel’s control:
A.) The memory bandwidth may not be enough, “starving” the microprocessor.
B.) There can be “cache thrashing”, or plain cache insufficiency (the sketch after this list shows the effect).
C.) There may not be enough execution units in the processor to handle both threads, leading to contention.
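To see ‘B’ in action, here is a hedged microbenchmark sketch: two threads each walk their own buffer over and over. While the combined working set fits in the cache the logical processors share, both threads mostly hit; grow BUF_KB past roughly half the cache size and they start evicting each other’s lines. The sizes below are illustrative guesses, not measured thresholds.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BUF_KB  512   /* per-thread buffer; tune against your cache */
#define PASSES  1000

/* Touch one byte per 64-byte cache line, over the whole buffer. */
static void *walker(void *arg) {
    volatile char *buf = arg;
    long sum = 0;
    for (int p = 0; p < PASSES; p++)
        for (long i = 0; i < BUF_KB * 1024L; i += 64)
            sum += buf[i];
    return (void *)sum;
}

int main(void) {
    static char a[BUF_KB * 1024], b[BUF_KB * 1024];
    memset(a, 1, sizeof a);
    memset(b, 2, sizeof b);

    pthread_t t1, t2;
    pthread_create(&t1, NULL, walker, a);
    pthread_create(&t2, NULL, walker, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done; time this with one thread, then with two");
    return 0;
}
```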

Solving ‘A’ is easy. The brute-force approach would be to increase the FSB frequency. The elegant approach would be to design a better memory access method and better inter-processor communication, like what AMD did by integrating the memory controller and using coherent HyperTransport, or what Intel plans to do in Nehalem with CSI and an integrated memory controller as well.

Solving ‘B’ is again easy. The brute-force approach is to have a bigger cache. The elegant solution is to increase the set associativity of the cache (going from 4-way set associative to 8-way, or 16-way).
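To make the associativity point concrete, here is a small worked example assuming the P4’s published L2 geometry (512 KB, 8-way, 64-byte lines). With 1024 sets, addresses 64 KB apart land in the same set, so a ninth hot line at that stride forces an eviction; at the same total size, doubling the ways doubles how many such lines can coexist.

```c
#include <stdio.h>

/* Assumed geometry: 512 KB, 8-way, 64-byte lines.
   sets = size / (line * ways) = 512K / (64 * 8) = 1024,
   so addresses 1024 * 64 = 64 KB apart collide in one set. */
#define LINE  64
#define WAYS  8
#define SIZE  (512 * 1024)
#define SETS  (SIZE / (LINE * WAYS))

static unsigned set_index(unsigned long addr) {
    return (addr / LINE) % SETS;
}

int main(void) {
    printf("sets: %d\n", SETS);                         /* 1024 */
    printf("set of 0x00000: %u\n", set_index(0x00000)); /* 0 */
    printf("set of 0x10000: %u\n", set_index(0x10000)); /* 0: same set, 64 KB apart */
    return 0;
}
```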

Solving ‘C’, on the other hand, requires a brand new processor.

Now look at the differences between a P4 and a Nehalem.

Where the P4 had an FSB at around 533 MHz at the time of HTT’s introduction, and no integrated memory controller, Nehalem will have CSI and an IMC. And even if it still had an FSB, FSBs are now in 1600 MHz territory; try to imagine where the FSB would be at Nehalem’s debut.

Where the P4 had a mere 512 KB of cache, 8-way set associative, Penryn will have around 6 MB (shared between two cores). And since the set associativity of the current Core architecture is 16-way, one can only wonder what the cache size and associativity of Nehalem will be.

Where the P4 had four execution ports, the current Core architecture has six, and one has to wonder how many Nehalem will have.

Couple that with Dynamic Acceleration Technology and try to imagine one of your cores lying dormant while the other real core, now accelerated, serves your single-threaded (i.e. gaming) needs, and its virtual core serves your OS and ancillary needs. If, on top of that, the needs are different at the micro-instruction level (the game is doing heavy SIMD while the OS is doing mostly integer work), so much the better.

Finally, Linux has adapted to HTT and multicore, Mac OS X as well (it has handled multiprocessor machines since the PowerPC days), and so will Windows by the time Nehalem debuts.

The analysts and reviewers had better begin figuring out how much of a speedup the reintroduction of HTT will bring, and factor it into their comparisons.