It has been a while since I posted an article, so, let me begin with some bits of news:
1) I am particularly busy at my day job, doing high performance multithreaded software
2) I am evaluating Windows 7, about which I have bits and pieces for an article
3) I have studied nVidia's CUDA, about which I will be posting some articles too
4) After more than a decade in which all my new processors have been AMD, I switched to Intel: I bought a Nehalem, a Core i7 920.
This last item got me thinking and deserves a nostalgia trip. Here it is:
The last time I felt motivated to buy a new processor from Intel, it was a Celeron 300A because, if you remember, it was a Celeron in price but a Pentium II in performance. Intel made the mistake of launching the original Celerons as processors that underperformed even the existing Pentium MMX of the day while still requiring a much more expensive motherboard. They only made sense for people who wanted to eventually buy a Pentium II and were willing to put up, for a while, with a system slower than the old generation. Even that wasn't a good plan, because when the time came to replace the processor, how would you get any money for your spare Celeron?
The reason the original Celerons were so bad was that they didn't have any L2 cache. For all practical purposes they were a Pentium II inside, yet they were outperformed by the Pentium MMX because Pentium motherboards had about 512 KB of cache (external to the processor), whereas Pentium II (and Celeron) motherboards assumed the L2 cache was going to be in the processor and, because of the diminishing returns of a third cache level, did not have any cache at all. You may check these claims starting with the Wikipedia entry:
Then Intel sort of overkilled the problem with the first line of Celeron processors with cache: they were superb performers and overclocking kings at devalued Celeron prices.
Since there was a no-cache Celeron at 300 MHz (Covington architecture) and a Celeron with cache (Mendocino) also at 300 MHz, the latter was called the Celeron 300A, with the "A" to differentiate between them.
It turns out that in some benchmarks a Celeron 300A (66 MHz external bus), without overclocking!, was faster than the Pentium II 300 MHz (also on a 66 MHz external bus), because although the Pentium II had a 512 KB L2 cache, that cache ran at half the speed of the processor, while the 128 KB L2 cache of the Celeron 300A ran at full speed. The reason for this was that the L2 cache of the Pentium II was off-die, while the "Mendocino" Celeron cache was on-die (by the way, this makes it the first massively produced processor with an on-die L2 cache). It seems that the Mendocino architecture was a precursor of Intel's fully-on-die L2 cache Coppermine architecture (the Pentium III with 256 KB of L2 cache). If you remember, Coppermine trounced the Katmai Pentium III, which had double the L2 cache size, precisely because its cache ran at full speed. So the Celeron 300A versus the Pentium II was about the same thing: an L2 cache running at twice the speed, but 1/4 of the size.
As I was saying, I feel it was a precursor to the Coppermine because they all overclocked like crazy. And it was very easy to overclock them: you just forced the motherboard to use a bus frequency of 100 MHz and the processor went from the specified 300 MHz to 450 MHz.
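The jump from 300 MHz to 450 MHz follows from simple arithmetic if you assume the core clock is the bus clock times a fixed multiplier. Here is a minimal sketch of that calculation; the helper name and the exact 200/3 MHz value for the nominal "66 MHz" bus are my assumptions for illustration:

```python
from fractions import Fraction

# Assumption: core clock = bus clock x fixed multiplier, and the nominal
# "66 MHz" bus is really 200/3 MHz (about 66.67 MHz).
STOCK_BUS_MHZ = Fraction(200, 3)
STOCK_CORE_MHZ = Fraction(300)


def core_clock_mhz(bus_mhz, multiplier):
    """Hypothetical helper: core clock in MHz for a given bus clock."""
    return bus_mhz * multiplier


# Multiplier implied by the stock settings (300 MHz core on a 66 MHz bus).
multiplier = STOCK_CORE_MHZ / STOCK_BUS_MHZ

# Same multiplier, but with the motherboard forced to a 100 MHz bus.
overclocked_mhz = core_clock_mhz(Fraction(100), multiplier)

print(f"implied multiplier: {float(multiplier)}x")
print(f"core clock on a 100 MHz bus: {float(overclocked_mhz)} MHz")
```

Exact fractions are used only so that the 66.67 MHz rounding does not leak into the result.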
I gave that processor to a friend of mine who, I understand, is still running it today, at 450 MHz.
Before that, I was very happy with my Pentium MMX 166 MHz, a great processor; and at the time I didn't like any of the cheapo AMD K5, K6 and K6-2.
I gave up on the Celeron 300A because I bought an Athlon 550 MHz and got blown away by its performance and also its performance/price ratio, which was then improved further by its "Duron" siblings. Meanwhile, Intel came out with the ragingly fast eunuch that was the Pentium 4, which was routinely outperformed by a properly configured Athlon. Although a Pentium 4 may sometimes have been faster, the price difference still didn't justify it.
So it was Athlon/Duron for everything. I bought a used laptop (with an Intel processor, but this doesn't count because it wasn't a new processor), then a Mobile Sempron laptop because it had the best performance/price ratio. Then, when I wanted a new laptop, I took the hard decision of buying a Turion 64 over a Pentium M only because I wanted to move to 64 bits as soon as possible. As I have said before, AMD64 (or Intel's EM64T) should help compilers make the same programs run about 5% faster: things like the doubling of the architectural register file (from 8 registers to 16) should more than compensate for the extra memory inefficiency of 64-bit code.
Over the years, the performance and performance/price ratio of AMD processors kept going up while Intel's kept going down, with the exception of the Pentium M series.
As we all know, Intel came back with the Core µ-architecture (Core 2 Duo) and demonstrated that plain ole P6 still had a lot of life in it, but it wasn't convincing enough for me to switch. Further down the road came Nehalem, and it was more than enough for me to switch, probably for a very long while.
The Core i7 920, priced at less than $300, is actually very cheap for what it is, but it forces you to pay premium prices for the motherboard (currently there is only one chipset for it, the X58 from Intel) and to buy DDR3 memory before its price has come down to commodity levels. These really hurt the performance/price ratio you can get from a Nehalem. What really pushed me over the edge is that although this is just a quad-core, it really feels like an octo-core. Nehalems have "Hyperthreading", that is, they run two instruction streams per core. As far as I have seen, Hyperthreading REALLY works on Nehalem.
If you remember, when Intel introduced Hyperthreading, it didn't work, people hated it, and Intel abandoned it for the subsequent µ-architectures until Nehalem. I will explain in a new article why it didn't work and why it now works, but there are still a few things I need to explain about Nehalem:
Intuitively, feeding all the data needed by 8 threads executing at full blast requires quite a lot of bandwidth and low latencies; that's why I have been so skeptical even about Intel's current double-duals, which only need to feed 4 threads. Also, I perceived the whole FSB approach as a dead end, and I wasn't going to reward Intel's technical mistakes with my money by acquiring an FSB-based motherboard. Just to illustrate the point, following the article at
you can see that a nominally 1333 MHz FSB (which actually runs at 333 MHz, "quad pumped", that is, capable of 4 transfers per cycle) with a 64-bit (8 bytes) width can only give 8 * 4 * 333 E6 bytes/sec, or 10.656 GB/sec, while DDR2-800 memory in dual channel gives 2 * 8 * 800 E6 bytes/sec, or 12.8 GB/sec. This means that even the 1333 MHz FSB can't fully use a dual channel DDR2-800 configuration! I am not really sure that I am right about this, but it seems to me that official pages such as
that claim bandwidths over 10.656 GB/sec are a lie; at least I am not able to explain how a 64-bit wide FSB at 1333 MHz (nominal) could do it. Even recent benchmarking pages such as
with measurements from the SANDRA memory bandwidth benchmark show that FSB-based systems can't go above 7.5 GB/sec, even quad-core Extreme models using DDR3 and a 1600 MHz FSB. Interestingly enough, you can see that all the AMD models have higher measured bandwidth than any FSB-based system.
tells us, AMD's integrated memory controller began with the socket 754 processors, featuring 3.2 GB/sec, which was a real number you could get in practice. These were then superseded by socket 939 processors with support for dual channel DDR-400 (6.4 GB/sec), which in turn were superseded by socket AM2 with support for up to 12.8 GB/sec.
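The peak figures quoted above all come from the same formula: transfers per second, times bytes per transfer, times number of channels. Here is a small sketch that reproduces them. The function name is hypothetical, 1 GB is taken as 10^9 bytes, and I am assuming single channel DDR-400 for socket 754 and dual channel DDR2-800 for socket AM2, since those configurations match the 3.2 and 12.8 GB/sec numbers:

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_bytes=8, channels=1):
    """Hypothetical helper: theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return mega_transfers_per_s * bus_bytes * channels / 1000


# FSB: 333 MHz base clock, quad pumped (4 transfers per cycle), 8 bytes wide.
fsb_1333 = peak_bandwidth_gb_s(4 * 333)

# Memory side of the same system: DDR2-800 in dual channel.
ddr2_800_dual = peak_bandwidth_gb_s(800, channels=2)

# AMD integrated memory controllers (memory configurations are assumptions).
socket_754 = peak_bandwidth_gb_s(400)              # single channel DDR-400
socket_939 = peak_bandwidth_gb_s(400, channels=2)  # dual channel DDR-400
socket_am2 = peak_bandwidth_gb_s(800, channels=2)  # dual channel DDR2-800

for name, value in [
    ("FSB 1333", fsb_1333),
    ("DDR2-800 dual channel", ddr2_800_dual),
    ("Socket 754", socket_754),
    ("Socket 939", socket_939),
    ("Socket AM2", socket_am2),
]:
    print(f"{name}: {value:.3f} GB/s")

# The FSB is the bottleneck whenever it is slower than the memory behind it.
print("FSB limits dual channel DDR2-800:", fsb_1333 < ddr2_800_dual)
```

These are theoretical peaks only; measured numbers, like the SANDRA results mentioned above, come out lower.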
As you know, my Core i7 920 runs at 2.66 GHz and is capable of driving DDR3 memory at 1066 MHz of nominal speed, but in a triple channel configuration, giving it 8*3*1066 MB/sec, that is, 25.584 GB/sec of bandwidth, without the FSB contention or FSB latency problems of the double duals; surely enough to feed all 8 threads. Also, as this domestic screen capture shows:
The cache hierarchy is right: 32KB+32KB L1, 256KB L2 and 8MB shared L3. This gives a per-thread progression of 32, 128 and 1024 KB, of which the greatest step is the third level, where the data is shared. The L1 and L2 sizes look small, to tell you the truth, but if you really have 8 threads running, chances are that some of them are running the same algorithm in a data-partitioned scheme, in which case they will be sharing data, which improves the effectiveness of the L3 cache.
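To make the numbers above concrete, here is a rough sketch of how the per-thread cache progression and the triple channel bandwidth can be derived. I am assuming that L1 (instruction plus data) and L2 are private to each core and split evenly between its two hyperthreads, and that the L3 is split evenly among all 8 threads; the variable names are mine:

```python
CORES = 4
THREADS_PER_CORE = 2  # Hyperthreading
THREADS = CORES * THREADS_PER_CORE

# Cache sizes in KB. L1 and L2 are per core; L3 is shared by the whole chip.
L1_PER_CORE_KB = 32 + 32  # instruction + data
L2_PER_CORE_KB = 256
L3_SHARED_KB = 8 * 1024

# Assumption: an even split of each level among the threads that use it.
per_thread_kb = [
    L1_PER_CORE_KB // THREADS_PER_CORE,
    L2_PER_CORE_KB // THREADS_PER_CORE,
    L3_SHARED_KB // THREADS,
]

# Growth factor from each level to the next; the L2 -> L3 step is the largest.
steps = [b // a for a, b in zip(per_thread_kb, per_thread_kb[1:])]

# Triple channel DDR3-1066: 8 bytes x 3 channels x 1066 MT/s, in GB/s.
triple_channel_gb_s = 8 * 3 * 1066 / 1000

print("per-thread cache (KB):", per_thread_kb)
print("growth between levels:", steps)
print(f"triple channel peak: {triple_channel_gb_s:.3f} GB/s")
```

The even split is only a way of reading the sizes; in practice the threads compete for the caches dynamically.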
Since Intel is now putting 3D acceleration in its processors (by the way, sooner than AMD's Fusion) and, especially important for me, 8-core systems, I think I won't have any interest in AMD processors for a while. Not even in curiosities like the triple cores, which disappointed me because they consume as much power as an X4, or may have errors like the L3 TLB bug...