Sunday, July 09, 2006

AMD64 practical: XP 64 + VMWare Server

There are many inconsistencies and issues with 64 bits in the Windows world, drivers in particular, that prevent people from enjoying the full capacity of their computers. But there is a practical solution.

First, why would anyone bother to install an AMD64 operating system? Because, unfortunately, the only way to get AMD64 functionality is for the operating system to run in "long" mode, that is, native 64 bits.

Now, most people would say that AMD64 functionality is irrelevant if you don't have more than 4 GB in your computer, and I will keep repeating that such an opinion is totally wrong, because AMD64 is inherently faster:

  • The same applications should run faster after a mere recompilation to 64 bits, because AMD64 offers twice the number of general-purpose registers: 16 instead of 8. That means the processor can work simultaneously with twice the number of temporary values. Furthermore, in x86, the stack pointer and the frame pointer ("Base Pointer" in Intel's x86 nomenclature) are among the 8 GPRs, so applications really have only 6 GPRs; in AMD64 they have 14, more than twice the effective number (learn more about registers at this footnote).
    This feature has many positive consequences:

    • It reduces memory traffic (which is good, because there are latencies associated with accessing even the L1 cache), since the values are already there, in registers that the compiler or programmer can allocate at compile time with more intelligence than the silicon dispatcher or scratchpad register manager can on the fly.

    • It allows more silicon optimizations. Since the values are in registers, they are readily available to all the speculative, reordering, branch prediction, etc., optimizations. If those values were in memory, even in L1 cache, these optimizations couldn't even keep track of them, much less actually factor them out of the critical execution path.

    • It is easier for both the compiler and the programmer to write optimal software. Consider a calculator that can hold only one value: most of your time will be spent writing aside the temporary values that you need to remember (as already explained in the first item). If you could instead just leave the temporary values in the calculator and recall them when needed, your life would be not only easier but also less error prone. Exactly the same happens with compilers, programmers, and a large register file.

  • If you have a fairly powerful computer, say with 2 GB or more of RAM, chances are that you are already hitting the 4 GB wall, because of virtual memory. The problem is that Windows reserves 2 GB of address space for itself (1 GB if you tweak the system), so applications fight over the remaining 2 GB (or 3 GB). When Windows "feels" that it is running out of memory, it gets rid of disk-cache space and swaps out memory blocks; that is, it replaces fast cached memory accesses with many more slow direct reads (and writes) to the filesystem and to the page file on the hard disk.

  • What if you need data sets larger than 2 GB (3 GB), or simply want to give Windows the luxury of 2 GB of disk cache? Then either the operating system or your applications, or both, will have to deal with the non-trivial complexities of "Physical Address Extension" (PAE). If you are old enough to remember the early nineties, you will recall the horrible nightmares of configuring DOS's low, high, extended, and expanded memory; the 4 GB barrier is the dreaded 640 KB limit redux. As things stand today, there aren't many incentives to put more than 4 gigs in a compy, even though the additional memory could really speed things up.

  • Finally, if you do integer calculations wider than 32 bits, either doing them in the floating-point unit or ripple-chaining pairs of 32-bit operations is two to eight times slower than native 64-bit arithmetic.

A 32 bit processor is a 32 bit processor, and the patches to support 64-bit features are just that, patches: buggy, error-prone, difficult-to-understand things that should be a last recourse. Computer technology should be approached looking at the future, not towing the past into the present.

If you are smart, you must be wondering: aren't there some tradeoffs, a "price to be paid" for these benefits? Some people have asked me whether programs become larger (less memory efficient) at 64 bits. The answer is: practically not. AMD64 applications may choose whether or not to default to 32-bit integers and 32-bit pointers. If they default to 32 bits, there is no space overhead other than the truly negligible cost of passing some 64-bit parameters on the stack and the 64-bit return addresses of system calls.

The AMD64 encoding is almost the same as that of 32-bit x86, which is tight. The "trick" of extending the architecturally visible register file is accomplished by devoting the single-byte encodings of the register INCs and DECs to prefixes: the opcodes from 0x40 to 0x4F become the famous REX prefixes, which act as mode (64-bit/default) and register-class (old/new) selectors for the up to three registers that can participate in an instruction (register, base, and index), at an overhead of one byte, and only on some occasions. It is worth noting that it is still possible to do "INCs" and "DECs" of registers using the MOD R/M form (that is, instead of the single byte 0x40, "INC EAX", the two-byte encoding 0xFF 0xC0 remains available).

Thus there is practically zero impact in the coding efficiency for AMD64 applications.

The problem with AMD64 is that Intel never wanted it to succeed, for obvious reasons, and thus never really tried to make it work seamlessly in its processors; and Microsoft felt too complacent with XP and 32 bits. So the brunt of making this technology mainstream fell upon puny AMD and unintimidating Free Software projects, with unforgivable marketing incompetence on the part of AMD, which failed to promote the advantages I mention in this article and instead allowed the general public perception to become the misleading "64 bits is only relevant for monsters with more than 4 GB, not my gaming machine".

But the Free Software world wasn't incompetent. Linux and other Free Software projects leveraged their intrinsic advantage of being easy to recompile, rebuilding themselves for AMD64 and obtaining all the expected benefits. That made Microsoft scramble to get on board before the ship departed, and with Microsoft providing support, Intel had no choice but to get on board too and take Yamhill (EM64T) out of the closet.

The problem is that neither Intel nor Microsoft really took the x86-64 thing seriously. The latter was late with XP-64, even though an XP for AMD64 was relatively trivial to produce, and has been failing miserably at guaranteeing acceptance by not getting all the device drivers that helped Windows XP become so popular ported to 64 bits.

As for Intel, it is commonly accepted that programs run slower under EM64T than at 32 bits, contradicting what I pointed out earlier in the article; but the reason may be very simple: Intel may have implemented EM64T in microcode, so 64-bit instructions may be sort of "emulated", and naturally much slower. Thanks to the 64-bit advantages, some compensation is obtained and the step down in performance is not so pronounced.

With neither Microsoft nor Intel seriously supporting 64-bit technologies, the rest of the market has followed suit, and adoption has been lacklustre.

Thus, AMD processor owners are on their own when it comes to reaping the advantages of 64 bits in a practical way.

The solution to this problem is simple, practical, free of charge, and comes with a number of positive side effects. Microsoft is giving away a 120-day free trial of XP 64-bit, so you have a practical way to test the waters before making commitments; then leverage Free and Open Source software. In Windows, important applications such as Java, 7-zip, POV-Ray, Daemon Tools, and even a free (of charge) antivirus such as Avast! Home edition (as can be confirmed here) come in 64-bit versions. The rule of thumb is to leverage Free and Open Source Software as much as possible, because FOSS has the advantages mentioned above that make porting applications to AMD64 easy.

To fill the gap that still remains, you can put a 32-bit computer inside your empowered 64-bit computer by installing the free (of charge) VMWare Server and preparing a 32-bit Windows XP virtual computer. It will only take some shared hard disk space, and whenever you need the 32-bit computer, its operation will draw only 320 MB of RAM (or whatever you choose).

I have a 3800+ X2 with 1 GB of DDR2-800 in dual channel, on top of which the virtual computers run, and my virtual XP-32 machine runs faster than my Turion. Furthermore, I haven't experienced any lack of responsiveness in my host XP-64 (the real computer) while the virtual machine is doing heavy stuff such as installing Windows updates; and this was before I changed the configuration, when the virtual machine had a virtual hard disk mapped to a 6 GB file on my SATA2 (300 MB/s) hard disk formatted as NTFS.

So, if I don't want to deal with the hassle of finding a 64 bit driver, I just fire up my 32-bit XP virtual computer and deal with whatever needs it at 32 bits in the virtual computer.

In case you don't know, the virtual computer runs as an application in your host, without even being aware that it is a virtual computer. The guest (virtual) computer has a virtual network adapter that in reality bridges to the host's (the real computer's) network connection, but from the rest of the network it looks like any other computer. The guest may also use the hardware installed in the host, such as hard disks, optical disks, even USB devices. Since both the host and guest computers are the same hardware, you don't need to dedicate too much hard disk space to the guest; just provide it with enough space for the operating system and essential applications. The rest, for instance space to do DVD transcoding, may come from the host through the virtual network, sharing the host's hard disk space.

Using a virtual computer has other advantages. For instance, I wouldn't install the "Gordian Knot Codec Pack" to transcode DVDs to XviD and DivX on any "serious" computer, because it really messes up the operating system configuration with all the codecs and tape-and-bubble-gum applications glued together; but if it is a virtual computer specifically prepared for that, then there is no problem: if it corrupts the whole operating system, well, tough luck, I will spend 5 minutes restoring the last snapshot ;-) Another use is to install testing software, or even software trials. In case you are wondering, the video transcoding operations ran so fast in the virtual computer that I never bothered to benchmark how they would have run in the host. One of these days I will set up an XP 64-bit guest to check whether the applications I use to transcode run flawlessly in XP-64, and perhaps install them one by one in the host.

All these experiments were so successful that I opened my wallet to buy another monitor, so that I could maximize the virtual computer window on the secondary monitor, and dedicated a (P)ATA-133 hard drive to the guests, to have "portable virtual machines" and to increase my total hard disk bandwidth. Now I get the feeling of having powerful computers side by side: one at 64 bits, the important one, and the other(s) at 32 bits for quick and dirty stuff.

Currently, I fire up a guest with Knoppix every time I need Linux, and I am preparing a Linux-from-scratch to make it exactly the way I would like it, but this project will take a while at an average of 5 minutes of work per day ;-)

Since I am already tuned into the virtualization wave, I am enthusiastic about the virtualization features that AM2 brings and the further developments of Pacifica/Presidio, which are already vastly superior to Intel's Vanderpool because the integrated memory controller makes it possible to virtualize memory and I/O (my article on the subject). Once Woodcrest shows up for real and induces a slide in AMD processor prices, I will buy a dual-socket dual-core Opteron compy specifically to run virtualization and to centralize all of my hardware.

x86 and AMD64 registers: According to their names and binary code ordering, they are: Accumulator (AX), Counter (CX), Data (DX), Base (BX), Stack Pointer (SP), Base Pointer (BP), Source Index (SI), and Destination Index (DI). Initially these names were significant because each of these registers had unique roles, but with the cleanup of the instruction set in the transition to 32 bits they became almost uniform in properties, although ESP and EBP (the "E" prefix distinguishes the 32-bit form from the 16-bit one) naturally continue to have very defined and inflexible roles.


Anonymous said...

"...I would keep repeating myself saying that such an opinion is totally wrong, because AMD64 is inherently faster."

How much performance gain can one expect for typical home usage (non-server) applications?

howling2929 said...

Around 30% for a 64-bit application versus the same application at 32 bits, depending on your processor and on how the applications were compiled (pure AMD64, pure Intel 64, conditional execution, or least common denominator).

The overall gains will depend on your application mix. This assumes a non-virtualized environment. Of course, YMMV.

Depending on your virtualization solution, you can use some hardware in the virtualized environment for which you have no 64-bit drivers (think printers, scanners, and cameras, for instance), but for some core hardware (chipset, USB, video chipset), if you do not have the 64-bit drivers you are screwed, and no amount of VMWare will solve it. Do as chicagrafo says: get the 120-day trial, and see if your machine is supported...

Now, in another note:

Down memory lane, think back to the early days of the Pentium Pro and Win32. If you ran 16-bit applications on a Pentium Pro, they ran SLOWER than the same applications on a Pentium Classic. The Pentium Pro being a server and workstation chip, this was not such a big problem.

The "bug" (more a design flaw, really) was fixed in the Pentium II (the Pentium II was essentially a Pentium Pro with MMX, the 16-bit code issue fixed, and no multi-chip package, but the dreaded slot).

32-bit applications will stay with us a long time, so be certain that your old stuff works well before committing to any of the 64-bit solutions (Intel's or AMD's). I do not know how, but I will make Reversi for Windows 3.0 work in Vista, just like it is working on Win2k for me now!

Anonymous said...

Are there some specific benchmarks you can point me to about the 30%? (again not interested in server applications)

I ask because I've had trouble finding them, and also, given the immaturity of Windows XP 64-bit and the ridiculous memory usage forecast for Vista, I would prefer not to upgrade to 64-bit unless it provides significant value.

Eddie said...

5% performance improvement.

The reason is that even though you may have a 64 bit application, some library, or some DLL, may still be at 32 bits, and processing is not the bottleneck in today's systems.

The important thing is not how much a benchmark says you gain, the important thing is that you are going to obtain increasing gains as long as more memory is needed for every application, the data sets are larger, and even the calculations need more than 32 bits.

Anonymous said...

Thanks Eddie

While 5% is nice, it is by no means an amount that will drive me to 64-bit anytime soon (I probably wouldn't even notice it). I can get 5% from upgrading a CPU, which will likely be cheaper than whatever MS eventually decides to charge for Vista plus the added RAM.