AMD Puts the Pedal to the Metal
The AMD Opteron represents a major shift in both platform design and processor architecture, two important pieces of the 64-bit puzzle. Last week, we surveyed the newly available processor models along with their hardware platforms and market positioning. This week, we’ll take a deeper look at what makes AMD’s new server and workstation CPU tick.
AMD has positioned the Opteron as the solution to many system needs, with the primary goal of providing a 64-bit physical architecture while supplying high-end performance for both 64- and 32-bit software. This translates into architectural advantages such as 64-bit data and address pathways, upgraded physical and virtual memory addressing, and a true 64-bit internal design.
The other main innovation has been to move key Northbridge functions from the system chipset directly into the Opteron core. These include a memory controller, multiprocessing control, and data flow, along with a bridge to peripheral data traffic. Traditional Southbridge and AGP components are still present in the Opteron architecture, but AMD’s eighth-generation processor has absconded with the main performance and CPU-centric duties.
Opteron Microarchitecture
The Opteron core resembles the basic design of the Athlon XP, but the move to a 64-bit architecture has brought some inherent advantages. The two chips share a few features, such as 64KB apiece of Level 1 data and instruction cache and three apiece of integer and floating-point units, but there are notable improvements elsewhere. In terms of basic features, the Opteron includes a full 1MB of Level 2 cache on the inside, along with an integrated heat spreader and new Socket 940 packaging on the outside.
Looking a bit deeper, AMD has improved on its seventh-generation design in other ways. A processor’s registers are like miniature cache areas where crucial data is stored and retrieved; the Opteron features eight more general-purpose registers, and these have been extended to 64 bits. AMD has also added eight 128-bit Streaming SIMD Extension (SSE) registers for multimedia instructions, as well as compatibility with the SSE2 instructions that premiered in Intel’s Pentium 4.
The chip’s translation look-aside buffers are larger and offer lower latencies than those of the Athlon XP. Branch prediction is also enhanced, including an increase to 16K bimodal/history counters, or four times the level found on the Athlon XP.
This last note is important, because in order to provide higher frequencies and better scalability, AMD has extended the Opteron pipelines. The Opteron features a 12-stage integer operation pipeline (versus 10 stages for the Athlon XP) and a 17-stage floating-point operation pipeline (versus 15 for the Athlon XP). While this pays dividends on higher potential clock speeds, it also incurs a risk of increased prediction misses, so AMD has adjusted the architecture to provide even higher pipeline efficiencies than the Athlon XP.
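To make the link between pipeline depth and branch prediction concrete, here is a minimal sketch of a textbook bimodal predictor built from 2-bit saturating counters. The 16K table size comes from the figures above; the 2-bit counter scheme, the class and function names, and the idea that a miss costs roughly one pipeline's worth of cycles are illustrative assumptions, not a description of AMD's actual circuitry.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch address.

    Hypothetical model: counter values 0-1 predict "not taken",
    values 2-3 predict "taken".
    """

    def __init__(self, entries=16 * 1024):
        self.entries = entries
        self.counters = [1] * entries  # start at "weakly not taken"

    def predict(self, address):
        return self.counters[address % self.entries] >= 2

    def update(self, address, taken):
        i = address % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, address, outcomes):
    """Replay a sequence of branch outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(address) != taken:
            misses += 1
        predictor.update(address, taken)
    return misses


# A loop branch that is taken nine times and then falls through, run ten times.
loop_outcomes = ([True] * 9 + [False]) * 10
misses = count_mispredictions(BimodalPredictor(), 0x4000, loop_outcomes)

# Crude assumption: each miss wastes about one integer pipeline's worth of cycles.
stall_cycles = {depth: misses * depth for depth in (10, 12)}
```

With the same number of misses, the 12-stage pipeline loses more cycles than the 10-stage one in this model, which is why a longer pipeline is usually paired with a better predictor.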
The Opteron also has built-in core logic to support multiprocessor systems without the need for a Northbridge chip. Internal CPU data traffic is all routed through a crossbar (XBAR) communications architecture, which shuttles command and data information between the CPU, memory controller, and three HyperTransport links. This is a huge technological leap for multiprocessor workstation and server designs, as it provides a true standard for OEMs to work with, and takes the Northbridge component out of the equation.
Dual-Channel Memory, More Or Less
The AMD Opteron includes an integrated memory controller, capable of supporting DDR200 through DDR333 speeds and a maximum of eight DIMM memory modules per processor. The controller provides up to 5.3GB/sec of memory bandwidth (with 333MHz DDR), yielding higher memory performance, lower memory latencies, and performance levels that can scale to processor frequencies.
Since each CPU has its own memory controller, memory bandwidth will also scale in multiprocessor systems. For example, a 2-way Opteron workstation will yield 10.6GB/sec of memory bandwidth, while a 4-way Opteron server will double this again to an incredible 21.3GB/sec, along with supporting up to 32 DDR DIMMs.
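The bandwidth figures above follow from simple arithmetic: bus width in bytes multiplied by the transfer rate, then multiplied by the number of processors. The sketch below reproduces them, assuming a 128-bit data path per controller and a nominal 333.33 million transfers per second for DDR333; the function and constant names are made up for illustration.

```python
def peak_bandwidth_gb_per_sec(bus_width_bits, transfers_per_sec):
    """Peak bandwidth = bytes per transfer x transfers per second."""
    return bus_width_bits / 8 * transfers_per_sec / 1e9


DDR333_TRANSFERS_PER_SEC = 333.33e6  # nominal rate, assumed for this sketch
DIMMS_PER_CPU = 8                    # maximum per processor, per the text

per_cpu = peak_bandwidth_gb_per_sec(128, DDR333_TRANSFERS_PER_SEC)

# Each processor brings its own controller, so bandwidth and DIMM count scale.
scaling = {
    cpus: (round(per_cpu * cpus, 2), DIMMS_PER_CPU * cpus)
    for cpus in (1, 2, 4)
}
```

Here `scaling[4]` comes out at roughly 21.33GB/sec with 32 DIMMs. The two-way result is about 10.67GB/sec, which the figure quoted above truncates to 10.6. These are theoretical peaks, not measured throughput.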
The Opteron’s integrated memory controller has been referred to as a dual-channel design, but this isn’t strictly accurate. It certainly delivers double the bandwidth of a single-channel controller, but does so by taking two 64-bit DDR modules and treating them as a single 128-bit DIMM with a corresponding 128-bit data path. This is similar to the design of Intel’s dual-channel DDR chipsets such as the E7205 and 875P, but different from the true dual-channel memory architecture of the Nvidia nForce2.
This is actually a smart call when it comes to building an integrated memory controller: for all intents and purposes, the bandwidth and performance are equivalent, but the 128-bit memory bus is more streamlined. In the Opteron architecture, there is no need for an arbiter chip to handle traffic along the dual physical memory channels, and no requirement for extra controller hardware. Of course, due to the “single-channel 128-bit” memory architecture, each pair of DDR modules must be matched in size, speed, and chip count, though not necessarily in manufacturer.
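The pairing rule can be expressed as a small check. This is only a sketch: the `Dimm` record and `can_pair` function are hypothetical names, and real modules are identified through their on-board configuration data rather than hand-entered fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimm:
    """Hypothetical description of one 64-bit DDR module."""
    size_mb: int
    speed: str          # for example "DDR333"
    chip_count: int
    manufacturer: str


def can_pair(a, b):
    """Two modules can form one 128-bit pair if size, speed, and chip count match.

    The manufacturer is deliberately ignored, as described above.
    """
    return (a.size_mb, a.speed, a.chip_count) == (b.size_mb, b.speed, b.chip_count)


first = Dimm(512, "DDR333", 16, "VendorA")
same_spec_other_vendor = Dimm(512, "DDR333", 16, "VendorB")
slower = Dimm(512, "DDR266", 16, "VendorA")
```

In this example `can_pair(first, same_spec_other_vendor)` holds, while `can_pair(first, slower)` does not.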
The AMD64 Platform
AMD’s ballyhooed AMD64 (née x86-64) platform represents a new class of computer, with the key element being native support for both 32- and 64-bit software. AMD performs this magic trick through different operating modes, depending on the operating system and applications being run.
At the top level comes LMA, or Long Mode Active, which signifies the presence of a 64-bit operating system that allows the Opteron to use its 64-bit extensions. With LMA enabled, the Opteron requires a fully 64-bit operating system, but supports 16- and 32- as well as 64-bit applications.
With LMA enabled, there are two submodes: a compatibility mode for 16- and 32-bit applications (again, under a 64-bit OS) and the full 64-bit mode, in which the Opteron struts its stuff with both a 64-bit operating system and 64-bit applications. When LMA is disabled, the Opteron runs like a standard or “legacy” x86 CPU, and is fully compatible with both 16-bit and 32-bit operating systems and software.
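The mode hierarchy can be summarized as a small decision function. In the AMD64 architecture, the choice between compatibility mode and 64-bit mode is made per code segment through a flag in the segment descriptor; that detail is not covered in this article, and the function and parameter names below are illustrative only.

```python
def operating_mode(lma, code_segment_is_64bit=False):
    """Map the long-mode state to the operating mode names used above.

    lma: True when Long Mode Active is set (a 64-bit OS is running).
    code_segment_is_64bit: whether the running code was marked 64-bit;
    only meaningful when lma is True.
    """
    if not lma:
        return "legacy mode"
    return "64-bit mode" if code_segment_is_64bit else "compatibility mode"


# Operating system and application widths each mode supports, per the text.
SUPPORTED = {
    "legacy mode": {"os_bits": (16, 32), "app_bits": (16, 32)},
    "compatibility mode": {"os_bits": (64,), "app_bits": (16, 32)},
    "64-bit mode": {"os_bits": (64,), "app_bits": (64,)},
}
```

For instance, a 32-bit application running under a 64-bit OS corresponds to `operating_mode(True, False)`, which is compatibility mode.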
The 64-bit mode has inherent performance advantages, but AMD suggests that 32-bit compatibility mode shows the most performance gain for existing software. With a 64-bit operating system, the Opteron is free to alleviate formerly OS-based memory constraints while still providing high performance for 16-, 32-, and 64-bit software.
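The memory constraint in question is a matter of address width: each extra address bit doubles the reachable memory. The sketch below works out the sizes. The 32-bit figure is the familiar 4GB ceiling; the 40-bit physical and 48-bit virtual widths are the first-generation Opteron figures as commonly documented, not numbers taken from this article, and the helper names are made up.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


def human(n_bytes):
    """Format a power-of-two byte count using binary (1024-based) units."""
    units = ["bytes", "KB", "MB", "GB", "TB"]
    value = n_bytes
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value} {unit}"
        value //= 1024


limits = {
    "32-bit addressing": human(addressable_bytes(32)),
    "40-bit physical (assumed Opteron width)": human(addressable_bytes(40)),
    "48-bit virtual (assumed Opteron width)": human(addressable_bytes(48)),
}
```

The results are 4 GB, 1 TB, and 256 TB respectively.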
Software developers are helping the transition to 64-bit operating systems through the use of nonuniform memory access — NUMA for short. This beneficial feature is a form of advanced multiprocessor data control and memory management that allows the OS to distribute processing and data transfers more efficiently, rather than constantly forcing data back and forth between CPUs and memory. This can reduce data traffic between CPUs, and increase overall system performance. Currently the AMD64-compatible SuSE Linux distribution includes NUMA, and Microsoft is incorporating it into the forthcoming AMD64 version of Windows Server 2003.
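A toy model shows why NUMA-aware placement cuts traffic between CPUs. Everything here is hypothetical: two processors, eight named pages, and two placement policies invented for the example. It counts how often a CPU touches a page held in the other CPU's memory, and says nothing about how SuSE Linux or Windows actually implement NUMA.

```python
def remote_accesses(accesses, page_home):
    """Count accesses where the CPU is not the one whose memory holds the page."""
    return sum(1 for cpu, page in accesses if page_home[page] != cpu)


# Two CPUs, each repeatedly touching its own four pages.
accesses = [(cpu, f"p{cpu}_{i}") for cpu in (0, 1) for i in range(4)] * 100

# NUMA-unaware placement: pages striped across both memories
# regardless of which CPU uses them.
pages = sorted({page for _, page in accesses})
striped = {page: n % 2 for n, page in enumerate(pages)}

# NUMA-aware placement: each page lives in the memory of the CPU that uses it.
local = {page: cpu for cpu, page in accesses}

striped_remote = remote_accesses(accesses, striped)
local_remote = remote_accesses(accesses, local)
```

In this contrived workload, striping sends half of the 800 accesses across the link between processors, while local placement sends none.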
HyperTransport: The Bridges of Opteron County
Next to its 32/64-bit balancing act, the star of the Opteron show is the HyperTransport input/output bus. The CPU incorporates three HyperTransport data links (two for communication between processors, one for the rest of the system), while this fall’s uniprocessor Athlon 64 will include only one, for the system bus. Because the Opteron is a multiprocessor design, it needs the two additional links for processor-to-processor traffic. The following diagrams show how XBAR and HyperTransport cooperate within one CPU, then scale to a four-way multiprocessor system.
Using HyperTransport has allowed AMD to eliminate the traditional front-side bus, and separate I/O data from memory data. This innovative design supplies what AMD refers to as “glueless multiprocessing”, or the ability to link multiple processors without external logic. Previously, this was performed through a motherboard chipset, but the Opteron includes the hardware on-chip, and uses HyperTransport to facilitate data transfers.
The dual HyperTransport links for CPU data are coherent links that share the memory and cache space between two processors. This is the basic architecture of any multiprocessor system, but in the case of the Opteron, HyperTransport — not a Northbridge chipset — provides the coherent memory/cache link between CPUs. The third HyperTransport pipe is a noncoherent link or simply a channel to get system data to and from the CPU. Again, this piece of the architecture will be featured on the Athlon 64 desktop processor, much as today’s nForce2 chipset incorporates HyperTransport as its system bus.
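With two coherent links per processor, a four-way system can be wired as a square in which each CPU talks directly to two neighbors. The sketch below assumes that square layout (the CPU numbering and the `hops` helper are made up for illustration) and uses a breadth-first search to count how many links a request crosses to reach another processor's memory.

```python
from collections import deque

# Assumed four-way layout: each CPU uses its two coherent links
# to reach two neighbors, forming a square.
FOUR_WAY = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}


def hops(links, src, dst):
    """Breadth-first search for the fewest links between two CPUs."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in links[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # not reachable


hop_table = {(a, b): hops(FOUR_WAY, a, b) for a in FOUR_WAY for b in FOUR_WAY}
```

In this layout a neighbor is one hop away and the diagonally opposite CPU is two, so no request crosses more than two links and none passes through an external chipset.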
HyperTransport offers many advantages over conventional solutions such as PCI. A single Opteron HyperTransport link offers a whopping 6.4GB/sec of aggregate data bandwidth (3.2GB/sec in each direction), which is top of the scale in terms of current x86 system bus specifications. Compatibility is also high, and PCI and PCI-X can coexist on a HyperTransport system bus while taking advantage of its ample bandwidth.
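The 6.4GB/sec figure can be rebuilt from the link parameters. The sketch below assumes a 16-bit link moving 1,600 million transfers per second in each direction, which is the commonly cited configuration for the Opteron but is not spelled out in this article, and compares it with standard 32-bit, 33MHz PCI. The function name is illustrative.

```python
def link_bandwidth_gb_per_sec(width_bits, transfers_per_sec, directions=1):
    """Peak bandwidth = bytes per transfer x transfers per second x directions."""
    return width_bits / 8 * transfers_per_sec * directions / 1e9


# Assumed Opteron link parameters: 16 bits wide, 1600 MT/sec, both directions.
hypertransport = link_bandwidth_gb_per_sec(16, 1600e6, directions=2)

# Conventional PCI: a shared 32-bit bus at a nominal 33.33MHz.
pci = link_bandwidth_gb_per_sec(32, 33.33e6)

ratio = hypertransport / pci
```

Under these assumptions one link carries about 48 times the peak bandwidth of a conventional PCI bus, which is why PCI and PCI-X bridges can hang off a single link with room to spare.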
HyperTransport is also inexpensive, based on a simple design, and highly scalable. These factors all contribute to a promising technology, especially given that platform designs badly need additional bandwidth now that Gigabit Ethernet, PCI-X, and other data-hungry hardware are all fighting for system resources.
Welcome to the New World
The Opteron is a refreshing change from the old x86 processor treadmill, and is actually the first CPU in recent memory to deserve the title of a next-generation product. Its hybrid 32/64-bit design ensures backward compatibility, while offering performance and platform advances when the need for 64-bit software takes hold, thereby making it a wise choice for the future. In addition to making an offer that IT managers may not be able to refuse, AMD has taken its destiny firmly in hand by incorporating critical Northbridge functions into the Opteron core — now less dependent on third-party chipsets, AMD can once again make its processor the star of the show.