April 30, 2003 Opteron: Under the Hood By Vince Freeman
AMD Puts the Pedal to the Metal
The AMD Opteron represents a major shift in platform design and processor architecture -- both important pieces of the 64-bit puzzle. Last week, we surveyed the newly available processor models, their hardware platforms, and their market positioning. This week, we'll take a deeper look at what makes AMD's new server and workstation CPU tick.
AMD has positioned the Opteron as the solution to many system needs, with the primary goal of providing a 64-bit physical architecture while supplying high-end performance for both 64- and 32-bit software. This translates into architectural advantages such as 64-bit data and address pathways, upgraded physical and virtual memory addressing, and a true 64-bit internal design.
The other main innovation has been to move key Northbridge functions from the system chipset directly into the Opteron core. These include a memory controller, multiprocessing control, and data flow, along with a bridge to peripheral data traffic. Traditional Southbridge and AGP components are still present in the Opteron architecture, but AMD's eighth-generation processor has absconded with the main performance and CPU-centric duties.
Opteron Microarchitecture
The Opteron core resembles the basic design of the Athlon XP, but the move to a 64-bit architecture has brought some inherent advantages. Both the Opteron and Athlon XP contain a few similar features, such as 64K apiece of Level 1 data and instruction cache and three apiece of integer and floating-point units, but there have been some noted improvements elsewhere. In terms of basic features, the Opteron includes a full 1MB of Level 2 cache on the inside, along with an integrated heat spreader and new Socket 940 packaging on the outside.
Looking a bit deeper, AMD has improved on its seventh-generation design in other ways. A processor's registers are like miniature cache areas where crucial data is stored and retrieved; the Opteron features eight more general-purpose registers, and these have been extended to 64 bits. AMD has also added eight 128-bit Streaming SIMD Extension (SSE) registers for multimedia instructions, as well as compatibility with the SSE2 instructions that premiered in Intel's Pentium 4.
The chip's translation look-aside buffers are larger and offer lower latencies than those of the Athlon XP. Branch prediction is also enhanced, including an increase to 16K bimodal/history counters, or four times the level found on the Athlon XP.
This last note is important, because in order to provide higher frequencies and better scalability, AMD has extended the Opteron pipelines. The Opteron features a 12-stage integer operation pipeline (versus 10 stages for the Athlon XP) and a 17-stage floating-point operation pipeline (versus 15 for the Athlon XP). While this pays dividends in higher potential clock speeds, a longer pipeline also makes each branch misprediction more expensive, so AMD has adjusted the architecture to provide even higher pipeline efficiencies than the Athlon XP.
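The trade-off between pipeline depth and prediction accuracy can be sketched with a simple back-of-the-envelope model. The snippet below is only an illustration: it assumes the cost of a mispredicted branch scales with the number of integer pipeline stages, and the branch fraction and miss rate are made-up example values, not AMD figures. Only the 10- and 12-stage depths come from the text above.

```python
# Rough model: cycles lost to mispredicted branches, per instruction.
# ASSUMPTION: the flush penalty is proportional to integer pipeline depth.

ATHLON_XP_INT_STAGES = 10   # from the article
OPTERON_INT_STAGES = 12     # from the article

# ASSUMED illustrative workload values (not measured data).
BRANCH_FRACTION = 0.20      # share of instructions that are branches
OLD_MISS_RATE = 0.05        # share of branches that are mispredicted


def mispredict_cycles_per_instruction(branch_fraction, miss_rate, penalty_cycles):
    """Average cycles lost to branch mispredictions per instruction."""
    return branch_fraction * miss_rate * penalty_cycles


def break_even_miss_rate(old_miss_rate, old_depth, new_depth):
    """Miss rate the deeper pipeline needs so it loses no more cycles
    than the shallower one, under the proportional-penalty assumption."""
    return old_miss_rate * old_depth / new_depth


old_cost = mispredict_cycles_per_instruction(
    BRANCH_FRACTION, OLD_MISS_RATE, ATHLON_XP_INT_STAGES)
same_predictor_cost = mispredict_cycles_per_instruction(
    BRANCH_FRACTION, OLD_MISS_RATE, OPTERON_INT_STAGES)
target_rate = break_even_miss_rate(
    OLD_MISS_RATE, ATHLON_XP_INT_STAGES, OPTERON_INT_STAGES)

print(f"10-stage pipeline, cycles lost per instruction: {old_cost:.3f}")
print(f"12-stage pipeline, same predictor:              {same_predictor_cost:.3f}")
print(f"Miss rate needed to break even at 12 stages:    {target_rate:.4f}")
```

Under these assumptions, a deeper pipeline with an unchanged predictor loses more cycles per instruction, which is why a larger set of prediction counters goes hand in hand with the longer pipeline.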
The Opteron also has built-in core logic to support multiprocessor systems without the need for a Northbridge chip. Internal CPU data traffic is all routed through a crossbar (XBAR) communications architecture, which shuttles command and data information between the CPU, memory controller, and three HyperTransport links. This is a huge technological leap for multiprocessor workstation and server designs, as it provides a true standard for OEMs to work with, and takes the Northbridge component out of the equation.
Dual-Channel Memory, More Or Less
The AMD Opteron includes an integrated memory controller, capable of supporting DDR200 through DDR333 speeds and a maximum of eight DIMM memory modules per processor. The controller provides up to 5.3GB/sec of memory bandwidth (with 333MHz DDR), yielding higher memory performance, lower memory latencies, and performance levels that can scale to processor frequencies.
Since each CPU has its own memory controller, memory bandwidth will also scale in multiprocessor systems. For example, a 2-way Opteron workstation will yield 10.6GB/sec of memory bandwidth, while a 4-way Opteron server will double this again to an incredible 21.3GB/sec, along with supporting up to 32 DDR DIMMs.
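For readers who want to check the arithmetic, here is a minimal sketch of where these numbers come from. It assumes peak bandwidth is simply the 128-bit bus width multiplied by the DDR transfer rate, and that each additional processor adds one more controller's worth of bandwidth and DIMM slots. The function and constant names are illustrative only.

```python
# Peak theoretical memory bandwidth for an Opteron-style system.
# ASSUMPTION: peak = bus width in bytes x transfers per second, per CPU.

BUS_WIDTH_BITS = 128   # two 64-bit DDR modules treated as one 128-bit path
DIMMS_PER_CPU = 8      # maximum DIMMs per processor, from the article

# DDR transfer rates in millions of transfers per second.
DDR_RATES_MT_S = {
    "DDR200": 200.0,
    "DDR266": 800.0 / 3,    # 133.33MHz clock, two transfers per clock
    "DDR333": 1000.0 / 3,   # 166.67MHz clock, two transfers per clock
}


def peak_bandwidth_gb_s(rate_mt_s, cpus=1):
    """Peak bandwidth in GB/sec (1GB = 10**9 bytes) across all CPUs."""
    bytes_per_transfer = BUS_WIDTH_BITS // 8
    return bytes_per_transfer * rate_mt_s * 1e6 * cpus / 1e9


def max_dimms(cpus):
    """Maximum number of DIMMs the whole system can hold."""
    return DIMMS_PER_CPU * cpus


for cpus in (1, 2, 4):
    bandwidth = peak_bandwidth_gb_s(DDR_RATES_MT_S["DDR333"], cpus)
    print(f"{cpus}-way, DDR333: {bandwidth:.2f} GB/sec, "
          f"up to {max_dimms(cpus)} DIMMs")
```

The results agree with the figures quoted above up to rounding. These are theoretical peaks; sustained bandwidth in a real system will be lower.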
The Opteron's integrated memory controller has been referred to as a dual-channel design, but this isn't the exact truth. It certainly delivers double the bandwidth of a single-channel controller, but does so by taking two 64-bit DDR modules and viewing them as a single 128-bit DIMM with a corresponding 128-bit data path. This is similar to the design of Intel's dual-channel DDR chipsets such as the E7205 and 875P, but different from the true dual-channel memory architecture of the Nvidia nForce2.
This is actually a smart call when it comes to building an integrated memory controller, as for all intents and purposes, the bandwidth and performance are equivalent, but the 128-bit memory bus is more streamlined. In the Opteron architecture, there is no need for an arbiter chip to handle traffic along the dual physical memory channels, and no requirement for extra controller hardware. Of course, due to the "single-channel 128-bit" memory architecture, the pairs of DDR modules must be matched in size, speed, and chip-count, though not necessarily in manufacturer.
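The pairing rule can be expressed as a short check. This is a hypothetical sketch, not anything AMD or a BIOS vendor ships: the `Dimm` record, its field names, and the vendor labels are invented for illustration, and the only rule encoded is the one stated above (size, speed, and chip count must match; manufacturer may differ).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimm:
    """Hypothetical description of one DDR module."""
    size_mb: int
    speed: str          # e.g. "DDR333"
    chip_count: int
    manufacturer: str   # not part of the matching rule


# Attributes that must be identical within a 128-bit pair.
MATCHED_FIELDS = ("size_mb", "speed", "chip_count")


def pair_mismatches(first, second):
    """Return the names of the attributes on which two modules differ."""
    return [name for name in MATCHED_FIELDS
            if getattr(first, name) != getattr(second, name)]


def is_matched_pair(first, second):
    """True if the two modules can be treated as a single 128-bit DIMM."""
    return not pair_mismatches(first, second)


a = Dimm(size_mb=512, speed="DDR333", chip_count=16, manufacturer="Vendor A")
b = Dimm(size_mb=512, speed="DDR333", chip_count=16, manufacturer="Vendor B")
c = Dimm(size_mb=256, speed="DDR333", chip_count=16, manufacturer="Vendor A")

print(is_matched_pair(a, b))   # different manufacturer only
print(pair_mismatches(a, c))   # differing size
```

Returning the list of mismatched attributes, rather than a bare yes or no, makes it easier to tell which module in a pair needs to be swapped.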