An Introduction to Registers, Caches, Buses, and Chipsets
It doesn’t matter if a CPU runs at 300MHz or 3.0GHz — if it isn’t given any data to process, it’s as useless as a printer waiting for you to refill the paper tray. That’s why, while it may be the brains of the operation, the processor is only one component of a high-performance PC; the most important supporting architecture is the CPU/memory subsystem.
There’s a hierarchy or spectrum of data storage areas between the CPU and system memory, from fastest to slowest or “closest” to “furthest away” (and extending still further, to data that’s not even in memory but must be fetched from a relatively far-off, glacially slow hard disk). The front lines of any CPU access request are the data registers, which are high-speed, temporary storage areas within the CPU itself.
On the Chip: Registers and Cache
Registers hold data to be processed, the results of calculations, or addresses pointing to the location of desired data; they’re of varying number, type, and size depending on the CPU design. For example, the Pentium 4 has 32 registers, split into four groups of eight (for x86 compatibility) that range from 32 to 128 bits in size and are split between various data types and tasks.
The CPU can act on data in the registers virtually instantaneously, but the registers are far too small to hold all the data required. The register file is also one of the most expensive areas of a CPU to implement, so it’s extremely rare for a processor redesign to stop at simply adding more internal registers (although it’s a tantalizing possibility; at least in theory, you could create a hyper-threading CPU by doubling the registers and possibly the cache while leaving the CPU core mostly untouched).
In order to provide an intermediary between the CPU registers and slower system memory, modern processors include varying amounts of data buffer or cache memory. Larger and slower than the registers, CPU caches are temporary storage areas usually broken down into Level 1 (L1), Level 2 (L2), and sometimes Level 3 (L3) caches, getting slower and less expensive as you move outward.
Most current desktop and notebook processors have an L1 cache at core level and an L2 cache on die (elsewhere on the chip). Where present (usually in servers), Level 3 cache can be either on-chip — as in Intel’s Itanium 2 — or off-chip, on the motherboard or an expansion card.
Most desktop processors follow a standard configuration pairing a single L1 and L2 cache, though there are variations such as the forthcoming AMD Hammer’s separate L1 instruction and data caches. AMD’s Athlon XP and Intel’s Pentium 4 share the standard L1/L2 design, though in very different ways: The current P4 combines an 8K Level 1 (plus tiny execution-trace) cache and 512K Level 2 cache (older versions had only 256K of L2), while the Athlon XP has a 128K Level 1 and 256K Level 2 configuration.
At first glance, the Athlon XP would seem to have the more robust design, especially given the small 8K L1 cache of its Intel rival. Looking at cache size is only part of the story, however, as the Pentium 4’s internal cache has a broad 256-bit data pathway, while the Athlon XP has only 64-bit pathways.
To use the two essential buzzwords, the Pentium 4’s smaller Level 1 cache allows it to maintain low cache latency (response time), but also yields a lower hit rate (the percentage of time a requested piece of data is in the cache, ready for access). Conversely, the Athlon XP’s 128K Level 1 cache will have a higher hit rate, but can’t match the Intel chip’s L1 cache for latency.
This story is somewhat reversed when looking at Level 2 cache, as both theoretical design and hands-on testing show the Pentium 4’s 512K cache is more adept at handling larger data sets than the Athlon XP’s 256K cache. Successfully predicting which data the CPU will want next can also yield performance gains, and both internal (branch prediction) and external (data prefetch) mechanisms play a role.
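The latency/hit-rate tradeoff described above can be quantified with the standard average-memory-access-time (AMAT) formula. The cycle counts and hit rates below are illustrative assumptions for the sake of the sketch, not measured Pentium 4 or Athlon XP figures:

```python
def amat(hit_time, hit_rate, miss_penalty):
    """Average memory access time in cycles: every access pays
    hit_time; the fraction that misses also pays miss_penalty."""
    return hit_time + (1.0 - hit_rate) * miss_penalty

# Hypothetical numbers: a small, fast L1 (2-cycle access, 90% hit rate)
# versus a larger, slower L1 (3-cycle access, 95% hit rate), both
# backed by a 20-cycle penalty for falling through to L2.
small_fast = amat(hit_time=2, hit_rate=0.90, miss_penalty=20)  # 2 + 0.10 * 20
large_slow = amat(hit_time=3, hit_rate=0.95, miss_penalty=20)  # 3 + 0.05 * 20
```

With these made-up numbers the two designs come out even (4 cycles apiece), which is exactly the point: neither a small-and-fast nor a big-and-slower cache wins outright, and the better choice depends on the workload’s access patterns.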
The middle ground between the CPU and vast but slow disk storage, of course, is system memory. The main determinant of memory performance or responsiveness is the speed and width of the CPU bus, followed by the memory bus and memory speed and type.
This is only logical, since (to go back to the printer analogy) a laser that can print 20 pages a minute won’t be well-matched with a paper feeder that can deliver either 10 or 30. This is the reason the old experiment of pairing a 133MHz-system-bus Pentium III with DDR266 memory didn’t yield noticeable performance gains, while matching today’s 533MHz-bus Pentium 4 with DDR266 is almost criminal, leaving the processor starved for memory bandwidth (but the P4 responds handsomely when you replace DDR266 with DDR333 or DDR400 memory).
The goal for fast memory performance is to match the speed of the CPU bus to that of the memory bus. This is called synchronous operation, as when a 266MHz-bus Athlon XP 2200+ runs on a DDR266 platform. Running asynchronously can also work, but there are performance tradeoffs; for instance, loading that 266MHz-bus Athlon XP with DDR333 or DDR400 is pretty much wasting money on faster, more expensive memory.
There are ways around this limitation, and some are as simple as AMD’s recent step up to a 333MHz system bus for the Athlon XP 2700+ and 2800+. This resulted in a 5- to 15-percent gain in overall system performance, based simply on the higher memory bandwidth and its synchronous pairing with DDR333. The asynchronous setup of current Pentium 4 DDR platforms is not a problem per se, but even DDR400 memory (maximum 3.2GB/sec) cannot supply the bandwidth that a synchronous mating of the Pentium 4’s 533MHz bus and a PC1066 RDRAM platform can (4.2GB/sec).
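The bandwidth figures quoted throughout fall out of one simple formula: peak bandwidth equals path width in bytes times effective transfer rate times the number of channels. A minimal sketch (the function name is mine; the inputs are the widths and speeds given in the text):

```python
def peak_bandwidth_mb(path_width_bits, effective_mhz, channels=1):
    """Theoretical peak bandwidth in MB/sec:
    bytes per transfer * million transfers per second * channels."""
    return (path_width_bits // 8) * effective_mhz * channels

ddr266 = peak_bandwidth_mb(64, 266)             # 2128 MB/s, i.e. ~2.1GB/sec
ddr400 = peak_bandwidth_mb(64, 400)             # 3200 MB/s, i.e. 3.2GB/sec
pc1066 = peak_bandwidth_mb(16, 1066, channels=2)  # 4264 MB/s, i.e. ~4.2GB/sec
```

These are theoretical peaks; real-world throughput is lower, which is why memory timings and controller design (discussed below) matter so much.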
Where the Chipset Comes In
Enhancing the Northbridge component of the motherboard chipset is also a popular strategy, as this is the hub that coordinates memory traffic and can hence help or hinder overall performance.
Standard Northbridge memory controllers use a single path to system memory, which in the case of DDR translates into a 64-bit link running at effective speeds of 200MHz to 400MHz. A different Northbridge can’t do much about theoretical bandwidth, but improvements can be made through tighter memory timings and lower access latencies. This is one area VIA Technologies has really pushed with its KT series of AMD chipsets; the “performance-oriented” design that emerged after the original KT266 has paid real dividends in memory throughput for the company’s KT266A through KT400 products.
A more fundamental change is to implement a dual-channel design into the memory controller, thereby providing two data paths between the system memory and chipset. This really is as simple as it sounds, but the actual designs can be very different. For example, the Intel 850E uses a dual-channel link to RDRAM memory, but due to the nature of the latter, the initial design called for two 16-bit paths and required RDRAM modules to be installed in pairs. Newer RIMM 4200 (32-bit) RDRAM changed this limitation and supports single-module use.
Due to its smaller 16/32-bit data pathways, RDRAM needs to run at higher speeds than DDR to keep up — a single channel of even exotic PC1066 RDRAM can only match the 2.1GB/sec bandwidth of standard DDR266, thanks to the latter’s 64-bit path. That is why the i850’s dual-channel format (doubling bandwidth to 4.2GB/sec) has been so integral to the performance success of the Pentium 4 RDRAM platform — and also why a dual-channel DDR memory controller, yielding a 128-bit memory path, is a dream of the performance crowd.
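To see why RDRAM’s narrow path forces such high clock rates, you can solve for the effective clock a 16-bit channel needs just to match a 64-bit DDR channel. This is a back-of-envelope illustration; the helper function is my own naming:

```python
def clock_to_match(target_width_bits, target_mhz, own_width_bits):
    """Effective clock (MHz) a narrower path needs in order to match
    the peak bandwidth of a wider path at a given speed."""
    return target_width_bits * target_mhz / own_width_bits

# A 16-bit RDRAM channel must run four times faster than 64-bit
# DDR266 just to break even on bandwidth:
needed = clock_to_match(64, 266, 16)  # 1064.0 MHz -- roughly PC1066's data rate
```

That break-even point is exactly why a single PC1066 channel only ties DDR266, and why doubling up channels (RDRAM today, dual-channel DDR soon) is where the real gains come from.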
Nvidia’s nForce and nForce2 chipsets offer dual-channel DDR support for the AMD camp, while Intel is expected to unveil a dual-channel DDR Pentium 4 chipset (you may have heard the codename Granite Bay) soon.
The speed of the CPU/memory subsystem will continue to increase as time goes by. AMD’s forthcoming Hammer/Opteron processors will incorporate a dual-channel DDR memory controller onto the chip itself, doing away with an external Northbridge controller, while dual-DDR chipsets from many vendors will proliferate and CPU caches grow ever larger (AMD’s “Barton” variant will double the Athlon XP’s Level 2 cache to 512K, while Intel’s “Banias” and “Prescott” are expected to raise the L2 ante to 1MB). With CPU clock rates rising ever higher, it’s all memory and chipset manufacturers can do to keep pace.