Tag-Team Processing Offers Supercomputing on a Shoestring
Sometimes one processor, no matter how powerful, just isn’t enough. In fact, sometimes not even two Opterons or four Itaniums will get the job done. If you want to model seismic activity along the West Coast to predict when “the big one” is coming, or track every economic indicator in the U.S. to see whether the dollar will weaken against the yen, your job involves so many variables and such complicated algorithms that not even the largest single server can handle it.
In this case, you have two choices: invest in a supercomputer, which takes years to build and costs millions of dollars, or string together a series of less powerful systems and achieve nearly the same performance for a fraction of the cost.
The National Center for Supercomputing Applications (NCSA) is taking the second path as it tries to determine nothing less than the origin of the universe: Rather than a single, huge supercomputer, the Illinois-based center is building a cluster. This fall, NCSA will connect more than 1,280 Dell PowerEdge 1750 servers, each with two Intel Xeon CPUs, running Red Hat Linux and linked by Myricom’s Myrinet 2000 interconnect technology. The result should yield a peak performance of 17.7 teraflops (trillion floating-point operations per second), fast enough, as of this writing, to rank as the third most powerful supercomputer on Earth. And you can order all of the parts online.
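A quick back-of-the-envelope check puts those numbers in perspective. The node and CPU counts and the 17.7-teraflop figure come straight from NCSA’s plans; the per-CPU number is simply derived from them:

```python
# Back-of-the-envelope check of the cluster's quoted peak performance.
# Node count, CPUs per node, and peak teraflops are the figures given
# above; the per-CPU result is derived arithmetic, nothing more.
nodes = 1280          # Dell PowerEdge 1750 servers
cpus_per_node = 2     # Intel Xeon CPUs per server
peak_tflops = 17.7    # quoted peak, trillions of FLOPS

total_cpus = nodes * cpus_per_node
per_cpu_gflops = peak_tflops * 1e12 / total_cpus / 1e9
print(f"{total_cpus} CPUs, ~{per_cpu_gflops:.1f} GFLOPS peak per CPU")
```

In other words, the headline teraflop figure works out to roughly 7 billion operations per second from each off-the-shelf Xeon.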
The Promise of Clustering
The clustering approach is important for several reasons. First, it provides a cost-effective way to get more computing power from existing PCs and workstations. Second, it’s one of the few tools available to tackle ultra-complex problems such as weather and economic modeling.
Finally, it makes supercomputing-class power more available to businesses and institutions worldwide. When you think of supercomputing, do you think of the Cray brand? If you check out Top500, which tracks the strongest supercomputers in the world, you’ll find it barely in the top 50 — increasingly eclipsed by clusters of systems using commercially available Xeon and Itanium processors.
Clustering means different things to different people, but at the most basic level it involves connecting multiple computers so they work as a single system — while multiprocessing involves two or more CPUs in one machine, clustering involves two or more machines (each of which, in turn, may be a multiprocessing system).
The most common goals of clustering are load balancing and ensuring high-availability computing. On the latter score, despite the demise of the dot-com economy, there are still thousands of dot-coms that want to keep their sites up 99.99 percent of the time, which allows for less than an hour of downtime per year. With clustered systems, a backup is always running in real time, ready to take over the moment a failure occurs. In a so-called Web farm, if one server’s CPU overheats or its hard disk crashes, the other servers in the cluster proceed without skipping a beat.
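The downtime budget behind an availability target like “four nines” is easy to compute. This snippet derives it for the 99.99 percent figure:

```python
# Downtime allowed per year by an availability target.
# 99.99 percent ("four nines") is the figure discussed above.
availability = 0.9999
minutes_per_year = 365.25 * 24 * 60

downtime_minutes = (1 - availability) * minutes_per_year
print(f"{downtime_minutes:.0f} minutes of downtime per year")
```

Four nines leaves a site roughly 53 minutes of outage per year, which is why a hot standby in the cluster, rather than a human paged in the middle of the night, has to absorb the failure.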
IT managers can also set up scripts for load balancing, so that if one server gets overworked and starts to slow down, another can pick up the slack. Clustering lets companies scale their networks as they grow: a business might start with a relatively small, four-system server cluster, then add more systems to handle the load as traffic increases, without changing its entire architecture.
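The two ideas, failover and load balancing, can be sketched in a few lines. This toy dispatcher rotates requests round-robin across a four-system cluster while skipping any server marked down; the server names and health flags are hypothetical, and a real balancer would get them from monitoring agents:

```python
import itertools

# Toy round-robin dispatcher for a four-system Web farm.
# Server names and health flags are made-up placeholders; real load
# balancers learn health from probes, not a hard-coded dict. Note that
# dispatch() would spin forever if every server were down, so a real
# implementation needs an all-dead escape hatch.
servers = ["web1", "web2", "web3", "web4"]
healthy = {"web1": True, "web2": False, "web3": True, "web4": True}

rotation = itertools.cycle(servers)

def dispatch():
    """Return the next healthy server in the rotation."""
    for candidate in rotation:
        if healthy[candidate]:
            return candidate

assignments = [dispatch() for _ in range(5)]
print(assignments)  # ['web1', 'web3', 'web4', 'web1', 'web3']
```

Adding a fifth server to scale up is just one more entry in the lists, which is the architectural point: capacity grows without redesign.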
Cluster vs. Grid
Although these workday tasks are clustering’s greatest hits, another application often gets more press: grid computing. The two terms are often used interchangeably — both involve multiple systems working together to carry out a similar set of functions — but there are differences. You can think of a cluster as grid computing under one roof: One company or department sets up a cluster and controls the whole, usually localized or centralized, system.
Grid computing is more far-reaching; individual systems can be added or subtracted without a central control. What’s more, miles can separate grid participants as long as there’s a network connection between them. An example on a massive — nay, cosmic — scale is the SETI@Home project, which enlists PC users all over the Internet to download a screen saver that uses extra clock cycles to sort through radio telescope data in search of signs of life in deep space.
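The SETI@Home pattern, split a huge job into independent work units, process each one wherever spare cycles exist, then merge the partial results, can be shown in miniature. The “computation” here is a stand-in sum over fake sample data, not anything resembling SETI’s actual signal analysis:

```python
# Grid-style division of labor in miniature: cut one big job into
# independent work units, process each separately (on a real grid,
# different machines would each take a unit), then combine the results.
# The sample data and the work itself are stand-ins for illustration.
data = list(range(1_000_000))        # pretend radio-telescope samples
chunk_size = 250_000
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def work_unit(samples):
    # Each grid participant runs this on its own chunk, independently.
    return sum(samples)

partials = [work_unit(c) for c in chunks]  # in parallel, across the grid
total = sum(partials)                      # results combined later
print(total)
```

What makes this grid-friendly is that no work unit needs to talk to any other; the only communication is handing out chunks and collecting answers.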
The People’s Cluster: Beowulf
One of the earliest examples of clustering was 1994’s Beowulf Project, which connected 16 Intel DX4-based PCs via 10Mbps Ethernet; one PC acted as the master and user interface, with the others serving as slaves used solely for computation. Faster CPUs and networking technologies have been plugged into the same framework, which remains popular today as a sort of compromise between massively parallel processing and mere networks of workstations whose nodes may be available for other tasks.
While Beowulf technology is open-source, commercial versions that simplify the installation and configuration process are available from companies ranging from HP to Scyld Computing Corp. and Northrop Grumman. Recently, AMD announced that Scyld’s cluster OS will be customized to support forthcoming Opteron server processors, allowing 64-bit clusters using an enhanced Linux kernel.
Beowulf-like clustering can be used to create a system using standard PC, server, and workstation components that rivals the muscle of a supercomputer for tasks like scientific investigations and sophisticated modeling. Although some companies like IBM are trying to apply grid technology to symmetric multiprocessing (SMP) environments to run enterprise applications, for now grid computing is best at tasks that involve SETI@home-style parallel processing — jobs like modeling or genome sequencing that can be divided into parts and distributed across the grid, with results to be combined later. Accessing a single database isn’t well suited to a cluster, but complex data mining is.
Putting It Together
Clustering is often thought of as a Linux or Unix application, but Windows XP and Mac OS can be used in clusters, too; Web-server clusters often use off-the-shelf operating systems and software. More scientifically oriented applications tend to be written for whatever platform they run on.
Practically every computer company you can name, from IBM and Sun to HP and Microsoft, has invested in clustered solutions. According to IDC Research, Dell is the current market leader in x86 supercomputer clusters, with revenue of $65 million last year; IBM is close behind with $60 million and HP earned $48 million. Even Apple now offers a cluster-ready version of its Xserve server with dual 1.33GHz PowerPC G4 processors and a Gigabit Ethernet port.
To get the best performance from any cluster, it helps to use the fastest available CPUs and connections between systems, but clusters by nature are more than the sum of their parts. Just as grid computing solutions can accept the computational contributions of big servers and humble desktops alike, old Pentium IIIs and PowerPC chips have been strung together to create some very affordable high-performance computing systems. So think twice before you throw out that old PC; depending on the applications you need to run, it might be reborn as part of a cluster.