
An Overview of the IBM Power 775 Supercomputer Water Cooling System

Author and Article Information
Michael J. Ellsworth

Senior Technical Staff Member, Fellow ASME, Advanced Thermal Laboratory, IBM Corporation, Poughkeepsie, NY 12601; mje@us.ibm.com

Gary F. Goth

Senior Technical Staff Member, Thermal Engineering and Technologies, IBM Corporation, Poughkeepsie, NY 12601; gfgoth@us.ibm.com

Randy J. Zoodsma

Senior Engineer, Thermal Engineering and Technologies, IBM Corporation, Poughkeepsie, NY 12601; rzoodsma@us.ibm.com

Amilcar Arvelo

Advisory Engineer, Thermal Engineering and Technologies, IBM Corporation, Poughkeepsie, NY 12601; arvelo@us.ibm.com

Levi A. Campbell

Advisory Engineer, Advanced Thermal Laboratory, IBM Corporation, Poughkeepsie, NY 12601; levic@us.ibm.com

William J. Anderl

Advisory Engineer, Mechanical Design and Cooling, Midrange Systems, IBM Corporation, Rochester, MN 55901; wja@us.ibm.com

J. Electron. Packag. 134(2), 020906 (Jun 11, 2012) (9 pages); doi:10.1115/1.4006140. History: Received August 01, 2011; Revised February 13, 2012; Published June 11, 2012; Online June 11, 2012.

In 2008 IBM reintroduced water cooling technology into its high performance computing platform, the Power 575 Supercomputing node/system. Water cooled cold plates were used to cool the processor modules, which represented about half of the total system (rack) heat load. An air-to-liquid heat exchanger was also mounted in the rear door of the rack to remove a significant fraction of the other half of the rack heat load: the heat load to air. The next generation of this platform, the Power 775 Supercomputing node/system, is a monumental leap forward in computing performance and energy efficiency. The computer node and system were designed from the start with water cooling in mind. The result is a system with greater than 96% of its heat load conducted directly to water, which, together with a rear door heat exchanger, removes 100% of its heat load to water with no requirement for room air conditioning. In addition to the processor, the memory, power conversion, and I/O electronics conduct their heat to water. Included within the framework of the system is a disk storage unit (disk enclosure) containing an interboard air-to-water heat exchanger. This paper gives an overview of the water cooling system, featuring the water conditioning unit and rack manifolds. Advances in technology over this system's predecessor are highlighted. An overview of the cooling assemblies within the server drawer (i.e., the central electronics complex), the disk enclosure, and the centralized (bulk) power conversion system is also given. Furthermore, techniques to enhance performance and energy efficiency are described.

The meteoric rise in the cooling requirements of commercial computer products has been driven by an exponential increase in microprocessor performance over the last decade. Almost all of the electrical energy consumed by the chip package is released into the surroundings as heat, which places an enormous burden on the cooling system. Existing cooling technology typically utilizes air to carry the heat away from the chip and reject it to the ambient. Heat sinks with heat pipes or vapor chambers are the most commonly used air-cooling devices. Such air cooling techniques are inherently limited in their ability to extract heat from semiconductor devices with high heat fluxes and to carry heat away from server nodes with high power densities. In addition to these heat transfer performance limitations, air cooling is also more energy intensive: exorbitant amounts of pumping power can be consumed in the cooling of extremely high power server nodes, especially when the inlet air temperatures are high. Thus, the need to cool current and future high heat load, high heat flux electronics mandates the development of extremely aggressive and highly energy efficient thermal management techniques, such as liquid cooling using cold plate devices.

Liquid cooling is not, in and of itself, a new technology. The need to further increase packaging density and reduce the signal delay between communicating circuits led to the development of multichip modules, beginning in the late 1970s. The heat flux associated with bipolar circuit technologies increased steadily from the very beginning and rose sharply in the 1980s [1]. IBM determined that the most effective way to manage chip temperatures in these systems was through the use of indirect water cooling [2]. Several other mainframe manufacturers came to the same conclusion [3-8].

The decision to switch from bipolar to complementary metal oxide semiconductor based circuit technology in the early 1990s led to a significant reduction in power dissipation and a return to totally air-cooled machines. However, this was but a brief respite, as power and packaging density rapidly increased, matching and then exceeding those of the earlier bipolar machines. These increased packaging densities and power levels have resulted in unprecedented cooling demands at the package, system, and data center levels, requiring a return to water cooling.

In 2008 IBM introduced the water cooled Power 575 Supercomputing Node, which is packaged in a super-dense 2U (88.9 mm) form factor [1,9]. Users can deploy up to 14 server nodes in a single frame (or rack), with 16 processor modules within each node operating at 4.7 GHz. A fully configured system can dissipate as much as 72 kW, with 80% of the heat load going to water. There were three important reasons to use water cooling in lieu of air cooling. First, a 34% increase in processor frequency resulted in a 33% increase in system performance over an air cooled equivalent node; this performance increase could not have been achieved in a 2U (3.5 in. or 88.9 mm) form factor with traditional air cooling. Second, the processor chips run at least 20 °C cooler when water cooled, which results in decreased gate leakage current and, hence, better power performance. Finally, energy consumption within the data center is significantly reduced because a large fraction of the computer room air conditioning units (CRACs) are replaced by more efficient water conditioning units (WCUs) that supply water coolant to the nodes and to a rack level air-to-liquid heat exchanger, which removes a substantial fraction of the heat load exhausted to the air. A study comparing equivalent computer performance of air cooled versus water cooled clusters concluded that the power required to transfer the cluster's heat dissipation to the outside ambient was 45% lower for the water cooled cluster [10].

The Power 775 supercomputer was first introduced in 2009 [11] and is scheduled to be generally available in 2011. The fundamental computer building block is the server (CEC) drawer. Figure 1 shows the CEC physical structure (layout). The server drawer package size is approximately 94 mm high (a little more than 2U) × 874 mm wide × 1372 mm deep. A fully configured server drawer consists of 8 quad chip modules (QCMs), 128 custom SuperNova dual in-line memory modules (DIMMs), 8 HUB modules (switches), 17 PCI Express card slots, and 2 distributed converter and control assemblies (DCCAs). The QCM is a high performance glass ceramic (HPGC) module with four Power7 8-core processor chips operating at a frequency of 3.84 GHz. Each DIMM carries a total of 80 DRAMs plus two SuperNova memory control chips. The maximum supported DIMM capacity is 32 GB (4 GB per processor core). The HUB module provides the switching function between the QCMs and the PCI-E cards, between QCMs within the server drawer, and between server drawers through fiber optic connections. The optical connections take place at two levels. The first level connects four server drawers together to constitute a super node. The second level connects up to 512 super nodes together. This switch fabric is therefore able to connect up to 524,288 processor cores for a peak performance of up to 64 PFlops. The fully redundant DCCAs convert 350 VDC power from the bulk power assembly (described in the bulk power section) to an intermediate power level (approximately 45 VDC) and distribute it to voltage transformation modules (VTMs) mounted on the server drawer board. The VTMs in turn convert this voltage to multiple voltage levels ranging from around 1 VDC to 5 VDC.
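The maximum core count quoted above follows directly from the stated topology. The short Python tally below simply multiplies out the counts given in the text (cores per chip, chips per QCM, QCMs per drawer, drawers per super node, super nodes per system); it is an illustrative check, not a sizing tool.

```python
# Core-count tally for a maximally scaled Power 775 system, using only the
# counts stated in the text.
CORES_PER_CHIP = 8          # Power7 processor chip
CHIPS_PER_QCM = 4           # quad chip module
QCMS_PER_DRAWER = 8         # per server (CEC) drawer
DRAWERS_PER_SUPER_NODE = 4  # first-level optical interconnect
MAX_SUPER_NODES = 512       # second-level optical interconnect

cores_per_drawer = CORES_PER_CHIP * CHIPS_PER_QCM * QCMS_PER_DRAWER      # 256
max_cores = cores_per_drawer * DRAWERS_PER_SUPER_NODE * MAX_SUPER_NODES  # 524,288
print(f"cores per drawer: {cores_per_drawer}, maximum cores: {max_cores:,}")
```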

Up to 12 server drawers (3 super nodes) are housed in a frame or rack, as shown in Fig. 2. The rack, along with its covers and doors (front and rear), measures 990.6 mm (39 in.) wide by 2133.6 mm (84 in.) tall by 1828.8 mm (72 in.) deep. A centralized (bulk) power and control system is housed at the top of the rack. This power system takes in three-phase 200–480 VAC or 380–520 VDC through four independent connections (lines), converts it to 350 VDC, and distributes it to the electronics within the rack. Two bulk power assemblies (BPAs) are utilized for redundancy and full concurrent maintenance. Complementing the server drawers is a storage disk enclosure (described in a later section). Storage intensive configurations comprising up to six disk enclosures and two server drawers are available. Up to four water conditioning units (WCUs) are housed at the bottom of the rack; two or three WCUs are used for lesser rack configurations.

The power dissipation for this system under aggressive workloads is expected to reach 180 kW. While this is an unprecedented rack level heat load, the computer performance is equally unprecedented at a theoretical peak of 93 TFlops. To put this into perspective, a nearly equivalent system in 2005, the ASCI Purple [12], required 285 times the floor space and almost 19 times the power (Table 1). An equivalent Power 575 cluster in 2008 required 28.6× the floor space and 4.8× the power.

From the outset, the objective was to transfer 100% of the heat generated by the electronics within the rack directly to the facilities (building) chilled water. Stated differently, no heat was to be removed the more conventional way, via air cooling with heat rejection to room air conditioning units in the data center. To accomplish this, as much of the electronics as possible was to be cooled via conduction to water cooled cold plates. As a result, in excess of 96% of the heat dissipated within the rack is conduction/cold plate cooled. The remaining heat (up to 4%) that is air cooled is removed to water by an air-to-liquid fin and tube heat exchanger mounted in the rear rack door (RDHx). The RDHx is similar in construction and function to that used in the Power 575 water cooling system. In the absence of room air conditioning, the RDHx serves as the room air conditioning, thus regulating the room temperature.
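As a rough illustration of this heat budget, the sketch below splits a nominal rack load into the cold plate and air-side (RDHx) fractions. The 96%/4% split is from the text; the 180 kW figure is the expected maximum rack dissipation discussed above, used here only as an example input.

```python
# Illustrative rack heat budget (fractions and the 180 kW example are from the text;
# this is a sketch, not a design calculation).
rack_heat_load_kw = 180.0                       # expected maximum rack dissipation
cold_plate_fraction = 0.96                      # at least 96% conducted directly to water
air_side_fraction = 1.0 - cold_plate_fraction   # up to 4%, captured by the RDHx

q_cold_plates_kw = rack_heat_load_kw * cold_plate_fraction   # ~172.8 kW via cold plates
q_rdhx_kw = rack_heat_load_kw * air_side_fraction            # ~7.2 kW air -> RDHx -> water
print(f"to cold plates: {q_cold_plates_kw:.1f} kW, to air/RDHx: {q_rdhx_kw:.1f} kW")
```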

It was also a power and cooling objective to supply and remove up to 270 kW, redundantly. For cooling, this means that operation must be maintained with one failed water conditioning unit. Actual rack power levels, however, are not expected to exceed 180 kW, even for the maximum configuration shown in Fig. 2.

Figure 3 shows a schematic of the Power 775 water cooling system. Up to four water conditioning units (WCUs) provide in excess of 100 gpm of system water to the electronics within the rack (the WCU function and details are given in the next section). Eaton Aeroquip ball valve quick connects are used to connect the WCUs to both the system manifolds and the facilities building chilled water system. The supply and return manifolds are constructed of 1 in. × 2 in. stainless steel tubing with variously designed formed EPDM hoses terminated with poppeted quick disconnects. The up to 12 server drawers, the disk enclosure, the bulk power assemblies, and the RDHx are all connected to the manifolds in parallel. Design flow rates to the various components are depicted in Fig. 4.

The system water temperature to the electronics is regulated by the WCUs to be between 15 °C and 24 °C. The algorithm that determines the regulation temperature is based on the room dew point, the building chilled water inlet temperature, and the total rack heat load. Redundant relative humidity and ambient temperature sensors mounted in the rack are used to determine the room dew point. The system water temperature is kept at a minimum of 7 °C above the dew point to ensure that condensation does not form within the electronics enclosures. When conditions permit, the system water temperature is lowered so that lower temperature water is supplied to the RDHx, which maintains the room ambient at a lower temperature. Lower system water temperatures also allow the WCU pumps to be run at lower rotational speeds, reducing power consumption, because lower water temperatures permit lower water flow rates while still maintaining device temperatures. Pump power can be cut by more than half by operating with 15 °C water to the electronics. Pump rotational speeds are also set by the system configuration to reduce pump power when less flow is required.
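A minimal sketch of this set-point logic is given below. The 7 °C dew-point margin and the 15-24 °C regulation band are taken from the text; the function name, the fixed 2 °C heat-exchanger approach, and the omission of the rack-heat-load term are illustrative assumptions, not IBM's actual control firmware.

```python
def system_water_setpoint_c(room_dew_point_c: float, bcw_inlet_c: float) -> float:
    """Illustrative WCU set-point selection (assumed logic, not the shipped firmware).

    Constraints taken from the text: a 15-24 C regulation band and a minimum of
    7 C above the room dew point. The real algorithm also factors in the total
    rack heat load, which this simplified sketch omits.
    """
    DEW_POINT_MARGIN_C = 7.0
    BAND_MIN_C, BAND_MAX_C = 15.0, 24.0

    # Coldest water permitted without condensation risk inside the enclosures.
    floor_c = max(BAND_MIN_C, room_dew_point_c + DEW_POINT_MARGIN_C)

    # The system water also cannot be colder than the building chilled water can
    # produce across the WCU plate heat exchanger; a 2 C approach temperature is
    # an assumed placeholder for that limit.
    ASSUMED_HX_APPROACH_C = 2.0
    floor_c = max(floor_c, bcw_inlet_c + ASSUMED_HX_APPROACH_C)

    # Prefer the coldest allowable water (lower pump speed and flow), clamped to the band.
    return min(BAND_MAX_C, floor_c)

# Example: a 12 C room dew point and 10 C chilled water give a 19 C set point.
print(system_water_setpoint_c(room_dew_point_c=12.0, bcw_inlet_c=10.0))
```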

The water conditioning unit (WCU) provides a buffer between the usually cold and dirty building chilled water and the clean, chemically controlled, above-dew-point system water. The WCU in the Power 775 system is very similar in structure and flow schematic to the Power 575 WCU (see Fig. 5). A water-to-water stainless steel plate heat exchanger provides the barrier between the building chilled water and the water flowing through the system. A pump on the system side provides flow through the system. The position of the control valve on the building chilled water side is regulated to maintain the water temperature at a predetermined set point. Temperature sensors exist for control and diagnostic purposes on both the system and building chilled water sides. Additional sensors were added to the Power 775 WCU to improve reliability; the unit is designed to tolerate any single sensor failure. The unit also includes a customer water flow meter, not present on the Power 575, to provide this information to the end user.
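The valve regulation described above is a conventional feedback loop. The proportional-integral sketch below is a generic illustration of how a chilled-water valve position might be trimmed toward the set point; the class name, gains, and update interface are assumptions, not the actual WCU controller.

```python
class ValvePI:
    """Generic position-form PI loop for the BCW-side control valve (illustrative gains)."""

    def __init__(self, kp: float = 0.05, ki: float = 0.01):
        self.kp, self.ki = kp, ki
        self.integral = 0.0

    def update(self, setpoint_c: float, measured_c: float, dt_s: float) -> float:
        # If the system water is warmer than the set point, open the valve further
        # to admit more building chilled water through the plate heat exchanger.
        error = measured_c - setpoint_c
        self.integral += error * dt_s
        position = self.kp * error + self.ki * self.integral
        return min(1.0, max(0.0, position))  # valve opening: 0 = closed, 1 = fully open

valve = ValvePI()
print(valve.update(setpoint_c=18.0, measured_c=19.2, dt_s=1.0))  # opens slightly (~0.07)
```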

Table 2 clearly shows that the Power 775 WCU has significantly greater capability than the Power 575 WCU. Although the size of the unit increased by only 1.6×, the pumping and heat removal capabilities are 2× to 3× those of the Power 575 WCU. One additional major difference between the two WCUs is the location of the water connections. The Power 575 WCU had the building chilled and system water connections on opposite sides of the unit, whereas the Power 775 WCU has all of the fittings on the same side. Figure 6 shows two isometric views of the Power 775 WCU (without top and bottom covers).

With the exception of the PCI-E cards and some low power components on the DCCAs, the Power 775 server drawer (Fig. 7) is 100% water cooled. Cold plates cool the QCMs, DIMMs, HUBs, and the VTMs powering the QCM, DIMM, and HUB modules. The worst case design heat load to water is in excess of 19 kW: 8 kW from the QCMs, 6.6 kW from the DIMMs, 2.4 kW from the HUBs, and 2 kW from the DCCAs. Typical power dissipation, however, is expected to be approximately 15 kW. The PCI cards, when populated (not all server drawers are populated with PCI cards), will dissipate no more than 400 W to air.
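For reference, the component design heat loads quoted above can be tallied as follows; every number comes from the text, and the snippet only sums and normalizes them.

```python
# Worst-case design heat loads to water within one server drawer (values from the text).
drawer_loads_kw = {"QCMs": 8.0, "DIMMs": 6.6, "HUBs": 2.4, "DCCAs": 2.0}
total_kw = sum(drawer_loads_kw.values())  # 19.0 kW design; ~15 kW is typical per the text
for name, load in drawer_loads_kw.items():
    print(f"{name}: {load:.1f} kW ({100 * load / total_kw:.0f}% of the design load)")
print(f"total design heat load to water: {total_kw:.1f} kW")
```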

Figure 8 gives an isometric view of the cold plate and plumbing layout within the server drawer. The server drawer connects to the rack supply and return manifolds with quick connect fittings located in the rear. Supply and return manifolds run within, and on each side of, the server drawer. Nine cold plate assemblies are connected in parallel between these manifolds. The two DCCA cold plate assemblies (not shown) are connected to the manifolds via EPDM hoses and quick connects since they are the only cold plate assemblies that are field replaceable units (FRUs). The other seven cold plate assemblies are mechanically connected to the manifolds with double O-ring plugs to facilitate assembly and test in systems manufacturing.

Figure 9 more clearly depicts the flow arrangement within the server drawer. A total design flow of 7 gpm is distributed to the 9 cold plate assemblies as shown. Among the challenges confronting the cold plate design, cost was paramount: the cold plates had to use the least expensive cooling technology possible while still meeting the temperature objectives for each electronic component. The cooling technology selected was aluminum plates with embedded copper tubes. Yet another challenge was to distribute the water appropriately among the cold plate assemblies. ANSYS CFX [13] was used to perform computational fluid dynamics analyses to predict the water flow through each path and identify areas for improvement to achieve the desired flow rate targets. Tube lengths, diameters, and cold plate flow arrangements (series versus parallel, number of passes) were considered when establishing the desired flow distribution. In particular, the QCM and HUB cold plates required a modification to the tube arrangement within the cold plate structure (depicted in Fig. 10) to reduce the cold plate flow impedance. The pattern seen in the figure reduced the impedance by 13.5% compared to a conventional serpentine pattern by replacing two smaller radius bends with two larger radius bends.
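The flow-balancing problem the CFD analysis addressed can be illustrated with a simple hydraulic-network model: parallel paths between the same supply and return manifolds share one pressure drop, and with a turbulent-loss impedance of the form dP = K*Q^2 each path's flow scales as 1/sqrt(K). The sketch below splits the 7 gpm total among a few notional paths; the path names and impedance coefficients are made-up placeholders, not the measured impedances of the Power 775 cold plates.

```python
import math

def split_parallel_flow(total_gpm: float, impedances: dict) -> dict:
    """Distribute a total flow among parallel paths sharing one pressure drop.

    Assumes turbulent-loss behavior dP = K * Q**2 for every path, so each path
    receives flow in proportion to 1/sqrt(K). K values are illustrative only.
    """
    weights = {name: 1.0 / math.sqrt(k) for name, k in impedances.items()}
    scale = total_gpm / sum(weights.values())
    return {name: w * scale for name, w in weights.items()}

# Hypothetical impedance coefficients (psi/gpm^2) for a few of the nine paths.
example_paths = {"QCM pair A": 2.0, "QCM pair B": 2.0, "HUB plate": 3.5, "cold rails": 5.0}
for name, q in split_parallel_flow(7.0, example_paths).items():
    print(f"{name}: {q:.2f} gpm")
```

Lowering a path's impedance (as the modified QCM/HUB tube pattern does) raises its share of the total flow, which is why the 13.5% impedance reduction mattered for meeting the flow targets.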

Water cooling the DIMMs was a completely new challenge. The design had to be capable of removing 52 W per DIMM. The DIMM assembly, as shown in Fig. 11, proved to be the best thermal solution for this application. The assembly consists of an aluminum spreader with an embedded heat pipe to cool the two ASICs and 40 DRAMs on the front side of the card. An aluminum plate is used to cool the other 40 DRAMs on the back side of the card. These two aluminum pieces are mechanically held together with screws at each end and three spring-loaded screws to minimize the thermal interface resistance. The two aluminum spreaders transfer the heat generated by the DIMM components into each end of the assembly.

The DIMM assembly is then mechanically connected to the cold rails with two bolts. A high performance solid metal interface material is used between the DIMM assembly and the cold rail to efficiently transfer the heat from one component to the other. Figure 12 shows an exploded view of the cold rail assembly.
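The conduction path just described (DRAMs and ASICs, to the spreader and heat pipe, through the solid metal interface, into the cold rail and the water) can be viewed as a series thermal-resistance chain. In the sketch below, only the 52 W per-DIMM load comes from the text; the water temperature and the individual resistance values are assumed placeholders for illustration.

```python
# Series thermal-resistance estimate for one DIMM's conduction path to water.
# Resistance values and water temperature are assumed; only the 52 W load is from the text.
dimm_power_w = 52.0
water_temp_c = 20.0  # assumed local system-water temperature

resistances_c_per_w = {
    "component to spreader/heat pipe": 0.10,
    "spreader to cold rail (solid metal interface)": 0.05,
    "cold rail conduction + convection to water": 0.15,
}
total_r = sum(resistances_c_per_w.values())
hottest_component_c = water_temp_c + dimm_power_w * total_r
print(f"estimated hottest DIMM component: {hottest_component_c:.1f} C")  # ~35.6 C
```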

A cold rail assembly consists of two 30.5 in. long aluminum blocks: the inlet-side cold rail and the outlet-side cold rail. Each cold rail has four copper tubes, as seen in Fig. 13, which carry the cooling fluid. The cold rail assembly includes four crossover tubes between the inlet-side and outlet-side cold rails. The crossover tubes are twisted so that the tubes in the uppermost positions on the inlet-side cold rail (tubes 1 and 2 in Fig. 13) end up in the lower two positions on the exit-side cold rail, and vice versa for the other two tubes (tubes 3 and 4). This is done to improve the overall heat transfer by moving the cooler fluid exiting the inlet-side cold rail closer to the hot surface of the outlet-side cold rail.

There are two cold rail assemblies per node. After the cooling fluid exits the cold rail assembly, it enters another cold plate to cool 192 voltage transformation modules (VTMs). This cold plate is located on the back side of the main board and is also used as a stiffener plate for the whole structure. Finally, the cooling liquid travels to the return manifold and exits the node.

The new Mariner hybrid air and water cooled bulk power assembly, co-developed with Delta, provides the bulk power conversion function for the (up to) 270 kW Mariner system. In a fully configured rack, the Mariner system includes 2 BPAs (bulk power assemblies), each of which is fed by 2 line cords (to achieve n+1 line cord redundancy) and comprises 6 BPRs (bulk power regulators), 1 BPD (bulk power distribution, which consists of 2 separate and redundant systems within one chassis), and 1 BPCH (bulk power communication and hub). Since this is a new architecture, much of the basic thermal protection and control structures and algorithms were new developments. In summary, the BPRs dissipate up to 90% of their generated heat into water cooled cold plates and the BPD SCBs (static circuit breakers) are cold plate cooled; the remainder of the generated heat is dissipated into air, flowing from the front of the rack to the rear, impelled by 4 parallel BPFs (bulk power fans). The BPFs are 60 mm dual rotor counter-rotating fans, with their speed controlled by PWM (pulse width modulation). Thermal protection is provided by a multitude of thermistors on each BPR, BPD, and BPCH, each of which has an associated thermal limit determined by extensive testing at IBM and Delta in conjunction with the component specifications provided by Delta. In the BPRs, thermal protection is provided by firmware: temperature limits are compared at the unit level with actual readings, and in the event of an over-temperature limit breach, each individual BPR can power down to avoid damage. In the BPD, each of the 26 SCBs is instrumented such that, if the electrical current draw on any one circuit is too great and results in an unsafe temperature, the firmware can deactivate the stressed circuit. The air-cooled components in the BPD are further protected by an airflow sensor, which can detect a low airflow condition and deactivate all of the SCBs. The air-cooled BPCH, however, is protected only by an airflow sensor at the firmware level.
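The protection behavior described above (per-BPR over-temperature comparison, per-SCB deactivation, and the airflow check) can be summarized with the hedged pseudologic below. The function names, thresholds, and data structures are illustrative assumptions; the actual firmware limits were set by the IBM/Delta testing mentioned in the text.

```python
# Illustrative protection logic for the bulk power assembly (assumed thresholds and
# structure, not the shipped firmware).

def check_bpr(thermistor_temps_c: list[float], limit_c: float = 85.0) -> str:
    """Each BPR powers itself down if any of its thermistors exceeds its limit."""
    return "power_down" if any(t > limit_c for t in thermistor_temps_c) else "ok"

def check_bpd(scb_temps_c: dict[int, float], airflow_ok: bool,
              scb_limit_c: float = 90.0) -> list[int]:
    """Return the SCB circuits to deactivate: all of them on low airflow,
    otherwise only the individual circuits that are running too hot."""
    if not airflow_ok:
        return list(scb_temps_c)                       # low airflow: drop all SCBs
    return [cb for cb, t in scb_temps_c.items() if t > scb_limit_c]

print(check_bpr([62.0, 71.5, 68.0]))                   # "ok"
print(check_bpd({1: 55.0, 2: 95.0, 3: 60.0}, True))    # [2]
```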

Figure 14 is an isometric view of the BPE with the BPRs, BPCH, and BPD removed. As shown by the arrows in the figure, airflow is from the upper right (the front of the rack), through the units in the front of the chassis, then outward to both sides where the fans reside, and then through the BPRs, which fit into the rear of the chassis. Water for the cold plates enters through quick disconnects at the rear of the chassis into manifolds that feed all of the rear BPRs and the front manifolds, which, in turn, feed the front BPRs and the BPD, all in parallel. In operation, the BPRs receive 0.5–0.7 gpm depending on the system configuration, and the BPD receives roughly 0.15 gpm. The airflow varies with altitude and the temperatures of various components, and ranges from 75 to 200 CFM total in normal operation.

Figure 15 is a top-down view of a BPR with the aluminum cold plate base material removed. The cold plate consists of an aluminum plate with pressed-in rectangular copper fluid channels, shown in the figure as outlines. Using two manifolds, the cooling water is split into three paths and recombined inboard of the two quick disconnects attaching the cold plate to the BPE manifolds. Between air and water cooling, each BPR is configured to dissipate up to 2.6 kW while maintaining component temperatures below the targets chosen to meet the reliability requirements. The main heat-dissipating components are affixed to aluminum blocks that stand vertically within the BPR enclosure (the rectangles under the tubing outlines in the figure), which are, in turn, fastened via screws to the cold plate, which also acts as the top surface of the enclosure.

Similarly, some components in the BPD (the static circuit breakers, for example) are mounted to a cold plate, as shown in Fig. 16. The BPD supplies power to each of the server drawers in a Mariner system via the connectors shown in the lower left of the figure. The cold plate cools the main power dissipating components residing beneath it, and the remainder of the circuitry is air cooled. The BPCH, however, is entirely air cooled.

The storage disk enclosure is an electronics chassis that houses up to 384 small form factor (3.5 in.) SCSI hard disk drives (HDDs) in a volume twice the size (i.e., twice the height) of the server drawer (Fig. 17). The HDDs are packaged four to a card tray that can be concurrently inserted into or removed from the chassis. Forty-eight trays can be inserted/removed from the front of the chassis; the other 48 trays can be inserted/removed from the rear of the chassis. Power conversion and SAS (serial attached SCSI) expander cards are also housed in the front, with additional SAS expander cards housed in the rear.

Packaged between the front and rear electronics sections are 12 Nidec Servo G1238 120 mm × 38 mm fans arranged six across by two deep, with an air straightener interposed between each pair, generating a nominal airflow of 173 CFM through the chassis. Immediately downstream of the fans is a cross flow air-to-liquid fin and tube heat exchanger, referred to hereafter as the mid-bay heat exchanger (MBHx). No less than 95% of the heat dissipated by the electronics in the front section of the chassis is removed by the water flowing through the MBHx. It would not be possible to cool the electronics in the rear of the chassis to acceptable (i.e., reliable) temperatures without removing the preheat that would otherwise come from the electronics in the front of the chassis. Under most conditions of airflow, power dissipation, chassis inlet air temperature, and water inlet temperature, the MBHx will remove more than 100% of the heat added by the electronics; the air temperature exiting the MBHx will then be lower than the air temperature entering the chassis.

For example, under nominal conditions, with all fans operating at 2800 rpm, a 23 °C room, sea-level altitude, and 4.0 gpm of water flow at 18 °C, the MBHx removed approximately 2300 W from the airflow. A fully configured chassis dissipates a nominal total of roughly 3500 W, with 2000 W dissipated in the front section (upstream of the MBHx). The MBHx therefore extracted 300 W (15%) more heat than was put into the air by the electronics.
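The "more than 100%" behavior follows from a simple air-side energy balance: if the MBHx removes more heat from the airstream than the upstream electronics added, the exit air must leave cooler than it entered. The sketch below reproduces the nominal example above; the power and airflow figures are from the text, while the air density and specific heat are standard sea-level values, so the resulting temperature differences are only illustrative.

```python
# Air-side energy balance for the disk-enclosure mid-bay heat exchanger (MBHx).
CFM_TO_M3S = 4.719e-4
airflow_m3s = 173 * CFM_TO_M3S          # nominal chassis airflow (from the text)
rho_kg_m3, cp_j_kgk = 1.19, 1005.0      # air at ~23 C, sea level (approximate properties)
mdot_cp_w_per_k = airflow_m3s * rho_kg_m3 * cp_j_kgk   # ~98 W/K thermal capacity rate

q_front_w = 2000.0                       # heat added to the air upstream of the MBHx
q_mbhx_w = 2300.0                        # heat removed from the air by the MBHx

rise_k = q_front_w / mdot_cp_w_per_k     # ~20.5 K air temperature rise before the MBHx
drop_k = q_mbhx_w / mdot_cp_w_per_k      # ~23.5 K drop across the MBHx
print(f"net exit-air temperature change: {rise_k - drop_k:+.1f} K vs. chassis inlet")
# Prints a negative number: the exhaust air leaves cooler than it entered.
```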

The Power 775 represents a monumental leap forward in computer performance and energy efficiency, and 100% water cooling is an enabling technology for this advance. In addition to the processor, the memory, power conversion, and I/O electronics conduct their heat directly to water through aluminum cold plates with embedded copper tubes.

Nomenclature

ASIC = application specific integrated circuit
BCW = building (facilities) chilled water
BPA = bulk power assembly
BPCH = bulk power control/hub
BPD = bulk power distribution
BPE = bulk power enclosure
BPR = bulk power regulator
CEC = central electronics complex
CFM = cubic feet per minute
CRAC = computer room air conditioning unit
DCCA = distributed converter and control assembly
DIMM = dual in-line memory module
DRAM = dynamic random-access memory
°C = degrees Celsius
°F = degrees Fahrenheit
EPDM = ethylene propylene diene monomer (M-class) rubber
gpm = gallons per minute
GB = gigabytes (10⁹ bytes)
in. = inches
MBHx = mid-bay heat exchanger
PCI = peripheral component interconnect
PFlops = peta (10¹⁵) floating point operations per second
QCM = quad chip module
psi (gauge) = pounds per square inch, gauge
RDHx = rear door heat exchanger
SCB = static circuit breaker
TFlops = tera (10¹²) floating point operations per second
U = rack unit height (44.45 mm or 1.75 in.)
VTM = voltage transformation module
WCU = water conditioning unit

Copyright © 2012 by American Society of Mechanical Engineers

Figures

Figure 1: Server (CEC) drawer physical structure
Figure 2: Fully configured Power 775 computer rack
Figure 3: Schematic of the Power 775 water cooling system
Figure 4: System design flow rates
Figure 5: Power 775 water conditioning unit schematic
Figure 6: Two different isometric views of the Power 775 WCU (covers off)
Figure 7: View of a Power 775 server drawer
Figure 8: Server drawer cold plate/plumbing layout
Figure 9: Server drawer flow schematic
Figure 10: QCM cold plate assembly illustrating the custom tube arrangement within the aluminum structure
Figure 11: Exploded view of the DIMM assembly
Figure 12: Exploded view showing the two cold rail assemblies
Figure 13: View of a DIMM attached to the cold rails (sectioned through the cold rails)
Figure 14: Isometric view of the bulk power enclosure (BPE)
Figure 15: Top-down view of a bulk power regulator (BPR)
Figure 16: Isometric view of a bulk power distribution (BPD) unit with the cover removed
Figure 17: Disk enclosure featuring the mid-bay heat exchanger

Tables

Table 1: Comparison of three generations of IBM high performance compute platforms
Table 2: WCU comparison: Power 575 vs. Power 775
