Review Article

Evolution of Hardware Morphology of Large-Scale Computers and the Trend of Space Allocation for Thermal Management

Wataru Nakayama

Life Fellow ASME
ThermTech International,
920-7 Higashi Koiso,
Oh-Iso Machi, Kanagawa 255-0004, Japan
e-mail: watnakayama@aol.com

Contributed by the Electronic and Photonic Packaging Division of ASME for publication in the JOURNAL OF ELECTRONIC PACKAGING. Manuscript received July 29, 2016; final manuscript received October 18, 2016; published online November 23, 2016. Assoc. Editor: Justin A. Weibel.

J. Electron. Packag 139(1), 010801 (Nov 23, 2016) (22 pages) Paper No: EP-16-1092; doi: 10.1115/1.4035019 History: Received July 29, 2016; Revised October 18, 2016

Thermal management of very large-scale computers will have to leave the traditional well-beaten path. Up to the present time, the primary concern has been the rising heat flux on the integrated circuit chip, while space has been available for the implementation of high-performance cooling designs. In future systems, the spatial constraint will become a primary determinant of thermal management methodology. To corroborate this perspective, the evolution of the computer's hardware morphology is simulated. The simulation tool is a geometric model whose structure is composed of circuit cells and platforms for circuit blocks. The cell is the minimum circuit element, whose size is pegged to the technology node, while the total number of cells represents the system size. The platforms are the models of microprocessor chips, multichip modules (MCMs), and printed wiring boards (PWBs). The major points of discussion are as follows: (1) The system morphology is dictated by the competition between the progress of the technology node and the demand for increase in the system size. (2) Only where the miniaturization of cells is achieved so as to deploy a system on a few PWBs is ample space created for thermal management. (3) In the future, cell miniaturization will hit its physical limit, while the demand for larger systems will be unabated. Liquid cooling, where the coolant is driven through very long microchannels, may provide a viable thermal solution.

In the era of the internet of things (IoT), the world depends on a gigantic fabric of communication networks. At the network hubs, server computers control the traffic of the diverse sorts of information generated and consumed by modern society. The networks are still proliferating, and the volume of information traffic is growing. To sustain the exponential growth of information traffic, server computers of increasing processing capacity go into service at the key sites of networks. Enhancement of processing capacity entails increasing power consumption by the server. Those computers require increasingly stringent thermal management to guarantee the reliability of operations. The cooling system for servers in a data center must be upgraded accordingly, and enhanced cooling generally means a rise in the power consumption of the cooling system. The data center power is thus composed of the power consumed by the computers and that consumed by the cooling system; the ratio of these power components is now approaching unity. This situation drives intensive research efforts aimed at the development of highly energy-efficient cooling systems [1].

There is another issue which receives less attention from the heat transfer community but warrants the concern of data center developers, that is, the expansion of floor space to accommodate an increasing number of servers. The space issue is most pressing for data centers in metropolitan areas where the land is inevitably expensive. Today, data centers of the largest scale are located in open lands, far removed from urban areas. Even for such remote data centers, the land is not limitless, so that the expansion of floor area will approach a limit sometime in the future. It should be noted that the issue of floor space is tightly coupled with thermal management engineering. Obviously, cooling of servers occupying a large floor area is itself an engineering challenge. However, in the present study, our focus is not on thermal solutions for actual data centers. We examine the subject of space/thermal coupling from a somewhat unconventional angle. The motivation of the study is illustrated by a question: What will define the future direction of thermal management technology for very large computing systems? Among the many factors that are relevant to this question, we choose the space occupied by hardware components and discuss how the space for cooling design will have to shrink in order to avert explosive expansion of the floor space.

The floor space will also become an impediment to the further expansion of supercomputing centers. Supercomputers are the tools of large-scale numerical simulations of scientific, industrial, and social importance. Modeling of global climate change, weather forecasting, and simulation of biological processes are examples that require ever more powerful supercomputers. Presently, the spatial and temporal resolutions of simulations allowed by the existing supercomputers are not as fine as required by such applications. Supercomputer projects are in progress in several countries, and supercomputer performance has become an item of international competition. Top-rank supercomputing centers are massive facilities, often equipped with dedicated power stations. Their power rating has reached several megawatts today [2]. While the rising power requirement is now a primary concern for supercomputer projects, the spatial constraint is looming as another barrier to the progress of supercomputing [3,4].

In system-level packaging, the building block components, ranging from chips, packages, and wiring boards to racks, are assembled in order for the system to meet the design goals of computing performance, power consumption, and space occupancy. To achieve the design goals, the synthesis of different engineering disciplines is required. The relevant disciplines range from electronic, electrical, software, materials, and mechanical engineering to manufacturing. Thus, systems packaging is a multidisciplinary undertaking, and its description warrants a book [5] or a handbook series [6]. In the present study, we attempt to interpret the evolution of systems packaging with a much narrower focus. We set our focus on the geometric configuration of computers, review its evolution over the past decades, and project possible developments in the future.

In general, the progress of computer technology is captured on a trajectory in the domain spanned by three axes: space, energy, and time, where the time axis provides the measure of computing speed. Focusing on the spatial requirement of computing leaves some issues of interest outside the scope of the present study. Three-dimensional (3D) packaging at the chip level (chip stacking), now being pursued through refinement of through-silicon-vias, contributes to compact packaging at the system level. Although 3D packaging has obvious impacts on the system's physical volume, its benefits are more pronounced in the speedup of computing and energy saving in computation. Hence, the discussion on chip-level 3D packaging is reserved for a future study where time and energy will be brought into the scope. Innovations in chip-level circuit architectures, such as the employment of coprocessors or graphics processing unit chips among central processing unit (CPU) chips, have a certain impact on the system's construction. However, discussion about them needs to consider job flows in the space–time–energy domain and hence is left to a future, expanded study. While innovation in chip-level architecture provides the computer designer with the opportunity to use improved building block components, the system-level architecture depends to a large extent on the type of computing jobs to be processed. The job type defines the pattern of data flow in the system. The data flow pattern in cloud computing is distinct from that in scientific computing. Thus, server computers in data centers have internal organizations different from those of supercomputers. We leave in-depth discussion of cloud versus supercomputing to future studies, since the subject is concerned with the location of the design point in the space–energy–time domain.

As such, these issues of interest to computer and packaging professionals are set outside the scope of the present study. The present study intends to develop a perspective over the evolution of computers in a long time range that extends from the past, through the present, and far into the future. The computers of the 1980s and 1990s are the materials of choice in starting the analytical development. Those computers are fairly obsolete for the current generation of computer designers and users. However, some aspects of the technological development of this era serve as the illustrative examples to explain the fundamental dynamics of hardware evolution.

In a long-range perspective extending over several decades, the variation in hardware designs observed in computers of contemporary generations is of secondary importance. For example, the race in supercomputer development, as publicized in the list of TOP500, mirrors the competition between different design strategies. The processing performance of top-ranking computers is in a range of several tens of peta floating point operations per second (FLOPS), and the race is to achieve a higher score in peta-scale computing. Comparison of the hardware designs of existing supercomputers is categorized as a study in short-range perspective. By contrast, the present study is partly motivated by our concern with the technological hurdles that are looming tall for exa-scale computing and beyond. An account of the challenges in regard to the spatial requirement by future generation supercomputers is presented near the end of the paper.

To complete the description of the scope of the study, we add a note about the term “system.” The system in our definition is a computer housed in a rack or an array of racks. Where the system's spatial requirement grows large, we may split the system into a set of moderately sized subsystems and place those subsystems at separate locations instead of building a massive computing facility on a large tract of land. Indeed, this concept is the basis of the distributed computing scheme. The design of a distributed computing system involves considerations of the energy and time required for data traffic in the network of subsystems. Discussion of distributed systems needs an expansion of the study scope, and thus has to be reserved for a future report. Meanwhile, we find a rationale for paying attention to massive computing facilities in the recent wave of construction of large-scale data centers and supercomputing centers.

The paper is organized as follows: Section 2 describes the concept behind the geometric modeling of computing systems and introduces the terminology devised for the present modeling. In Sec. 3, the geometric features of the mainframe computers of the 1980s and 1990s are reproduced on the models. While these computers of the past decades used logic chips of single core, most of the current generation computers have multicore processors. Section 4 describes the models of multicore processor chips, where the relationship between the technology node and the number of cores is discussed. Section 5 is devoted to discussion about the space in a rack. The number of PWBs in the rack reflects the balance between the dimensions of building block circuits and the total number of circuits (system size). The free space in the rack shrinks, where the demand for system size grows at a faster rate than that of circuit miniaturization. The system's physical volume depends also on the scheme of circuit layout, particularly how logic and memory circuits are deployed on PWBs. While the model in Sec. 5 assumes a particular circuit layout, in Sec. 6 the effect of circuit layout on the spatial requirement by the system is examined. Section 7 discusses multiple rack systems of future generation computers. Based on the outlook for spatial constraints on large-scale computers, the emerging needs for thermal management are discussed. Section 8 concludes the paper.

We devise models of the computer's hardware in which most of the structural details of actual computers are stripped away, and only the salient geometric features are retained. Similar to the structural organization of actual computers, the components of different sizes are assembled in a hierarchical order. In actual computers, the components at adjacent hierarchy levels are connected electrically through the input/output (I/O) ports, but the I/O ports are not included explicitly in the model. The I/O ports play some roles in defining the hardware morphology; their roles are tightly coupled with those of the wires fabricated on the components. The progress of circuit integration is synonymous with rising wiring density and an increasing number of I/O ports on the component. Meanwhile, the cost of circuit fabrication on the component rises with increasing wiring density and I/O port population. The cost penalty multiplies where the component area is large, due to the reduction in product yield. As a way to mitigate the cost penalty, the expansion of component area is suppressed. This is an interpretation of why the physical sizes of the chip, the PWB, and other intermediate substrates have been slow to increase, while the circuit carrying capacity of these components has increased dramatically. In the present model, the sizes of the components corresponding to the chip, the multichip module, and the PWB are assumed to be invariant or bounded by an upper limit during the morphological evolution.

Figure 1 depicts the schematics of the model components and their layouts. The atomic unit of circuit is the “cell.” The cell that performs logic processing is the “L-cell,” which corresponds to the circuit unit termed variously an elementary circuit, a logic gate, or a logic grain at the finest resolution level. The cell that stores instructions or data is the “M-cell.” The square for the M-cell is shown smaller than that for the L-cell, symbolizing the relatively small area occupied by a memory cell. Note that the memory cell involves only a few switches, while an elementary logic circuit such as an adder involves a few tens of switches.

A certain number of L-cells are assembled to form a logic block, L-block. The L-block is a higher level circuit grain equipped with I/O ports to communicate with other L-blocks and M-cells. The L-blocks and M-cells are implemented on hardware components in basically two ways, as shown in Figs. 1(a) and 1(b).

In Fig. 1(a), the “tile” is a distinct hardware component that accommodates a single L-block. In Fig. 1(b), multiple L-blocks are integrated together with a block of M-cells on a hardware component named “C0”-card. The name “card” is introduced to mean a platform for multiple L-blocks, M-tiles, or a mixture of L-blocks and M-tiles. In actual systems, the tile is the chip, and C0-card of Fig. 1(b) is a multicore microprocessor chip. The C0-card of Fig. 1(a) accommodates L-tiles and an M-tile; this is an equivalent of a multichip module. The “0” of C0-card means the basic level in the hierarchy of cards. At the next upper level, there is a C1-card where multiple C0-cards are mounted. Further up the hierarchy ladder, a C2-card accommodates multiple C1-cards, a C3-card has multiple C2-cards, etc. The arrangement of cards defines the system's geometric morphology.

All model components in Fig. 1 are square pieces. Besides, the low-level components assume geometrically regular locations on an upper-level component. These geometric features are shared by the actual components corresponding to the cards of level 0 and higher, such as microprocessor chips, multichip modules, and PWBs. Geometric regularity breaks down to some extent in actual circuit layouts at a level corresponding to the cell level. Nevertheless, we assume that square cells pack the area of a logic block or an M-tile. This part of the modeling is interpreted as follows. A system is composed of a large number of cells, and the ratio of system/cell length scales amounts to the order of 10^6–10^8. In describing the construction of such a system in terms of cells, we employ the notion of a statistically averaged cell. The side length of a cell represents the statistically averaged length scale of elementary circuits in actual circuit blocks. In other words, the assumed square geometry of a cell has little physical significance; only the length scale of a cell and the number of cells in a circuit block are the quantities of our interest.
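The averaged-cell notion reduces to a single relation: the cell side length is the block side length divided by the square root of the cell count. A minimal sketch (Python) with purely illustrative numbers that are not taken from the paper's tables:

import math

def averaged_cell_side(block_side_mm: float, n_cells: int) -> float:
    """Statistically averaged cell side length (mm) for a square circuit block
    of side block_side_mm that carries n_cells elementary circuits."""
    return block_side_mm / math.sqrt(n_cells)

# Illustrative only: a 10 mm x 10 mm logic block carrying 1.2e4 cells
# has an averaged cell side of roughly 0.09 mm.
print(averaged_cell_side(10.0, 12_000))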

Note that the names tile and card are not commonly used in the computer literature. They are introduced here to capture the hardware components in the context of hierarchical assembly. As mentioned above, the card of a particular hierarchical level may correspond to a chip, a multichip module, or a PWB in actual systems. The correspondence between the models and actual components shifts with the evolution of technology. For example, circuit blocks that once occupied a PWB are accommodated in a multichip module (MCM) as a result of the progress of circuit integration; the PWB and the MCM in this example are represented by a C1-card. For the convenience of readers, in describing the hardware evolution of the 1980s and 1990s, the name of the actual component will be appended to the name of the corresponding model component. During this period (1980s–1990s), the correspondence between model and actual components was redefined in response to the progress of circuit integration technology, and the shift of correspondence mirrors a distinct change of the system morphology. Later, in the 2000s, chip-level integration resulted in an increase in the number of logic circuit blocks on microprocessor chips. The logic circuit block on a microprocessor chip is commonly called a “core,” and the number of cores on the chip will be one of the major parameters in the description of microprocessor-based technology. Also to be noted is that, in the era of microprocessor-based technology, the correspondence between model and actual components has shifted little. Hence, the use of the names of actual components, that is, chip, MCM, and PWB, is deemed sensible. In Secs. 5 and 6, we adopt the conventional names of components.

Hereafter, we introduce a convention to specify the main function, logic or memory, borne by the card. “L” is appended to the card designation where the card contains L-tiles regardless of the presence or absence of M-tiles, while “M” in front of the card designation means that the card contains only M-tiles. However, these function symbols are omitted wherever the explicit distinction of card function is not required.

In the years covered in this section, the computers called “mainframes” were the manifestation of the then most advanced device and packaging technologies. We choose three generations of mainframe computers developed by a particular manufacturer, Hitachi. The computers are the M680 [7], shipped in 1985, the M880 [8], shipped in 1991, and the MP5800 [9], shipped in 1995. The literature [7–9] covers diverse issues of software design, processing performance, packaging and manufacturing, and cooling. From the mass of information compiled in the literature, the geometric data of components and systems were extracted and used to construct the corresponding models. Not all the necessary dimensional data are available in the literature; guesswork was required to determine the missing dimensions. Besides, the model components are not precise reproductions of the corresponding actual components; many of the original dimensions were rounded in the process of model construction. Hence, the equivalence between the models and the actual computers is approximate. The modeling is intended only to illustrate the impacts of circuit integration on the morphologies of components and systems. The models are described with illustrations in the following. In all the models, three hierarchy levels of memory are assumed.

Model 1 (equivalent of the mid-1980s mainframe): Figure 2 shows the components up to structural level 1. A C0-card holds a unit of logic processing block composed of L-tiles and M-tiles of levels 1 and 2. Two C0-cards are butt-joined to form an LC1-card. Level-3 memory is conventionally called “main memory.” An array of M3-tiles (eight columns and 18 rows) is mounted on the two sides of an MC1-card (PWB).

Figure 3 shows the system construction. A C2-card (mother board) has the connector slots for 20 x C1-cards, with a slot pitch of 25.4 mm (1 in.). There are three C2-cards, one of which is the platform for an array of LC1-cards and the other two for MC1-card arrays. Communications between the C2-cards are provided by cables (not shown).

Model 2 (equivalent of the early-1990s mainframe): Figure 4 shows the layout of tiles on C0-cards (multichip modules (MCMs)) in model 2. LC0-card has a mix of L- and level-1 memory (M1-) tiles, while MC0-card accommodates level-2 memory (M2-) tiles.

As shown in Fig. 5, 23 x LC0-cards and 2 x MC0-cards are mounted on an LC1-card (PWB). The level-3 memory (M3) is the main memory. The M3-tiles are mounted, in 30 rows and 30 columns, on an MC1-card which has the same dimensions as the LC1-card.

Figure 6 shows the construction of the model 2 system. Two LC1/MC1 card pairs are laid in a plane, and interconnections between level-1 cards are provided by cables (not included in the figure).

Model 3 (equivalent of the mid-1990s server computer): By the mid-1990s, the market for large-scale computers had expanded considerably in response to the explosive growth of networks of personal and office computers. A computer that handles information processing and communications at the hub of a network is called a server. The manufacturers of mainframe computers began supplying server computers that serve at the hubs of large-scale networks. The design and manufacturing technologies for the servers are extended versions of those that were developed for the mainframes. Model 3 is the equivalent of a server computer that was built in the mid-1990s.

The components of model 3 up to the C0-card level have the same structural organizations as those of model 2, whereas the construction at level 1 is different from that of model 2. Figure 7 shows the installation of C0-cards (MCMs) and M3-memories (main memories) on a C1-card (mother PWB). Five C0-cards are mounted on a C1-card; four of them are LC0-cards, and one is an MC0-card. The M3-tiles (main memory chips) are accommodated in a compact 3D volume, where 14 × MC1-cards (PWBs) are set in an array with a placement pitch of 12.7 mm. The array of MC1-cards is plugged into a part of the C1-card area. As shown in Fig. 8, the model 3 system is composed of two C1-cards.

Summary of the evolution of system morphology (1980s–1990s): Tables 1 and 2 provide the quantitative information about the models. As for the dimensional data of the components, the footprint area is a primary variable. Where multiple cards are assembled in array, the intercard distance is another dimensional parameter. The placement pitch of cards is dictated by the pitch of connector slots on the mother PWB, 25.4 mm (model 1) or 12.7 mm (model 3).

Table 1 is the collection of footprint area data. The footnotes describe the correspondence between the model and actual components. The tile in model 1 is the single-chip (leaded chip-carrier) package. In models 2 and 3, the tile is a bare chip, where C4 bonding of the chip to the substrate is assumed. The employment of MCM packaging reduced the size of C0-card in models 2 and 3 from that in model 1. In model 2, a large PWB is employed to accommodate 36 MCMs as shown in Fig. 5. In model 3, the number of MCMs is reduced to five, bringing about the decrease in the required PWB area. Model 3 has the dedicated main memory PWBs (MC1-cards).

Table 2 lists the accommodation capacity of the components. The numbers of cells per tile are indicators of the progress in circuit design and fabrication technology. We will discuss them shortly referring to Fig. 9. The number of tiles on the C0-card is maintained at 36 across the three generations. However, the tile compositions are different, as shown in the footnote. In model 1, the C0-card accommodates M1- and M2-tiles, while, in models 2 and 3, M2-tiles are mounted on a dedicated substrate (MC0-card). The data of M3-tiles on the MC1-card reflect the area of the MC1-card (PWB area); the wide mother PWB in model 2 accommodates the largest number of M3-tiles. The information listed in the bottom three rows, from C0-cards/C1-card to system construction, is obvious from Figs. 2–8.
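The system-level cell totals discussed next follow mechanically from these accommodation capacities: the per-tile cell count is multiplied up the card hierarchy. A minimal sketch of that roll-up (Python; the numbers below are placeholders for illustration, not the entries of Table 2):

def cells_in_system(cells_per_tile: int, tiles_per_c0: int,
                    c0_per_c1: int, c1_per_system: int) -> int:
    """Roll a per-tile cell count up the card hierarchy to a system total."""
    return cells_per_tile * tiles_per_c0 * c0_per_c1 * c1_per_system

# Placeholder figures for illustration only (not Table 2 data):
# 2,000 L-cells per tile, 36 tiles per C0-card, 23 C0-cards per C1-card,
# and 2 logic C1-cards in the system.
print(cells_in_system(2_000, 36, 23, 2))   # 3,312,000 L-cells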

Table 3 is the summary of total cell populations in the system. The increase in the populations of all types of cells is brought about by the progress of circuit fabrication technology. While this is the case in general, the population growth of each functional cell reflects the options made available to the system designer by the development of packaging technology. An example is the transition of cell populations observed in the columns for models 1 and 2. While the demand for higher computing capacity drove up the L-cell population by a factor of about 9, the M1-cell population in model 2 was reduced by about 20% from that in model 1. The reduction of level-1 memory is compensated by an increase in the M2-cell population. Note that model 2 is based on MCM packaging. The integration of M2-tiles on a dedicated MCM (MC0-card) facilitates “fattening” of the level-2 memory, while the layout of MCMs as shown in Fig. 5 minimizes the average distance of access from the logic modules (LC0-cards) to the MC0-cards.

Another factor affects the formation of model 3. That is, the server computer (model 3) was designed to handle a more disparate range of data than those assumed for the earlier computers (models 1 and 2), and its operation involves search and transfer of data at high frequencies. “Fat” memories close to the logic circuits, as illustrated by the increased population of M1-cells, help to improve the performance of server computing. Meanwhile, relatively modest increase in the L-cell population in model 3 reflects the option in system design; the enhancement of performance in server computing was given a priority over the expansion of capacity for traditional scientific and business transaction computing.

Evolution of the computer technology can be viewed through various windows. For computer designers, the rate of circuit miniaturization over the years is one of the major yardsticks to decide on the construction of a next-generation computer. Since the tile (chip) size varies in a relatively narrow range, 24 mm square in model 1 and 10 mm square in models 2 and 3, the number of cells in the tile reflects the degree of circuit integration. Table 2 includes the data of per-tile cell population, and Fig. 9 is the graphic presentation of the data. The number of L-cells on the tile increased at the pace of about tenfold in five years. The per-tile M-cell population is influenced by the presence of control circuits on actual chips. In the present model, the control circuits are implicitly accounted for by the enlargement of memory cell. In general, an actual memory chip at a lower hierarchy level (located closer to L-cell) has a smaller number of memory cells and yields a larger share of its area to the implementation of control circuits; hence, the M-cell population in the model becomes smaller at a lower hierarchy level. However, in model 3, the memory chip of the highest cell density is assumed as the building block at all hierarchy levels. This assumption is employed to amplify the feature of server computing where search logic is implemented mostly in the logic circuit blocks, and the memory capacity is elevated to high levels at all hierarchy levels.

Figure 10 shows the plot of the side lengths of the cells in the three generations of the computer model. The length scale of the L-cell decreased from 0.537 mm in model 1, to 0.0887 mm in model 2, and to 0.0281 mm in model 3. The length scales of the memory cells in models 1 and 3 are nearly one-tenth that of the L-cells, with the smallest scales pertaining to M3 (main memory). The spread of cell scales between the hierarchy levels is large in model 2, while the memory cell sizes collapse to a single point in model 3. These patterns seen in the plot of memory cell sizes reflect the design options described in the previous paragraphs.

The solid curve in Fig. 10 is the trace of the technology node data found in ITRS reports [10–12]. Note that the technology node is the half-pitch of the metal lines on a transistor. The figure of the technology node is smaller than the length scale of a circuit cell by one or more orders of magnitude. To be noted is the trend of cell sizes that is derived from the dimensional data of actual computers without explicitly referring to the technology nodes at the times of those computer designs. The rate of reduction of cell sizes is nearly parallel to that of the technology node. This gives us a certain rationale for the application of the modeling methodology to project how the circuit technology drives the evolution of system morphology.

Figure 11 summarizes the physical volumes occupied by the model computer systems. The net volume on the vertical axis needs some explanation. The net volume of model 1 is the sum of the volumes of the three sets of PWB arrays (Fig. 3). The net volume of model 2 is the sum of the areas of the four mother boards (Fig. 6) times the assumed width of free space, 100 mm, in front of the mother board sets. The net volume of model 2 almost coincides with that of model 1. The net volume of model 3 is the area of the mother board (C1-card) times the width of the main memory board (MC1-card) times two (for the two sets of C1/MC1 assembly, Figs. 7 and 8). Note that the space occupied by the actual computer is much larger than the figures of Fig. 11 due to the need to accommodate components such as disk drives, voltage regulators, fans, blowers, and pumps. The actual system space also includes the space for maintenance work. In the following discussion, these factors that influence the actual system volume will be set aside to focus our attention on the issues involved in the processor unit per se.
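As a bookkeeping aid, the volume rules for models 2 and 3 can be written out directly; a sketch (Python) in which the board dimensions are left as arguments, since the actual values come from Table 1 and the figures (the inputs in the example are hypothetical):

def net_volume_model2(mother_board_area_m2: float, n_boards: int = 4,
                      free_space_width_m: float = 0.1) -> float:
    """Model 2: sum of the mother-board areas times the assumed 100 mm free space."""
    return n_boards * mother_board_area_m2 * free_space_width_m

def net_volume_model3(mother_board_area_m2: float, memory_board_width_m: float,
                      n_assemblies: int = 2) -> float:
    """Model 3: mother-board (C1-card) area times the main-memory board width,
    counted for the two C1/MC1 assemblies."""
    return n_assemblies * mother_board_area_m2 * memory_board_width_m

# Hypothetical inputs, for illustration only:
print(net_volume_model2(0.5))        # 4 boards x 0.5 m^2 x 0.1 m = 0.2 m^3
print(net_volume_model3(0.5, 0.15))  # 2 x 0.5 m^2 x 0.15 m = 0.15 m^3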

The trend of the net volume figures presented in Fig. 11 is the result of interactions of several parameters. The most relevant parameters are those pertaining to L-cells and M3-cells (main memories); both L- and M3-cells have appreciable impacts on the system volume. The L-cell population is less than the M3-cell population by two to three orders of magnitude, but the size of the L-cell is larger than that of the M3-cell by a factor of ten or more. Besides, the placement of L-tiles requires a relatively large intertile spacing where circuit test pads and intertile communication channels are provided. On the other hand, the M3-cells are the smallest of all the cells constituting the system, but the M3-cell population is dominantly large. Meanwhile, the M3-tiles can be packed with only a narrow intertile spacing for the following reasons. First, the function of the memory tile, which is storage of data, entails a low level of intertile communications, so that the wiring on the M-card is relatively simple and does not require a wide space. Second, the simple circuit organization on the M-card reduces the need for circuit test pads, another space-saving factor.

The interplay of these parameters (the cell sizes, the cell populations in the system, and the deployment of components in a plane or in an array) illustrated by the above three models will also be a major mechanism driving the evolution of hardware morphology in future generation computers. Figure 12 depicts the coupling of the cooling scheme with the geometric construction of the computing system. Air cooling (Fig. 12(a)) was used in the mainframe simulated by model 1. The computers corresponding to models 2 and 3 had high rates of heat dissipation and thus required high-performance cooling. Fortunately, in the two-dimensional system configuration, ample space was created in front of the mother board where water-cooled cold plates and coolant piping could be installed (Fig. 12(b)). Note that model 3 still has a two-dimensional deployment of logic MCMs, and hence ample space for cooling these modules.

Near the end of the corresponding era (the mid-1980s–late 1990s), further increase of heat dissipation from computers was expected, and research on high-performance cooling gained momentum. Surveying the literature of that period, one finds that in most heat transfer research, the availability of a wide space was tacitly assumed, and little concern was given to the space required by cooling devices and coolant distribution hardware. Notable examples of space-greedy cooling devices are bulky air-cooled heat sinks and jet-impingement devices. Microchannel cooling has become a popular research topic, but the space required for coolant pipe connections has not been given due attention. Disregard for the space requirement of cooling devices has been carried over into recent heat transfer research. The discussion in Secs. 6 and 7 addresses the question of how long such an assumption will remain applicable to cooling large-scale computers.

The evolution of the microprocessor dates back to the early 1970s. An early microprocessor was composed of several functional and memory chips (a chip set). By the early 1990s, the mainframe manufacturers began adopting microprocessors as the base building blocks of large-scale computers. Microprocessors themselves have evolved, and their processing performance and cost now span a certain range. The labels “high-end” and “low-end” microprocessor testify to the spread of microprocessor applications to the entire spectrum of computers.

The microprocessor has component circuits that perform basic operations such as add, multiply, route, conditional branch, and others. Added to these functional circuits is a small memory block called “cache” memory. The progress of circuit fabrication technology has enabled the adoption of a parallel processing scheme in microprocessor design. To implement parallel processing circuits, a block of functional circuits together with a cache memory is used as the building block unit, and the unit block is replicated multiple times in the chip area. In this context, the unit circuit block has acquired the name core. Multicore processors began appearing in the market in the early 2000s, although some forerunners were developed earlier. The number of cores has steadily increased, starting from 2 to 4, 8, 16, and beyond. In future generation microprocessors, a possible increase of core populations to hundreds and even a thousand is envisioned [13]. “Many-core processor” is a label for such advanced versions.

In the following analysis, we intend to describe a relationship between the number of cores and the length scale of cells on the microprocessor chip. The core corresponds to L-block of our geometric model, and C0-card of Fig. 1(b) is now the chip. Hereafter, the terms core and “L-block” will be used interchangeably, depending on the context of narrative: core when we extend our thought to actual microprocessors, and L-block when we include logic and memory circuits together in our scope.

We suppose a model of a microprocessor chip as depicted in Fig. 13. The chip has an area, lC0 × lC0, which is partitioned into areas for L-blocks and a memory block. nLB,x L-blocks are packed side-by-side in a row, and nLB,y rows of L-block arrays are placed symmetrically in the chip area. The illustration in Fig. 13 shows an example where nLB,x = 8 and nLB,y = 2, hence 16 cores. Two levels of memory are supposed: the level-1 (M1) memory block is included in the L-block, and the level-2 (M2) memory block occupies the middle half of the chip area.

The dimensions of circuit blocks are as follows: L-block lLB,x × lLB,y, M1-block lLB,x × lM1B,y, and M2-block lC0 × lC0/2. Each block is filled with cells of specific function. Logic (L-) cells occupy a major part of the L-block area. The M1-block strip is included in the L-block area, where the M1-cells are packed. The number of L-cells in L-block, denoted as NL/B, represents the complexity of circuit organization in the block. The M2-cells fill the M2-block area which is half the chip area. All cells are square, and the cell's side length is denoted as lL, lM1, and lM2, for L-, M1-, and M2-cell, respectively. The cell's side length, the block area, and the number of cells in the block are mutually dependent parameters.

We suppose a scenario of technology development in which the complexity of the building block circuits in the L-block, hence NL/B, is invariant across the successive generations of microprocessors. Also assumed to be invariant is the design policy regarding the memory capacities associated with the L-blocks. An M1-block serves an L-block, and its relative capacity is represented by a ratio: (the number of M1-cells in the M1-block)/(the number of L-cells in the L-block). Since this ratio can be extended to the whole chip, we write ρM1/L ≡ NM1/NL, where NM1 is the number of M1-cells and NL the number of L-cells on the chip. The M2-block serves all the M1-blocks on the chip, and the capacity ratio is ρM2/M1 ≡ NM2/NM1, where NM2 is the number of M2-cells on the chip. In our scenario, these ratios, ρM1/L and ρM2/M1, are fixed across the successive generations of microprocessors.

We choose a reference microprocessor as the starting point of the scenario development and specify the values of some of the parameters involved in the model. The reference microprocessor of our choice is the SPARC64 X [14], first shipped in 2012. The model corresponding to this microprocessor has 16 cores, and its parameter values are specified in Table 4. In the process of specifying the parameter values, some information from another microprocessor [15] is used. The process of parameter evaluation is described in Appendix A. In the following, we will estimate how the dimensions and populations of cells on the chip respond as we vary the number of cores below and above 16. Hence, the 16-core model provides an anchor point in our extrapolation study; we will refer to it as the “anchor-point model.” For the anchor-point model, the values of the parameters derived from the original literature [14,15] are modified such that some numbers are rounded, while others are tailored to fit the cells in the floor plan of Fig. 13.

We set some of the parameters invariant while we vary the number of cores, NLB. The fixed parameters are as follows: the chip side length, lC0 (20 mm); the number of L-cells in the L-block (core), NL/B (3.125 × 10^4); the cell number ratios, ρM1/L and ρM2/M1 (40.01 and 111.11, respectively); and the ratio of M1-block height to L-block height, βM1B = lM1B,y/lLB,y (0.15). Also assumed to be invariant is the area occupied by the M2-block; the M2-block extends across the chip width and over half the chip height (10 mm).

A new value of NLB is set as 2^n × 16, where n is an integer (negative or positive). Division of NLB into nLB,x and nLB,y sets a floor plan, where the block dimensions become lLB,x = lC0/nLB,x, lLB,y = lC0/(2 nLB,y), and lM1B,y = βM1B · lLB,y. The total numbers of cells on the chip follow from the invariance of NL/B, ρM1/L, and ρM2/M1 as NL = NLB · NL/B, NM1 = NL · ρM1/L, and NM2 = NM1 · ρM2/M1. The side lengths of the cells are calculated from

(1) L-cell: $l_L = l_{C0}\sqrt{(1 - \beta_{M1B})/(2N_L)}$

(2) M1-cell: $l_{M1} = l_{C0}\sqrt{\beta_{M1B}/(2N_{M1})}$

(3) M2-cell: $l_{M2} = l_{C0}/\sqrt{2N_{M2}}$
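A minimal numerical sketch of Eqs. (1)–(3) (Python), using the invariant parameters quoted above (lC0 = 20 mm, NL/B = 3.125 × 10^4, ρM1/L = 40.01, ρM2/M1 = 111.11, βM1B = 0.15). The printed values are only meant to illustrate the trends plotted in Figs. 14 and 15, not to reproduce the paper's tabulated numbers:

import math

# Invariant parameters of the anchor-point (16-core) model, as quoted above.
L_C0 = 20.0          # chip side length, mm
N_L_PER_B = 3.125e4  # L-cells per L-block (core)
RHO_M1_L = 40.01     # M1-cells per L-cell
RHO_M2_M1 = 111.11   # M2-cells per M1-cell
BETA_M1B = 0.15      # M1-block height / L-block height

def chip_cells(n_cores: int):
    """Cell populations and side lengths (mm) for a chip carrying n_cores
    L-blocks, following Eqs. (1)-(3)."""
    n_l = n_cores * N_L_PER_B
    n_m1 = n_l * RHO_M1_L
    n_m2 = n_m1 * RHO_M2_M1
    l_l = L_C0 * math.sqrt((1.0 - BETA_M1B) / (2.0 * n_l))
    l_m1 = L_C0 * math.sqrt(BETA_M1B / (2.0 * n_m1))
    l_m2 = L_C0 / math.sqrt(2.0 * n_m2)
    return n_l, n_m1, n_m2, l_l, l_m1, l_m2

for n_cores in (4, 16, 64, 256, 1024):   # the range swept in Fig. 14
    n_l, n_m1, n_m2, l_l, l_m1, l_m2 = chip_cells(n_cores)
    print(f"{n_cores:4d} cores: N_S = {n_l + n_m1 + n_m2:.2e}, "
          f"l_L = {l_l * 1e3:.1f} um, l_M2 = {l_m2 * 1e6:.0f} nm")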

In Fig. 14, the cell populations on the chip are plotted against the number of cores, which is varied from 4 to 1024. The number of M2-cells dominates the cell population on the chip, so that the curves of the sum NS = NL + NM1 + NM2 and of NM2 overlap. By contrast, the number of L-cells, NL, is only a marginal fraction of NS, even though the L-cells occupy about half the chip area.

The reference microprocessor, SPARC64 X [14] (shipment year 2012), has its predecessors, SPARC64 VII [16] (2008) and SPARC64 VIIIfx [17] (2010), and its successor, SPARC64 XIfx [18] (2015). In the successive generations, the number of cores has increased from 4 (SPARC64 VII), to 8 (SPARC64 VIIIfx), 16 (SPARC64 X), and 34 (SPARC64 XIfx). The literature reports the total transistor counts on the 8-, 16-, and 34-core chips. These data are included as open circles in Fig. 14.

The data from the actual processors fall close to the total number of cells, NS, and the number of M2-cells, NM2. The close proximity of the transistor counts to the cell populations corroborates the supposition that the M2-cell is composed of at most a few transistors. Also, the agreement between the trend of the model-based cell population and that of the actual data gives us confidence in extending the modeling to future generations of many-core processors.

It should be noted that the quantitative prediction of cell population is valid for this particular family of microprocessors. Other microprocessor families are designed on the basis of different combinations of L- and M-cell populations, that is, different architectures. Hence, for a different microprocessor family, the calculations need be repeated supposing a different anchor model and setting the invariant parameters to appropriate values. However, comprehensive coverage of various microprocessor families is outside the scope of this work. In Secs. 5 to 7, we study the morphology of computing systems composed of microprocessor chips. There, we will continue to use the model developed as above.

The minimum length scale is the side length of the M2-cell, lM2, and it is related to the technology node F in nanometers. The conversion of lM2 to F is based on the observation of the data in Fig. 10; that is, F = 10^5 × lM2, where lM2 is in millimeters. Figure 15 shows the plot of F against the number of cores, NLB. The calculations give realistic values of the technology node. Open circles are the values of the technology node reported in the literature on the SPARC64 family [14,16–18].
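The conversion is a one-liner; a sketch (Python), assuming lM2 is expressed in millimeters as stated above:

def tech_node_nm(l_m2_mm: float) -> float:
    """Technology node F (nm) from the M2-cell side length (mm): F = 1e5 * l_M2."""
    return 1.0e5 * l_m2_mm

# Example: an M2-cell side of 4.24e-4 mm (0.424 um) maps to F of about 42 nm.
print(tech_node_nm(4.24e-4))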

Again, it is cautioned that the relationship between F and NLB exhibited in Fig. 15 is valid only for this microprocessor family. While the technology node is the measure of the progress of circuit fabrication technology, the number of cores is in general open to the designer's option. For example, the experimental 80-core processor [15] is designed on the 65 nm technology node. Nonetheless, Fig. 15 is a manifestation of the growing technical challenge facing microprocessor development. As we increase the number of cores beyond 100, we will have to reduce the technology node to a single-digit figure.

In this section, we examine system packaging referring to the data of server computers of the 2000s. We work with a model where the logic chip has the circuit layout of a microprocessor; that is, the cores (L-blocks) and level-2 memory cells (M2-cells) share the chip area. The model approximates a server computer of the early 2000s [19,20]. Some of the approximations concern the physical dimensions of components, and others simplify the component layout. The logic chip of this model is an 8-core microprocessor, illustrated in the lower left corner of Fig. 16. This is an LC0-card in the previously defined terminology and is customarily called the CPU chip. Two levels of memory, level-1 and level-2 (cache memories), are co-implemented with the cores (L-blocks) on the chip. The system has two additional levels of memory, level-3 and -4. The level-3 memory cells (M3-cells) are packed on the memory chip designated as the M3C0-card. The memory chip has the same area as the CPU chip, that is, 20 mm × 20 mm. The next level structure is a multichip module (MCM). The MCM has 8 × CPU chips and 8 × level-3 memory chips in its area, 100 mm × 100 mm. We assign “1” to the MCM in the structural level numbering; hence, the MCM is an LC1-card.

At a still higher level, that is, level 2, we have a PWB, designated as the C2-card. The PWB has a large area, 600 mm × 840 mm, close to 1 m^2, which is an upper bound set by the cost of manufacturing. On the PWB, the MCMs are laid out in two rows with four in each row. The PWB area between the MCM rows is occupied by level-4 memory packages. Each level-4 memory package (M4C0-card) occupies an area of 20 mm × 20 mm, and 528 memory packages are laid out in 24 columns × 22 rows.

The technology node for this generation of microprocessor is 45 nm (Fig. 15). This is the nominal technology node number, and it is slightly modified to 42.43 nm in calculations to maintain the number of L-cells in L-block (core), NL/B, at 3.125 × 10^4 specified for the anchor-point model in Table 4. The side length of L-cell in this 8-core chip is lL = 26.07 μm. The number of L-cells per chip is NL/C0 = 2.223 × 10^5, that per MCM (LC1-card) NL/LC1 = 1.778 × 10^6, and that per PWB (C2-card) NL/C2 = 1.423 × 10^7.

We suppose that the minimum cell length corresponding to the technology node applies to level-2, -3, and -4 memory cells; that is, lM2 = lM3 = lM4 = 0.424 μm. The cell populations per chip are as follows: NM2/chip = 1.111 × 10^9, NM3/chip = NM4/chip = 2.222 × 10^9, and those per PWB, NM2/PWB = 7.110 × 10^10, NM3/PWB = 1.422 × 10^11, and NM4/PWB = 1.173 × 10^12. The level-1 (M1) memory cell has a side length lM1 = 1.731 μm, and the cell population on the chip is NM1/chip = 1.001 × 10^7.

Some of these memory cell populations are compared with the memory capacities in an actual server computer, the PRIMEPOWER 2500 [20]. The comparison involves converting the original data, reported in units of bytes, to the measure of information contained in M bits (log_2 M [21]). The conversion shows that the memory capacities assumed in the model and those of the actual computer are mutually close; their values of information content agree within 3–17%.

The ratios of memory capacities on the PWB in the model are as follows: ρM4/M3 = 8.25, ρM3/M2 = 2, and ρM2/M1 = 111. Note that the last figure, for ρM2/M1, is equal to that assumed for the anchor-point model processor (Table 4). This coincidence symbolizes the continuity of the memory partition policy from the 8-core to the 16-core processor chip. (In chronological order, the 8-core chip preceded the 16-core chip, so the architecture was actually developed first for the 8-core chip.)

Comparison of the numbers calculated for this model with those for model 3 of the mid-1990s server illuminates the composition of the early 2000s server processor. The length scale of the main memory cells is an order of magnitude smaller than that in model 3 (0.45 μm versus 2.887 μm). Miniaturization of memory cells allowed the accommodation of about 10^12 cells on the PWB, which is about two orders of magnitude larger than the 6.6 × 10^10 cells in the two-board system (Fig. 8). On the other hand, the length scale of the L-cell, lL, is comparable to that of model 3 (27.65 μm versus 28.05 μm). Concomitantly, the population of L-cells per PWB, NL/PWB (= NL/C2), is of a similar magnitude to that in the model 3 system (1.423 × 10^7 versus 3.456 × 10^7). From these numbers, we see that dense packing of a greater number of memory cells is the notable characteristic of the recent generation server processor.

The PWB (C2-card) of Fig. 16 simulates an actual PWB in a server processor. On an actual PWB, logic and memory modules are mounted together with other components such as control modules and voltage regulators. Such a PWB functions as a subsystem of a larger system, and hence is called a “system board.” A system board composition provides convenience in the following way. That is, to expand the system's computing capacity, a certain number of boards are added to the existing boards without the need for modification of or rework on the existing boards. Hence, scaling up the computing system is facilitated by using system boards, and the system board is the manifestation of modularizing building block components. In other words, scalability is brought about by modularization.

In a scaled-up system, multiple PWBs are housed in a rack, as shown in Fig. 17. Typical dimensions of the rack are 1 m (width) × 1 m (depth) × 2 m (height). The PWBs are placed in an array with the placement pitch, lC2,z. Two PWB arrays at maximum can be accommodated in the rack, one above the other.

The number of PWBs in the rack gives the sum of cell populations, NS = NL + NM1 + NM2 + NM3 + NM4, which we call “system size.” Conversely, for a given system size, NS, the required number of PWBs is calculated. There is an upper bound for the number of PWBs that can be accommodated in one rack. The upper bound is defined by the minimum board placement pitch, the rack width (1 m), and the number of arrays (two). The minimum board placement pitch is set at 25.4 mm. This is a lower bound for the slot pitch on a mother board (not shown in Fig. 17) that provides interconnections between PWBs. The upper bound on the number of PWBs in the rack is calculated as 76. Where the required number of PWBs is less than this upper bound, a spare space is created in the rack. Where the number of PWBs is less than 38 (the upper bound for one PWB array), the PWBs are placed in one array with a pitch wider than 25.4 mm, and a half of the rack space can be used for other purposes such as installation of power supply units, hard disk drives, pumps, or blowers. In cases where the required number is between 38 and 76, a sensible option is to place PWBs in two arrays with a placement pitch wider than 25.4 mm.
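The rack accounting just described reduces to a few divisions; a sketch (Python), in one plausible reading in which the boards of an array are spread evenly across the 1 m width. The per-PWB cell capacity is passed in as a parameter; the value of roughly 1.46 × 10^12 cells per PWB used in the example is inferred from the one-rack maximum quoted in the next paragraph (1.109 × 10^14 cells over 76 boards) and is an assumption, not a number taken from the paper's tables:

import math

RACK_WIDTH_MM = 1000.0   # width of one PWB array
MIN_PITCH_MM = 25.4      # minimum connector-slot pitch on the mother board
MAX_PER_ARRAY = 38       # upper bound of PWBs per array (two arrays per rack)

def rack_layout(n_s: float, cells_per_pwb: float):
    """Number of PWBs, number of arrays, and placement pitch (mm) needed to
    house a system of n_s cells, following the packing rules described above."""
    n_pwb = math.ceil(n_s / cells_per_pwb)
    if n_pwb > 2 * MAX_PER_ARRAY:
        raise ValueError("system does not fit in one rack")
    n_arrays = 1 if n_pwb <= MAX_PER_ARRAY else 2
    per_array = math.ceil(n_pwb / n_arrays)
    pitch = max(MIN_PITCH_MM, RACK_WIDTH_MM / per_array)
    return n_pwb, n_arrays, pitch

# Examples with an assumed capacity of ~1.46e12 cells per PWB:
print(rack_layout(1.0e13, 1.46e12))   # (7, 1, ~142.9 mm)
print(rack_layout(1.0e14, 1.46e12))   # (69, 2, ~28.6 mm)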

Figure 18 shows the plot of the system size versus the PWB placement pitch, lC2,z. Also shown on the horizontal axis is the number of PWBs in the array, NC2/array. The maximum system size for one rack is NS = 1.109 × 10^14, where the PWBs are packed in two arrays with the placement pitch of 25.4 mm. The space between neighboring PWBs is an important parameter for cooling design. A wide space allows a large degree-of-freedom in cooling design. In the early 2000s server [19], the inter-PWB space is around 100 mm, allowing the installation of large air-cooled heat sinks on the MCMs. To be noted from the data plots in Fig. 18 is the rate at which the inter-PWB spacing shrinks with increasing system size in one rack.

Accommodation of more cells than the above maximum number (1.109 × 10^14) in one rack is made possible by denser packing of logic and memory cells than is achieved by the cell deployment scheme of Fig. 16. The cell accommodation capacity of a one-rack system is determined not only by the cell dimensions but also by the scheme of cell deployment. As we will see next, the concentration of L-cells on the chip, and further on the PWB, helps to reduce the number of PWBs needed to accommodate a given NS in one rack. In such a cell deployment, some PWBs carry only L-blocks and others only M-cells. The PWB in the former category is designated as the LC2-card and that in the latter as the MC2-card. Certain numbers of LC2- and MC2-cards form a unit; the MC2-cards serve the LC2-cards in the same unit. Figure 17 includes a sketch of a unit, and the number of units in the rack is denoted as Nunit.

The circuit layout of Fig. 16 is characterized by the colocation of logic and memory blocks on the chip, the MCM, and the CPU PWB. In this section, we consider the cases where the logic chips are fully loaded with cores (LC0-cards carrying only L-blocks), and the multichip modules (MCMs) carry only logic chips (LC1-cards containing only LC0-cards). Memory cells of level 2 and higher have equal dimensions, and they are packed on memory chips. The memory chips are loaded directly on the printed wiring boards (PWBs, C2-cards). Among the possible layouts of the components, we focus on two layouts. In layout A, the PWB (C2-card) carries a mix of MCMs (LC1-cards) and level-2 memory chips (MC0-cards). An example of layout A is shown in Fig. 19. In layout B, the MCMs and the level-2 memory chips are mounted on separate PWBs. An example of layout B is shown in Fig. 20. In both layouts, the level-3 and -4 memory chips are mounted on the memory PWBs (denoted as MMC2-cards). All the PWBs in the system are assumed to have equal dimensions. Table 5 is the list of fixed parameters and their values.

The parameter values in Table 5 are equal to those assumed for the 8-core system of Fig. 16. Some of them are also equal to those specified for the 16-core microprocessor model in Table 4. Thus, we maintain consistency in the basic parameters throughout the model generations. The primary variables are the system size (NS) and the number of cores (L-blocks) on the chip (NLB/LC0).

Since the L-cell population in the core (NL/B) is fixed in this model, the physical size of the L-cell (lL) needs to decrease with an increasing number of on-chip cores (NLB/LC0). Thus, the number of on-chip cores is restricted by the circuit fabrication technology, and the technology is represented by the value of the technology node (F). Table 6 defines the correspondence between NLB/LC0 and F, where, besides the nominal values, the F values used in calculations are shown. Some of the nominal F values are slightly modified to fit the specified number of L-cells (NL/B) in the core area as well as to maintain the ratios of memory capacities to the L-cell population at the specified values (Table 5). The calculation steps are described in what follows.

First, we set the number of on-chip cores (NLB/LC0), choosing the number of L-blocks in a row (nLB,x) and that in a column (nLB,y) so as to make the ratio nLB,x/nLB,y = 1 or 2. The number of L-cells on the chip follows from

(4) $N_{L/LC0} = N_{L/B}\,N_{LB/LC0}$

where NL/B is given as 3.125 × 10^4. Since the layout of chips in the MCM is fixed (Table 5), the number of L-cells in the MCM (LC1-card) follows from

(5) $N_{L/LC1} = N_{L/LC0}\,N_{LC0/LC1}$

where NLC0/LC1 = lLC1^2/(lC0 + lU0)^2. The number of M1-cells on the chip is NM1/LC0 = NL/LC0 · ρM1/L. The side length of L-cell is calculated using Eq. (1) with 2NL being replaced by NL/LC0. The side length of M1-cell is calculated using Eq. (2) with 2NM1 being replaced by NM1/LC0.

As for memory cells of level-2 and higher, we assume the uniformity of memory cell dimensions across the levels; namely, lM2 = lM3 = lM4 (denoted as lM). The memory cell side length is related to the technology node F (nm) as lM = 10^-5 F (mm). These assumptions are consistent with those made earlier for the 8-core processor-based system of Fig. 16. The memory cell population on the chip is calculated from

(6) $N_{M/MC0} = (l_{C0}/l_M)^2$

Equations (4)–(6) involve the parameters specified in Table 5 and the number of on-chip cores; thus, they are independent of the system size.
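A sketch of this per-chip bookkeeping, Eqs. (4)–(6), in Python. The chip and MCM side lengths (20 mm and 100 mm) are taken from the Fig. 16 system, while the inter-chip allowance lU0 = 5 mm is an assumption chosen so that 16 chips fit on one MCM; Table 5 itself is not reproduced in this excerpt, so all fixed values here should be read as assumptions:

# Assumed fixed parameters (stand-ins for Table 5, which is not reproduced here):
L_C0 = 20.0      # chip side length, mm
L_LC1 = 100.0    # MCM (LC1-card) side length, mm
L_U0 = 5.0       # assumed inter-chip allowance, mm (gives 16 chips per MCM)
N_L_PER_B = 3.125e4   # L-cells per core
RHO_M1_L = 40.01      # M1-cells per L-cell

def per_chip_counts(n_cores_per_chip: int, f_node_nm: float):
    """Eqs. (4)-(6): L-cells per logic chip and per MCM, M1-cells per chip,
    and memory cells per memory chip at technology node f_node_nm (nm)."""
    n_l_lc0 = N_L_PER_B * n_cores_per_chip                 # Eq. (4)
    n_lc0_per_lc1 = int(L_LC1**2 // (L_C0 + L_U0)**2)      # logic chips per MCM
    n_l_lc1 = n_l_lc0 * n_lc0_per_lc1                      # Eq. (5)
    n_m1_lc0 = n_l_lc0 * RHO_M1_L                          # M1-cells per chip
    l_m = 1.0e-5 * f_node_nm                               # memory cell side, mm
    n_m_mc0 = (L_C0 / l_m)**2                              # Eq. (6)
    return n_l_lc0, n_l_lc1, n_m1_lc0, n_m_mc0

# Example: a 64-core logic chip at an assumed node of about 21 nm.
print(per_chip_counts(64, 21.21))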

We now consider the construction of a system of size NS, where NS is the total number of cells involved in the system:

(7) $N_S = N_L + N_{M1} + N_{M2} + N_{MM}$

In the above equation, NL, NM1, and NM2 are the populations of L-, M1-, and M2-cells in the system, respectively. NMM is the sum of M3- and M4-cells in the system. The memory chips carrying level-3 and -4 memory cells are supposed to be mounted on the PWBs separately from those carrying MCMs and lower-level memory chips. Hence, the M3- and M4-cells will be treated collectively and denoted as MM-cells hereafter.

Since the population ratios are specified as in Table 5, the terms on the right-hand side of Eq. (7) are individually related to NS. The population of L-cells in the system is written as

(8) $N_L = N_S/\rho_{S/L}$

where ρS/L=1+ρM1/L+ρM2/L·(1+ρMM/M2), ρM2/L=ρM1/L·ρM2/M1, and ρMM/M2=(ρM4/M3+1)·ρM3/M2.

The populations of M2- and MM-cells in the system are written as, respectively,

(9) $N_{M2} = \rho_{M2/L}\,N_L$

(10) $N_{MM} = \rho_{MM/M2}\,N_{M2}$

The cell populations of Eqs. (8)–(10) are converted to the numbers of MCMs and memory chips in the system as follows:

Number of MCMs (LC1-cards)/system:

(11) $N_{LC1} = N_L/N_{L/LC1}$

Number of level-2 memory chips (M2C0-cards)/system:

(12) $N_{M2C0} = N_{M2}/N_{M/MC0}$

Number of level-3 and -4 memory chips (MMC0-cards)/system:

(13) $N_{MMC0} = N_{MM}/N_{M/MC0}$
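Building on the per-chip sketch above, the system-level split of Eqs. (7)–(13) follows from the fixed capacity ratios. A sketch (Python); the ratio values are those quoted for the Fig. 16 system and the anchor-point model and are assumed to coincide with Table 5, which is not reproduced here:

# Capacity ratios (assumed to match Table 5; values quoted earlier in the text):
RHO_M1_L = 40.01       # M1-cells per L-cell
RHO_M2_M1 = 111.11     # M2-cells per M1-cell
RHO_M3_M2 = 2.0        # M3-cells per M2-cell
RHO_M4_M3 = 8.25       # M4-cells per M3-cell

def system_counts(n_s: float, n_l_per_lc1: float, n_m_per_mc0: float):
    """Eqs. (7)-(13): split a system of n_s cells into L-, M2-, and MM-cell
    populations and convert them to numbers of MCMs and memory chips."""
    rho_m2_l = RHO_M1_L * RHO_M2_M1                  # M2-cells per L-cell
    rho_mm_m2 = (RHO_M4_M3 + 1.0) * RHO_M3_M2        # MM-cells per M2-cell
    rho_s_l = 1.0 + RHO_M1_L + rho_m2_l * (1.0 + rho_mm_m2)
    n_l = n_s / rho_s_l                              # Eq. (8)
    n_m2 = rho_m2_l * n_l                            # Eq. (9)
    n_mm = rho_mm_m2 * n_m2                          # Eq. (10)
    n_lc1 = n_l / n_l_per_lc1                        # Eq. (11): MCMs per system
    n_m2c0 = n_m2 / n_m_per_mc0                      # Eq. (12): level-2 memory chips
    n_mmc0 = n_mm / n_m_per_mc0                      # Eq. (13): level-3/4 memory chips
    return n_lc1, n_m2c0, n_mmc0

# Example: a 1e14-cell system built from the 64-core chips of the previous sketch
# (about 3.2e7 L-cells per MCM and 8.9e9 cells per memory chip, both assumed).
print(system_counts(1.0e14, 3.2e7, 8.9e9))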

We want to determine whether a system of a given size can be housed in one rack. Further, where a one-rack system is possible, we ask how to deploy the MCMs, memory chips, and PWBs so that a spare space is created in the rack for the installation of cooling devices and other purposes. In addressing these issues, we assume a policy on the assembling of functional parts. The policy dictates that MCMs (LC1-cards) and level-2 memory chips (M2C0-cards) are placed in physical proximity, but it allows two options as illustrated in Figs. 19 (layout A) and 20 (layout B). We will use the term “unit of cards,” “card unit,” or simply “unit” to mean a set of PWBs where minimum numbers of L- and M2-cells are accommodated in the specified proportion ρM2/L. In layout A, the unit is a PWB of Fig. 19, where some of the memory chip lots are left unoccupied so that the M2-cell population meets the condition set by ρM2/L. In layout B, the unit is a set of MCM-laden PWBs (LC2-cards) and level-2 memory PWBs (MC2-cards). The memory PWB in layout B has 1584 chip lots on its two sides, and the number of filled lots depends on the L-cell population in the unit. The formulas to determine the component layout are described in Appendix B. The number of PWBs in the C2-unit is summarized as follows:

(14) For layout A: $N_{C2/unit} = 1$. For layout B: $N_{C2/unit} = N_{LC2/unit} + N_{MC2/unit} \le 5$, with $N_{LC2/unit} \le 4$ and $N_{MC2/unit} = 1$.

The condition for the C2-unit with layout B is described in detail in Sec. B.2 in Appendix B.

Meanwhile, the memory chips of higher levels (MMC0-cards) are accommodated in a block of MM-PWBs (MMC2-cards). In a one-rack system, these memory PWBs share the rack space with the units of LC2- and MC2-cards, so that they pose an overhead on the space budget in the rack. The steps to determine the number of MM-PWBs (MMC2-cards) in the system, NMMC2, are described in Appendix C. The total number of PWBs in a one-rack system, NC2/R, is written as

(15) $N_{C2/R} = N_{units/sys}\,N_{C2/unit} + N_{MMC2}$

where Nunits/sys is the number of C2-units in the system. These PWBs are housed in a rack under the following conditions:

  • High-level memory PWBs (MMC2-cards) are packed at the placement pitch of 25.4 mm (denoted as l*C2,z), which is the minimum pitch of PWB placement defined by the connector slot pitch on the motherboard.

  • The placement pitch of PWBs in C2-unit (C2-cards in layout A; LC2- and MC2-cards in layout B), called hereafter CPU-PWBs, is restricted from below by l*C2,z = 25.4 mm. Where the space allows, the placement pitch of these cards is widened to increase the intercard space.

Calculations are performed to determine how large a system can be housed in one rack. We are also interested in how the placement pitch for PWBs in the C2-unit depends on the layout scheme of logic and level-2 memory components (layouts A and B). The details of the calculations are described in Appendix C.
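
The pitch-widening rule stated in the two bullets above can be condensed into a few lines of code. The sketch below is one plausible reading, not the Appendix C procedure itself: it assumes a single stacking length available per card array in the rack (the parameter L_stack_mm below is a made-up placeholder), packs the MM-PWBs at the minimum pitch, and divides the remaining length evenly among the CPU-PWBs.

```python
# One plausible reading of the rack-packing rule above (the actual geometry
# is set in Appendix C). L_stack_mm is an assumed stacking length available
# per card array in the rack -- a placeholder, not a value from the paper.

L_STAR_C2Z = 25.4  # minimum PWB placement pitch, mm

def cpu_pwb_pitch(n_cpu_pwbs, n_mm_pwbs, L_stack_mm=1000.0):
    """Widened placement pitch of CPU-PWBs in a one-rack system, mm."""
    # High-level memory PWBs (MMC2-cards) are packed at the minimum pitch.
    left_for_cpu = L_stack_mm - n_mm_pwbs * L_STAR_C2Z
    if left_for_cpu < n_cpu_pwbs * L_STAR_C2Z:
        raise ValueError("does not fit in one rack under this reading")
    return max(L_STAR_C2Z, left_for_cpu / n_cpu_pwbs)

# e.g., four CPU-PWBs sharing the length left over by three memory PWBs
print(cpu_pwb_pitch(n_cpu_pwbs=4, n_mm_pwbs=3))
```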

Figure 21 shows the placement pitch of CPU-PWBs, lC2,z, versus the total system size, NS. The parameters are the number of on-chip cores, NLB/C0, which is varied from 64 to 1024, and the layout type (A, B). The curves are labeled with NLB/C0 and the layout type. A large lC2,z means a small number of CPU-PWBs in the rack. As noted previously, the increase in the number of on-chip cores is supposed to follow the progress on the technology node scale. Obviously, with the advancement of the technology node, a wider space becomes available in the rack. The effect of the layout type on the PWB placement pitch is noteworthy. With NLB/C0 given, the curves for layouts A and B are mutually close in the range of relatively small system size NS. With increasing NS, layout B yields a clear advantage over layout A in relaxing the spatial condition in the rack (increasing lC2,z). That is, concentration of L-cells on dedicated PWBs and accommodation of M2-cells on separate PWBs is a way to reduce the total number of PWBs in the rack. The advantage of layout B over A comes largely from the increased cell accommodation capacity of the memory PWBs; the level-2 memory chips are mounted on both sides of the memory PWB (MC2-card). Where logic processing components (MCMs) are present, use of both sides of the PWB to mount components is generally difficult. This is because a PWB with MCMs has to accommodate complex wiring at high routing density, and two-side mounting of memory chips further complicates the wiring in the PWB, an economically and technically unfavorable option.

We now place this conclusion in a long-term perspective. As we have seen in the systems of the 1980s–1990s, before the arrival of microprocessors, the L- and M-cells were borne by their respective dedicated chips. With the advent of multicore processors, the L- and M2-cells have come to share the chip area. The multicore processor chip serves as a subsystem of the computing system and thus has brought the benefit of modular construction of complex systems. Modular construction is then extended to higher level components by co-implementing logic and memory chips in the MCM, and MCMs and higher level memory chips on the PWB (Fig. 8). On the other hand, the demand for ever larger computing capacity mounts, and we suppose a situation where the required number of cells exceeds the cell accommodation capacity of the circuit platform. The chip as a circuit platform may come to lack capacity; the M2-cells then need to be moved out to separate chips. The conventional microprocessor chip is transformed to a chip filled with cores (L-blocks, Fig. 19). Likewise, the memory chips are moved out of the MCM as the MCM needs to accommodate an increasing population of L-cells (Figs. 16, 19, and 20). In the discussion on layouts A and B (Figs. 19 and 20), the circuit platform of our concern is the PWB. The CPU-PWB in layout B is the manifestation of the trend of L-cell concentration on a dedicated platform.

The curves in Fig. 21 suggest that the increase in the number of on-chip cores helps to accommodate a system of larger size on a smaller number of PWBs. Since the number of on-chip cores is tied to the technology node in the present model, following Moore's law increases the capacity of the system to accommodate more cells. Hence, the system's geometric morphology results from the confluence of two drivers: the demand for increasing system size and the progress on the technology node scale. The pace of circuit miniaturization, however, has been slowing in recent years due to the increasing cost of circuit fabrication. It is most likely that the demand for larger system size will outpace the progress of circuit miniaturization in the future, and the system's hardware will have to take on a morphology where the L-cells are mounted on dedicated PWBs and the PWB placement pitch is narrowed to a minimum.

The upper bound for the PWB placement pitch of 1000 mm set in Fig. 21 is an artifact of calculation. Where the calculated pitch hits or nears this ceiling, the system involves only a few PWBs, and the PWBs can be deployed in a form similar to the one we have seen in Fig. 6. A few examples from the system based on 256-core processor chips are given as follows. To accommodate 1.33 × 10¹⁴ cells, the system with layout A requires one CPU-PWB (NC2 = 1) and three memory PWBs (NMMC2 = 3). These four PWBs can then be placed in a plane like those in Fig. 6. (Here, a slight expansion of the rack width is required to contain two PWBs in a row, but such modification is immaterial to the present discussion.) In a system having 5.322 × 10¹⁴ cells with layout B, the number of CPU-PWBs constituting a C2-unit is three (NC2 = 3), while that of memory PWBs increases to nine (NMMC2 = 9). In this case, the CPU-PWBs can be deployed in a plane, while the memory PWBs are packed in an array in a corner of the rack. We have seen similar component placement in Fig. 8.

Two-dimensional deployment of CPU-PWBs is particularly convenient for removing heat from the chips. However, the present study points out that such a convenient situation for cooling chips exists only where the system has an appropriate size corresponding to the technology node, and this favorable coupling of system size with technology node is found in a limited class of computers.

For large systems involving more than 10¹⁶ cells, we need multiple racks even when the progress on the technology node enables 1024 cores on the chip. We assume the following policy of packaging large systems:

  • The PWBs carrying C2-units are housed in dedicated racks, C2R, and the PWBs carrying high-level memories (MM-PWBs) are housed in dedicated racks, MR. The numbers of C2R and MR racks in the system are NC2R and NMR, respectively. The total number of racks in the system is

    (16) $N_R = N_{C2R} + N_{MR}$

  • In every rack, the PWBs are packed with the minimum placement pitch l*C2,z and in two arrays.

The primary variables are the system size, NS, and the number of on-chip cores, NLB/C0. With other parameters fixed (Table 5) and NLB/C0 related to the technology node F (Table 6) as before, the required number of racks NR is determined. The calculation steps are described in detail in Appendix C.
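
The rack count follows from simple packing arithmetic once the per-rack capacity is fixed. The sketch below stands in for the Appendix C derivation; the per-array stacking length and the resulting PWBs-per-rack capacity are assumptions for illustration only.

```python
from math import ceil

# Multi-rack packing sketch for Eq. (16). The per-rack capacity below is an
# assumption standing in for the Appendix C derivation, not a quoted value.
L_STAR_C2Z = 25.4      # minimum PWB placement pitch, mm
L_STACK_MM = 1000.0    # assumed stacking length per card array, mm
ARRAYS_PER_RACK = 2    # PWBs are packed in two arrays per rack

def rack_count(n_unit_pwbs, n_mm_pwbs):
    """N_R = N_C2R + N_MR with every rack fully packed at the minimum pitch."""
    pwbs_per_rack = ARRAYS_PER_RACK * int(L_STACK_MM // L_STAR_C2Z)
    N_C2R = ceil(n_unit_pwbs / pwbs_per_rack)  # racks for C2-units
    N_MR = ceil(n_mm_pwbs / pwbs_per_rack)     # racks for high-level memory PWBs
    return N_C2R + N_MR

# Illustrative call with placeholder PWB counts
print(rack_count(n_unit_pwbs=5.0e4, n_mm_pwbs=1.2e5))
```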

Figure 22 shows the required number of racks in the system (NR) versus the system size (NS) for some sample systems. The parameter is the number of on-chip cores (NLB/C0), varied over 32, 64, 256, and 1024. The PWB placement pitch (lC2,z) is set at 25.4 mm for all samples but one; for the system based on 32-core processors, the case of a 100 mm placement pitch is added. The curves are the results for layout B.

The results presented in Fig. 22 can be interpreted as follows. The rack is fully packed with PWBs, the PWBs are fully packed with MCMs or memory chips, and the MCMs are fully packed with logic chips (LC0-cards). Hence, the cell population in the rack is proportional to the rack volume. Meanwhile, the population of cells of any type, L- or M-cells, is in linear proportion to the system size, NS. Thus, the number of fully packed racks (NR) becomes proportional to NS.

The number of on-chip cores (NLB/LC0) affects the rack population in the system through the geometric relationship between the cores and the space in the rack. To explain the relationship, we need to go through a chain of equations that relate NLB/LC0 to the L-cell population on the chip (NL/LC0), and further relate the dimensions of lower-level components to those of higher-level components. It can be shown that the rack population in the system (NR) is in inverse proportion to NLB/LC0. For example, the increase in NLB/LC0 from 64 to 256 reduces NR by a factor of four.

The effect of PWB placement pitch on the rack population is also straightforward. A wide 100 mm pitch increases NR by a factor of four compared to the number obtainable with 25.4 mm pitch, as illustrated by the curves for the 32-core processor based system.
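
The three statements above (NR proportional to NS, inversely proportional to NLB/LC0, and proportional to the placement pitch) can be folded into a single scaling rule. The short sketch below is a rearrangement of those statements, not a formula taken from the paper, and it reproduces the factor-of-four examples.

```python
# Scaling rule combining the three proportionalities stated above: the rack
# count grows linearly with system size and placement pitch and shrinks in
# inverse proportion to the number of on-chip cores. This is a rearrangement
# of the text, not a formula quoted from the paper.

def scaled_rack_count(N_R_ref, size_ratio=1.0, cores_ratio=1.0, pitch_ratio=1.0):
    return N_R_ref * size_ratio * pitch_ratio / cores_ratio

# 64 -> 256 on-chip cores: the rack count drops by a factor of four
print(scaled_rack_count(1000, cores_ratio=256 / 64))    # 250.0
# 25.4 mm -> 100 mm pitch: the rack count grows by roughly a factor of four
print(scaled_rack_count(1000, pitch_ratio=100 / 25.4))  # ~3937
```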

Since the rack has a 1 m² footprint area, NR is a measure of the area to be provided for the deployment of the whole system. Note that the actual system area is several times larger than the sum of the rack footprints due to the need to provide aisles and service areas on the computer center floor. Even without counting such extra areas, the area occupied by the racks alone is already very large in the range of system sizes on the horizontal axis of Fig. 22. The figure includes the belts marked by broken lines that indicate the range of areas occupied by homes (100–200 m²), data centers in metropolitan areas (500–2000 m²), and data centers located in remote areas (7000–20,000 m²). Curbing the areal expansion of very large systems is an imperative issue for further development of the information-based society. Suppose that the volume of information to be processed by a data center or a supercomputer center increases 100-fold; hence, we need a 100-fold increase in the system size. Extension of the existing technology corresponds to following the line of a specific on-chip core number. That is, a 100-fold increase in the system size requires a 100-fold increase in the computer center area. A way to curb the area expansion is to advance the circuit fabrication technology. For example, suppose that a data center occupying a 7000 m² area (in terms of the rack footprints) is built using 32-core processors, and the 32-core chip corresponds to a technology node of 30 nm (Table 6). If the technology node is advanced to 5 nm and 1024-core processors become available, the center floor does not have to be expanded to accommodate a system whose size is 100 times that of the previous 32-core processor-based system. Such a scenario is unlikely, however, due to the mounting difficulty of following Moore's law in further circuit miniaturization.

Meanwhile, the demand for large-scale information processing is unabated. We need to explore ways to suppress the expansion of the system area besides circuit miniaturization on the chip. A straightforward way is to increase the packing density at all structural levels, that is, the MCM, PWB, and rack levels. Thus, the narrowed space in the rack will impose a paramount constraint on the design of very large systems. In the mid-2010s, a PWB placement pitch of 100 mm is common in server computers of data centers and in supercomputers. As we see in the example of the 32-core processor-based system (Fig. 22), the PWB placement pitch has a linearly proportional impact on the system's physical size. Decreasing the PWB pitch from 25.4 mm down to, say, 2 mm reduces the rack population by an order of magnitude. If a curve for a 2 mm PWB pitch were included in Fig. 22, it would almost overlap with the curve for a 256-core processor system with 25.4 mm pitch. This numerical example illustrates the tradeoff between the investment to advance circuit miniaturization and the elaboration of thermal solutions to remove heat from constrained inter-PWB spaces.

High-density packaging with limited space for cooling motivates the adoption of liquid as the coolant. Where the coolant path width is reduced to a few millimeters, immersion cooling provides a practically viable thermal solution. There are pioneering machines in this respect; immersion cooling of densely packed electronic components was already employed in CRAY supercomputers of the late 1970s and the 1980s [22]. In CRAY-2, FC77 was circulated in coolant paths of about 2 mm width. The chassis housing the CPU has a circular footprint of 1.35 m in diameter; hence, the footprint area is about 1.4 m². CRAY-2 achieved a computation speed of 160 mega FLOPS. By the mid-2010s, the performance of supercomputers had reached the level of several tens of peta FLOPS (∼10¹⁶ FLOPS). The K computer, one of the current generation of supercomputers, is housed in a multistoried building where the total floor area amounts to around 10,000 m². The microprocessor chip for the K computer has evolved from 8 [17] to 16 [18] cores, and the corresponding technology node from 45 nm to 28 nm. Indirect water cooling was employed for the CPU modules. On the PWB, the water-cooled modules and the air-cooled memory modules are comounted, so that the interboard spacing is supposedly more than 100 mm [23]. The space for cooling is constrained by the multiboard construction of the K computer, and the water tubing is squeezed into a tighter space than that afforded in the 1990s mainframes. In Ref. [4], the dimensional and performance data of CRAY, K, and other sample supercomputers are used to construct the corresponding models. The system size of the model corresponding to CRAY is calculated as 3.20 × 10⁶, and that of the K computer is 5.14 × 10¹⁶. In Fig. 22, these system size numbers are outside (to the left of) the range of the horizontal axis, suggesting that Fig. 22 is for future systems of much larger sizes.

A computer capable of delivering exa (10¹⁸) FLOPS may find a spot in Fig. 22. It is estimated in Ref. [4] that a future exa-scale supercomputer may require around 4 × 10¹⁸ cells. An exa-scale system may be housed in the same building as the current peta-scale system if the following condition is met: the core number on microprocessor chips is increased to the range of 32–64 with the corresponding progress of the technology node from 30 nm to 20 nm, and the PWB pitch is reduced to 25.4 mm. Some notes are due regarding this prediction, and they are described in Secs. 7.1 and 7.2.

System's Power Consumption.

The power consumption by the computer has so far been set aside from the present discussion. When the power issue is brought into consideration, we use the metrics of progress defined as

Computational efficiency = FLOPS/Psys

where Psys is the power consumption by the system in watts, so that the dimension of the computational efficiency is (floating-point operations/joule)

Computational density = FLOPS/Vsys

where Vsys is the space occupied by the computing system measured in liters, so that the dimension of the computational density is (operations/(s·L)). From the historical data of computers over many decades, a notable finding is reported in Ref. [3]: the data from diverse classes of computers fall close to a diagonal line in the logarithmic-scale graph of efficiency versus density. Moreover, the line of technological evolution is heading toward the upper right corner of the graph, where the data from biological brains are found. Interestingly, human efforts on computers have unintentionally followed a course toward emulating the biological brain. However, this fortunate development holds only up to the era of peta-scale computing [4]. As pointed out in the previous paragraph, from the geometric viewpoint, an exa-scale computer may be housed in a building for an existing peta-scale system if the circuit miniaturization is achieved as required. The problem is the projected power requirement for an exa-scale system if the design depends on the extension of the current circuit technology. Unless breakthroughs are made in circuit technology, the system's power requirement would be pushed to an unacceptably high level, and the system's state point in the efficiency–density graph would fall off the evolution line. An analysis was developed in Ref. [4] to find how the power consumption must be lowered in order to stay on the evolutionary course. The case study assumed a system composed of PWBs in an array and cooled by FC77; the PWB has a 1 m² area and is sufficiently thick to accommodate power supply and signal lines. The result points out that the coolant path width has to shrink to 240 μm, while the constraint on power consumption demands a low heat flux on the coolant path channel on the order of 10 W/m², or 0.001 W/cm². The length/width ratio of such coolant channels amounts to 4000. This number is in contrast to the ratio assumed in existing studies on microchannel cooling, which is around 200. The current interest in microchannel cooling originated from the pioneering work by Tuckerman and Pease [24], where the objective is to minimize the thermal resistance from the chip to the coolant. Compared to the on-chip microchannels, the prospective channels in future large-scale systems may be called very long microchannels (VLMCs). The heat flux, the temperature rise of the coolant, and the pressure drop in a VLMC are very low or at a modest level. This does not mean the disappearance of thermal problems in future large systems; rather, a new class of thermal problems will emerge. The VLMCs, embedded among densely packed electronic components, will undergo deformations due to thermomechanical stresses imposed on the component assembly. Where a channel is blocked by the occurrence of severe deformation or some other mechanism, the temperature rises to a dangerously high level even though the local heat flux is low. The distribution of coolant flow in a network of VLMCs must be controlled in order to create the desired temperature distribution throughout the system and avert thermal crisis in the event of local channel blockage. In this respect, the thermal control of future systems must emulate the process in biological brains.
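
The VLMC figures quoted above are easy to reproduce from the stated geometry. The sketch below assumes that a coolant channel runs the full 1 m side of the 1 m² PWB considered in Ref. [4]; it merely checks the channel aspect ratio and converts the quoted heat flux between units.

```python
# Consistency check of the VLMC figures quoted above, assuming the coolant
# channel runs the full 1 m side of the 1 m^2 PWB considered in Ref. [4].

channel_length_mm = 1000.0   # assumed path length: one side of the 1 m^2 PWB
channel_width_mm = 0.240     # 240 um coolant path width
print(f"length/width ratio ~ {channel_length_mm / channel_width_mm:.0f}")  # ~4167

heat_flux_W_per_m2 = 10.0
print(f"{heat_flux_W_per_m2} W/m^2 = {heat_flux_W_per_m2 / 1.0e4} W/cm^2")  # 0.001
```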

Whether we will be able to stay on the existing evolutionary course in developing computers of exa- and zetta-scale, and even beyond, is still uncertain. Considerable research efforts are being made to reduce the power requirement of computation, but none of the schemes currently under study has yet produced an optimistic outlook. Nonetheless, there emerges a concept that may influence the future course of development of large-scale computing. The concept is concerned with the architectural organization of the computer. Section 7.2 explains its relevance to the present study.

Relative Locations of Logic and Memory Cells.

Computers, for which we have developed the geometric models in Secs. 3 to 5, have memory cells implemented on platforms that are physically distinct from the logic processing platforms. This is the basic architecture adopted in computer design since the beginning of modern computers and is called the von Neumann architecture. One of the primary factors that determine the performance of von Neumann computers is the time required for logic cells to access memory cells. Among the parameters that govern the access time is the physical distance between logic and memory cells. To reduce the logic/memory communication distance, the locality rule is exploited. According to the locality rule, the logic circuit requires only a small volume of memory during a certain period of time. Thus, the memory is partitioned into several hierarchical levels, and the low-level memory blocks and logic blocks are comounted on chips, modules, and PWBs.

Meanwhile, as concluded in Sec. 6, concentration of logic blocks on dedicated PWBs slows the expansion of the system volume. Following this conclusion means a departure from the architectural evolution pursued by von Neumann computers. However, departure from the established course is not entirely irrational. It is now widely recognized in the computer community that the von Neumann architecture is losing its effectiveness in large-scale computation. Even though the memory access rate is improved by placing data at multiple hierarchical levels, the bottleneck for data flow eventually emerges at the interface to the main (highest level) memory. This so-called “memory wall” is the inherent disadvantage of the von Neumann architecture. As the memory wall looms, the search for non-von Neumann architectures intensifies [25]. In near-memory or in-memory computing, the memory blocks are provided with logic circuits that process data locally within the block, so that transfer of raw data over long communication lines to the CPU is avoided. This is tantamount to migration of logic cells into memory blocks. Where the migration process is carried to its extreme, the logic and memory functions are thoroughly blended, and all the cells perform logic functions. The blending of logic and memory functions in hardware is not a hypothetical artifact but is fundamental to the neural network organization, where memory is embedded as weights assigned to the lines linking logic processing nodes [25]. The system considered in Ref. [4], which assumed homogeneously packed logic cells, may be interpreted as a model of a non-von Neumann computer (although not explicitly declared in the paper). The result reported in Ref. [4] suggests that, even with a non-von Neumann architecture, the tightening spatial constraint will be the rule in future large-scale systems.

In summary, the recipe for the future includes a shift to non-von Neumann architecture, a rise of packaging density to extremely high levels, and a departure from high-heat-flux cooling. As these lines of technological development are pursued, computers approach biological brains in both hardware organization and thermal management.

The present paper is written as an extended supplement to the author's previous paper [26] on heat transfer science and engineering for computers. The scope of the paper is necessarily bounded due to the enormous expanse of the parametric domain of computer design. In a macroscopic interpretation, the evolution of computer technology is described as a move toward the corner of the parametric space spanned by the space, time, and energy axes [26]. The focus of the present study has been on the evolution projected on the space axis. In the next phase of the study, the scope will be broadened by taking into account those parameters pertaining to the time and energy axes. Even from a study of this limited scope, we have obtained some insights regarding the future of thermal management for computers.

The following rules are employed to construct a geometric model of the computer. The system structure is built of planar components of different dimensions. The components are assembled in a hierarchical order, with smaller components accommodated in a component of higher level. The assembly of components at an intermediate hierarchy level is replicated multiple times, and a certain number of intermediate assemblies are accommodated in a component of the upper level. The finest component in such hierarchical modular construction is the cell. The total number of cells to be accommodated in the system represents the system size. The cell dimension is pegged to the technology node, which advances toward smaller values over the years. The system size increases over the years, driven by the demand for larger-scale computing. Meanwhile, the intermediate-level components, such as chips, multichip modules, and PWBs, have dimensions that are slow to change due to manufacturing cost constraints.

The system morphology of the computer undergoes transitions precipitated by the competition between the miniaturization of cells and the increasing system size. The morphological transition is accompanied by the creation of a new spatial environment for cooling design. Where the increase in system size outpaces the rate of circuit miniaturization, the packaging density has to rise in order to moderate the expansion of the system's physical volume. With rising packaging density, the space for cooling is squeezed.

There is another factor that influences the system volume, that is, the spatial deployment of memory cells. In the traditional (von Neumann) computer architecture, the memory is split into multiple levels, where the level marks the capacity of the memory block and the physical distance to the logic block. The memory blocks of smaller capacity, hence of lower levels, are located closer to the logic blocks in order to accelerate the transfer of data between them. The spatial layout of logic and memory blocks also reflects the increase in parallel processing lines in the computer. The drive toward parallel processing has resulted in modularization of circuits, where the logic/memory units are replicated on the same platform. The multicore processor chips are the result of such a drive, and so are system board designs on the PWB. From the present analysis, a question emerges regarding the benefit of the von Neumann architecture in very large-scale computers of the future. The expansion of the system's physical volume will be moderated where the logic blocks and memory blocks are mounted on their respective dedicated PWBs. However, such a circuit layout increases the distance between logic and memory blocks and is obviously detrimental to the system's processing performance.

Cloud computing and supercomputing both require disruptive developments in a broad range of technologies to leap beyond the present state of the art. Even when we focus only on the spatial issues, we notice the needs to drastically reduce the sizes of building block components, increase the packaging density to unprecedentedly high levels, and depart from the traditional von Neumann architecture in order to contain the system volume in economically and technically acceptable ranges. Thermal management of the future has to allow the design of highly compact and power-thrifty computing systems. This means a departure from the traditional thermal management of large systems, where rising heat flux on the components' surfaces has been the main concern and the space requirement of cooling devices has been relegated to the background.

This study has been a part of the JSME industry-academia research project “Reliability Analysis and Thermal Management of Electronic Devices and Equipment.” The author thanks the industry sponsors and colleagues for their support.

Nomenclature

  • A = coefficient in Eq. (A2)
  • Ci = card, ith level
  • C2R = racks accommodating C2-units
  • C2-unit = PWBs constituting logic/memory unit
  • f = clock frequency, Hz
  • F = technology node, nm
  • FLOPS = floating point operations per second, s⁻¹
  • L = logic
  • lC0 = side length of chip, mm
  • lC2,x = width of PWB, mm
  • lC2,y = height of PWB, mm
  • lC2,z = pitch of PWB placement, mm
  • lL = side length of logic cell, μm
  • lLB,x = width of core, mm
  • lLB,y = height of core, mm
  • lLC1 = side length of MCM, mm
  • lMj = side length of level-j memory cell, suffix j suppressed where no distinction is made between adjacent levels, μm
  • lM1B,y = height of level-1 memory block, mm
  • lRD = depth of rack, m
  • lRH = height of rack, m
  • lRW = width of rack, m
  • lUC1 = inter-MCM spacing, mm
  • lU0 = interchip spacing, mm
  • lWC2 = cumulative width of PWB arrays
  • lWMM = cumulative width of MM-PWB arrays
  • lW/unit = width of PWB unit
  • l*C2,z = minimum pitch of PWB placement, mm
  • l*R = 2 lRW, m
  • Mj = level-j memory
  • MM = memory above level-3, applied to cases where memory cell sizes of different levels are set equal
  • MR = racks for memory PWBs
  • n = power in NLB = 2^n × 16, Sec. 4
  • nLB,x = number of on-chip cores in row
  • nLB,y = number of on-chip cores in column
  • nLC1/C2,x = number of MCMs in row on PWB
  • nM2C0/C2,x = number of level-2 memory chips in row on PWB
  • nMMC0/MC2,x = number of high-level (higher than 2) memory chips in row on PWB
  • nMMC0/MC2,y = number of high-level (higher than 2) memory chips in column on PWB
  • NC2 = number of PWBs
  • NC2R = number of racks accommodating C2-units
  • NC2/R = number of PWBs in rack
  • NC2/unit = number of PWBs in unit
  • NL = number of logic cells in system
  • NL/B = number of logic cells in core
  • NL/C0 = number of logic cells on chip
  • NL/LC1 = number of logic cells on MCM
  • NL/LC2 = number of logic cells on PWB
  • NLB/LC0 = number of cores on chip, “LC0” suppressed in Sec. 4
  • NLC1/C2 = number of MCMs on PWB
  • NLC2/unit = number of logic PWBs in unit
  • NMj = number of level-j memory cells in system
  • NMj/c = number of level-j memory cells on component c, c is chip or PWB
  • NM2C0 = number of level-2 memory chips in system
  • NMC2/unit = number of memory PWBs in unit
  • NMM = NM3 + NM4, applied to cases where memory cell sizes of different levels are set equal
  • NMMC0 = number of level-3 and -4 memory chips in system
  • NM/c = number of memory cells on component c, c is chip or PWB, applied to cases where memory cell sizes of different levels are set equal
  • NMMC2/array = number of high-level memory (higher than 2) PWBs in array in rack
  • NMR = number of racks accommodating memory PWBs
  • NPWBS = number of PWBs in system (Eq. (C17))
  • NR = number of racks in system
  • NS = total number of cells in system (system size)
  • Nunits/s = number of C2-units in space s, s is rack space or system space
  • Psys = power consumption by system, W
  • rM2C0 = variable to determine the number of M-PWBs in unit
  • Vsys = system volume, L
  • z = variable to determine the number of racks in system (Eq. (C22))

Greek Symbols
  • βM1B = ratio of level-1 memory block height to core height
  • ρMj/L = ratio; level-j memory cell population/logic cell population
  • ρMj+1/Mj = ratio; level-(j + 1) memory cell population/level-j memory cell population
  • ρS/L = ratio; total number of cells in system/total number of logic cells in system

Superscript
  • * = limit value

Appendix A: Derivation of Parameter Values Listed in Table 4

The reference microprocessor, SPARC64 X [14], has 16 cores on an area of 540 mm². The technology node is at 28 nm. The number of transistors on the chip is 2.95 × 10⁹. This number represents the number of M2-cells on the chip. Meanwhile, the number of L-cells on the chip, NL, is the population of computing devices. To relate the L-cell population to the computing performance, we use a simplified model of signal flow in parallel process lines and write (Eq. (2) in Ref. [26])

(A1) $\mathrm{FLOPS} \propto f\,\sqrt{N_L}$

where f is the clock frequency. The reference microprocessor (SPARC64 X) achieves 382 × 10⁹ FLOPS at f = 3 × 10⁹ Hz. The NL is the product of the number of L-blocks (cores), NLB, and the number of L-cells in L-block, NL/B, i.e., $N_L = N_{LB}\,N_{L/B}$. Equation (A1) is written in the form

(A2) $(\mathrm{FLOPS}/f)^{2}/N_{LB} = A\,N_{L/B}$

To determine the proportionality constant, A, we refer to another source, which describes an experimental 80-core microprocessor [15]. In this source, we find a specific number of transistors constituting an L-block, quoted as 1.2 × 10⁶ in a tile (L-block) of 3 mm². To convert the transistor population to the number of L-cells, we refer to Ref. [27], in which we find that a typical logic gate (fan-out 4) involves six transistors. Hence, for the L-block on this chip, NL/B = 2 × 10⁵. The experimental chip achieves 320 × 10⁹ FLOPS at f = 10⁹ Hz. Substituting these values along with NLB = 80 into Eq. (A2), we have A = 0.0064. This processor, however, employed transistors whose length scale corresponded to a technology node of 65 nm. To account for the effect of the technology node, we write A = 0.0064 (65/F)², where F is the technology node in nanometers. The ratio (65/F) is squared because the coefficient A is an area factor.

Returning to the reference processor, we substitute FLOPS = 382 × 10⁹, f = 3 × 10⁹ Hz, NLB = 16, F = 28 nm, and the derived A into Eq. (A2), and find the total number of L-cells on the chip as NL = NLB·NL/B = 4.7 × 10⁵. From Ref. [14], we also find the ratio of level-2 memory capacity to that of level-1 as 187.5 (the ratio is given in terms of bytes; 24 MB of level-2 memory versus 128 KB of level-1 memory).
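
The two-step derivation above can be replayed numerically. The following sketch first extracts the coefficient A from the 80-core chip data and then applies Eq. (A2) to the reference processor; it simply retraces the arithmetic of this appendix.

```python
# Replaying the Appendix A arithmetic.

# Step 1: coefficient A from the experimental 80-core chip [15]
N_L_per_B_80core = 1.2e6 / 6   # transistors per tile / 6 transistors per gate
A_65nm = (320e9 / 1e9) ** 2 / 80 / N_L_per_B_80core
print(A_65nm)                   # 0.0064

def A_of_node(F_nm):
    # Scale A to node F (nm); the (65/F)^2 factor is an area factor
    return A_65nm * (65.0 / F_nm) ** 2

# Step 2: L-cell count of the reference processor (SPARC64 X [14]) via Eq. (A2)
FLOPS, f, N_LB, F = 382e9, 3e9, 16, 28
N_L_per_B = (FLOPS / f) ** 2 / N_LB / A_of_node(F)
N_L = N_LB * N_L_per_B
print(f"N_L = {N_L:.2e}")       # about 4.7e5
```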

In specifying the values for the anchor-point model, we modified some numbers. For example, the chip side length is rounded to 20 mm from 23.2 mm, and the technology node to 30 nm from 28 nm. Other modifications are made to fit the cells in the floor plan. The modified numbers are close to the data estimated from Ref. [14]. For example, the L-cell population on the chip is set at 5.000 × 10⁵, modified from the calculated value of 4.7 × 10⁵; the main memory capacity is 2.222 × 10⁹ M2-cells as compared to the transistor count of 2.95 × 10⁹; and the M2/M1 memory capacity ratio (ρM2/M1) is 111.11 as compared to 187.5.

Appendix B: About Layouts A and B
Layout A.

We denote the number of lots available for placement of L-cells on PWB (C2-card) as N*L/C2. From among a number of possible layouts, we choose the following. Two columns of MCMs (LC1-cards) are placed on PWB, so that the number of MCMs on PWB is

(B1) $N^{*}_{LC1/C2} = 2\,l_{C2,y}/(l_{LC1}+l_{UC1})$

The number of available lots for chips in MCM is

(B2) $N^{*}_{LC0/LC1} = l_{LC1}^{2}/(l_{C0}+l_{U0})^{2}$

and that for L-cells in the chip is $N^{*}_{L/LC0} = N_{L/LB}\,N_{LB/LC0}$ (Eq. (4)). Hence

(B3) $N^{*}_{L/C2} = N_{LB/LC0}\,N_{L/LB}\,N^{*}_{LC0/LC1}\,N^{*}_{LC1/C2}$

Since we have already assumed two columns of MCMs, the width available for placement of memory chips (M2C0-cards) is reduced by $2(l_{LC1}+l_{UC1})$, so that the number of available columns for memory chips on PWB is

(B4) $n_{M2C0/C2,x} = [\,l_{C2,x} - 2(l_{LC1}+l_{UC1})\,]/(l_{C0}+l_{U0})$

The number of available rows for memory chips is

(B5) $n_{M2C0/C2,y} = l_{C2,y}/(l_{C0}+l_{U0})$

Hence, the number of lots available for placement of level-2 memory chips on PWB is

(B6) $N^{*}_{M2C0/C2} = n_{M2C0/C2,x}\,n_{M2C0/C2,y}$

This is multiplied by the number of available lots for M2-cells on the chip, $N^{*}_{M/MC0} = (l_{C0}/l_{M2})^{2}$, and we write the number of available lots for M2-cells on PWB as

(B7) $N^{*}_{M/C2} = N^{*}_{M/MC0}\,N^{*}_{M2C0/C2}$

The ratio of available lots for M2-cells to those of L-cells on PWB is

(B8) $\rho^{*}_{M2/L} \equiv \dfrac{N^{*}_{M2/C2}}{N^{*}_{L/C2}} = \dfrac{N^{*}_{M/MC0}\,N^{*}_{M2C0/C2}}{N^{*}_{L/LC0}\,N^{*}_{LC0/LC1}\,N^{*}_{LC1/C2}}$

In the assumed layout of MCMs, that is, in two columns, the ratio of available memory cell lots to L-cell lots exceeds the value of ρM2/L specified in Table 5 (obtained from ρM2/M1·ρM1/L):

$\rho^{*}_{M2/L} > \rho_{M2/L}$

Suppose that all the lots except for those for memory chips (M2C0-cards) are filled. Then, we fill some lots for memory chips and leave the rest vacant, and denote the number of filled memory chip lots as NM2C0/C2. The cell population ratio is

(B9) $\rho'_{M2/L} = \dfrac{N^{*}_{M/MC0}\,N_{M2C0/C2}}{N^{*}_{L/LC0}\,N^{*}_{LC0/LC1}\,N^{*}_{LC1/C2}}$

To bring ρ′M2/L close to ρM2/L, we vary NM2C0/C2 until the following condition is met:

At NM2C0/C2: $\rho'_{M2/L} \le \rho_{M2/L}$

At NM2C0/C2 + 1: $\rho'_{M2/L} > \rho_{M2/L}$

NM2C0/C2 converges to 191 in the present examples, where the component dimensions are given in Table 5 and ρM2/L = 4440. The relative difference $(\rho_{M2/L} - \rho'_{M2/L})/\rho_{M2/L}$ is 0.004. This number of memory chips on PWB is applicable to all cases of the number of on-chip cores (NLB/LC0) because ρM2/L is fixed. Hence, the layout shown in Fig. 19 is independent of NLB/LC0.
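
The search for NM2C0/C2 described above is a one-dimensional count-up and is easily scripted. The sketch below implements the stopping condition built around Eq. (B9); the lot-capacity arguments are placeholders, since their values follow from the component dimensions of Table 5.

```python
# Layout A: search for the number of filled memory chip lots per PWB,
# using the stopping condition built around Eq. (B9). The lot-capacity
# arguments are placeholders for values derived from Table 5 dimensions.

def filled_memory_lots(rho_M2_L, N_M_per_MC0, N_L_per_LC0,
                       N_LC0_per_LC1, N_LC1_per_C2, N_M2C0_lots_max):
    N_L_per_C2 = N_L_per_LC0 * N_LC0_per_LC1 * N_LC1_per_C2  # Eq. (B3)
    n = 0
    # count up while rho'_M2/L at n + 1 still stays at or below rho_M2/L
    while (n + 1) <= N_M2C0_lots_max and \
          N_M_per_MC0 * (n + 1) / N_L_per_C2 <= rho_M2_L:
        n += 1
    return n  # largest n with rho'_M2/L <= rho_M2/L

# Equivalently, n = min(N*_M2C0/C2, floor(rho_M2/L * N*_L/C2 / N*_M/MC0)).
```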

Layout B.

In layout B, we suppose a PWB loaded with MCMs only, and we call it CPU-PWB (LC2-card). The number of L-cells on CPU-PWB is

(B10) $N_{L/LC2} = N_{L/LC0}\,N_{LC0/LC1}\,N_{LC1/LC2}$

The card unit is composed of CPU-PWBs and M-PWBs, where M-PWB (MC2-card) carries level-2 memory chips. The number of CPU-cards in the unit is denoted as NLC2/unit, and that of M-PWBs is NMC2/unit. Hence, the number of PWBs in the unit is

(B11) $N_{C2/unit} = N_{LC2/unit} + N_{MC2/unit}$

The number of L-cells in the unit is

(B12) $N_{L/unit} = N_{LC2/unit}\,N_{L/LC2}$

The corresponding number of M2-cells in the unit is

(B13) $N_{M2/unit} = \rho_{M2/L}\,N_{L/unit}$

The corresponding number of level-2 memory chips (M2C0-cards) on PWB is

(B14) $N_{M2C0/unit} = N_{M2C0/MC2} = N_{M2/unit}/N_{M/MC0}$

We assume that the two sides of M-PWB are used to mount memory chips; thus, the number of available lots for level-2 memory chips is