Vaster than Empires and More Slow: the Dimensions of Scalability
J. L. Sloan
At the close of the last century, I found myself deeply immersed in the study of large-scale hierarchical storage systems. By large-scale, I mean petabytes, which is on the order of ten to the fifteenth bytes, or millions of gigabytes. Such systems may seem almost mundane today, in a world of gigabyte flash MP3 players the size of your thumb. But back in 1994, the system that I worked with seemed colossal. The storage hierarchy ranged from tens of thousands of offline tapes, to immense robotic tape libraries, to disk farms, to processor memories. Each of those components was implemented as its own hierarchy of technologies, lending a fractal aspect to the overall architecture. The system was finely tuned for tradeoffs of capacity and speed, and, not being an academic exercise but a production system in daily use by many people, for reliability and usability as well.
I found myself thinking a lot about cache behavior. Most folks think of caches as a piece of silicon in the latest microprocessor chip, or maybe just a number on a data sheet. But caches are all over the place, at all levels in every storage hierarchy, right down to the yellow sticky notes you place around your computer screen to remind you to meet a friend for lunch or what your password is.
I tried, more or less successfully, to model the behavior of the storage system with trace-driven computer simulations of cache behavior, trying to capture, for example, the working set size of the cache formed by the disk farm. As hard as it is to believe, disks, at least of the speed, capacity, and quantity that we needed to keep a half-dozen supercomputers fed so that their applications did not stall waiting for data, were expensive. Finding the right tradeoff for disk farm capacity versus system performance was financially important enough that it made the modeling, which was itself a supercomputer application, worthwhile. I was sufficiently successful in my quest that I published several papers on the topic.
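The flavor of those simulations can be sketched briefly. The following Python fragment (a modern stand-in for the original supercomputer application; the toy trace and capacities are made up for illustration) replays a trace of dataset references through an LRU-managed cache of fixed capacity and reports the hit rate. Plotting hit rate against capacity is the kind of exercise that reveals the working set size.

```python
from collections import OrderedDict

def hit_rate(trace, capacity):
    """Replay a reference trace through an LRU cache of the given
    capacity (measured in datasets) and return the fraction of hits."""
    cache = OrderedDict()  # keys kept in LRU order, oldest first
    hits = 0
    for dataset in trace:
        if dataset in cache:
            hits += 1
            cache.move_to_end(dataset)       # mark most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[dataset] = True
    return hits / len(trace)

# A toy trace with strong locality: a few hot datasets, some cold ones.
trace = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "e", "a", "b"]
for cap in (1, 2, 4):
    print(cap, hit_rate(trace, cap))
```

Once the hit rate stops improving as capacity grows, you have found the working set; buying more disk beyond that point is money wasted.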
One of the papers I read by a fellow researcher in this area remarked that ultimately, all problems in computer science boiled down to cache behavior. I knew this wasn’t literally the case, but it had a ring of truth to it that I liked. All forms of human endeavor are interconnected. If you study stamp collecting to sufficient depth, you will probably find that most if not all of the history of human civilization and all of its science is encapsulated in the domain of stamps, postage, ink, glue, and the transmission of messages. Or, you could study motorcycles, going back to the invention of the wheel, and even further to the use of logs to roll big stone blocks. Author and television personality James Burke has made a career of observing the connections between technological developments that might otherwise seem unrelated.
But in computing, caches show up with almost alarming frequency for another reason. Caches are a vital weapon in the never-ending war for scalability.
Caches exist because the microprocessor can consume data faster than the memory can deliver it, so it pays to keep the most often used data near at hand. Or in my case, because it takes too long for the robot in the tape library to find the correct tape cartridge, so it pays to keep datasets used frequently by a supercomputer user on disk. Web browsers cache often referenced web pages on a local disk because my spousal unit is too impatient to wait for the network to deliver the page from some far off server.
The differences in performance of microprocessors and memories, supercomputers and tape drives, web servers and spousal units, are all well known by technologists in those problem domains. These dimensions of scalability were painfully familiar to the architects of these systems. But while I studied cache behavior with the goal of improving system performance and maybe even saving money, it occurred to me that there was another dimension of cache behavior that was frequently ignored by system architects in their quest for scalability. This dimension was how the relative performance of different system components changed over time.
I began looking for the first derivatives of the performance of the various system components, for example, how microprocessor speeds changed versus how network bandwidth changed over time. It occurred to me that a system architected for the appropriate balance of performance of its different subsystems today might become sub-optimal a few years down the road.
I collected data from a variety of sources, ranging from Moore’s Law, to the National Science Foundation, to NASA, to industry trade magazines. I ended up with this little table.
Microprocessor speed doubles every 2 years.
Memory density doubles every 1.5 years.
Bus speed doubles every 10 years.
Bus width doubles every 5 years.
Network connectivity doubles every year.
Network bandwidth increases by a factor of 10 every 10 years.
Secondary storage density increases by a factor of 10 every 10 years.
Minimum feature size halves (density doubles) every 7 years.
Die size halves (density doubles) every 5 years.
Transistors per die double every 2 years.
CPU cores per microprocessor chip double every 1.5 years.
What we have here is a set of mostly disparate power curves that illustrate how the performance of major components changes with respect to the others over time. And not much time, either. For example, in a decade, microprocessor speed will increase by more than a factor of thirty, while bus speed will only double, and network bandwidth will increase by an order of magnitude. These power curves are illustrated logarithmically in Figure 1. Note that some of the curves fall on top of one another, making them a little hard to see.
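The decade factors follow directly from compound doubling: a component whose performance doubles every d years improves by a factor of 2^(years/d). A minimal sketch of that arithmetic, using the doubling periods from the table above:

```python
def improvement(years, doubling_period):
    """Factor of improvement after `years`, given a doubling period."""
    return 2 ** (years / doubling_period)

decade = 10
print(improvement(decade, 2))   # microprocessor speed: 2^5 = 32
print(improvement(decade, 10))  # bus speed: 2^1 = 2
# Network bandwidth is quoted directly as 10x per decade: one order of magnitude.
```

A factor of thirty-two on one curve against a factor of two on another is how a balanced architecture drifts out of balance.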
Admittedly, some of these power curves from my research in 1994 have gotten a little shaky lately as the manufacturers have had to resort to more and more arcane measures to maintain their rate of improvement. Somewhat ironically, many of these measures involve the introduction of yet more caches. But if you just buy into the concept that different technologies are on very different exponential curves of performance improvement, then you are pretty much forced to admit that the balanced system architecture you designed today might not cut the mustard just a few years down the road. Design decisions which made sense at the time for the trade-off of processor speed versus network bandwidth may not seem as wise much later.
In one of my favorite papers, “Software Aging” [Proc. 16th IEEE Int. Conf. on Soft. Egr., May 1994], computer scientist David Parnas describes how software systems bow to entropy just as mechanical systems do, mostly due to the cumulative effects of changes made over the life span of the system, as new features are added, or as the system is adapted to changes in its environment. This is like radiation damage to DNA; eventually, slowly, the cumulative effects become lethal. This temporal dimension of scalability illustrated by the power curves in Figure 1 is an example of the kinds of environmental changes – network bandwidth, faster servers, more remote users – that are likely to occur for any system. Note that the system architect does not have a choice here. Given world enough and time, technology marches on, and failing to adapt a system to the changing climate would be equally lethal.
Lawrence Bernstein was a Bell Labs wonk at a time when Bell Labs was still capable of generating Nobel Prizes, before it too joined the list of sad, neutered corporate research labs. In 1997, he wrote a paper called “Software Investment Strategy” [Bell Labs Tech. J., Summer 1997] in which he made the following observation: improvement in programmer productivity, as measured by the ratio of source lines of code to machine instructions, was also on a sort of power curve, albeit with a few twists and turns here and there. The technologies contributing to this improvement in productivity ranged from high level languages early on, to timesharing, and later to object-oriented design and implementation. He predicted large-scale code reuse by 2000. It would be easy to think he missed the mark on this one. But when you stop to consider the impact made by the C++ Standard Template Library (STL), the exploding popularity of design patterns, the industry that has grown up around reusable managed components for Java or Microsoft’s .NET, or even the open source movement, Bernstein may not have been off the mark. His curve of programmer productivity is shown in Figure 2.
This brings us to another dimension of scalability: process, that is, the techniques and tools we use to design, develop, and implement the systems we build. Many years ago, my beloved and frequently exasperating mentor, Bob Dixon, observed that technological development was recursive: you could design more powerful microprocessors because the tools you were using to do so were based on the prior generation of slightly less powerful microprocessors. This creates a kind of positive feedback loop. The same observation could probably be made of mechanical design all the way back to the start of the bronze age, and maybe much earlier. More than one author -- Vernor Vinge and his science fiction and non-fiction writing on the Singularity immediately comes to mind -- has made fruitful use of this idea.
The processes we use to build our systems leverage, in part anyway, off those same power curves. Should our processes and tools not keep pace with the technology, we will surely have increasing difficulty grappling with the construction of those systems as they get ever larger and faster. If you are looking for an argument for moving from procedural to object oriented languages, for using libraries like the STL, for trying out an integrated development environment, for upgrading your servers and desktops, or even for replacing an ad hoc development process with a more formal one, this might be it.
Where we apply those processes and tools matters just as much to the scalability of our systems as how we apply them. In his book Object-Oriented and Classical Software Engineering [McGraw-Hill, 2002], Stephen Schach describes the relative proportion of cost for each of the phases of the life-cycle of a software development project. Schach’s numbers are shown in Figure 3. He comes to a conclusion that some (but not all) will find startling: 67% of the cost of software is in its maintenance: changes made to the software after the project is deemed complete. Some organizations with experience in maintaining large code bases, and by large here I mean millions of lines of code, place this number closer to 70% or even higher.
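The implication of that 67% figure is easy to make concrete. If maintenance is a fraction m of total life-cycle cost, then development is the remaining (1 - m), so total cost is the development cost divided by (1 - m). A back-of-the-envelope sketch (the one-million-dollar development budget is hypothetical):

```python
def lifecycle_cost(development_cost, maintenance_fraction=0.67):
    """Total life-cycle cost and its maintenance share, given the
    development cost and the fraction of the total spent on maintenance."""
    total = development_cost / (1.0 - maintenance_fraction)
    return total, total * maintenance_fraction

total, maintenance = lifecycle_cost(1_000_000)
# Every dollar spent before delivery implies roughly two more after it.
print(total, maintenance)
```

A million dollars of development thus signs you up for roughly two million more of maintenance, whether or not it appears in anyone's cost estimate.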
This harkens back to Parnas’ idea of software entropy. Software systems become more and more expensive to modify over time due to the cumulative effect of changes. It is no surprise, then, that the bulk of the cost of developing software is in making these changes. This is as much a limit to scalability as processor speed or network bandwidth. Increases in the efficiency of tools and processes must be applied not just to new code development, but to long-term code maintenance as well. This is an area where code refactoring -- the ability to substantially improve the design of code, including its maintainability, without altering its external behavior -- will continue to play a major role. Likewise, this calls for more thought about designing new code to be easy to modify, since the effort spent changing code after the fact is the bulk of the cost of software development. In my experience, this issue is largely ignored by software developers, except by the proponents of refactoring. Most developers -- and, truth be told, their managers as well, if widely used development processes are any indication -- are happy if code passes unit testing and makes it into the code base anywhere near the delivery deadline.
I find that this long term cost is seldom taken into account, and its omission arises in sometimes surprising ways. I once heard a presentation from a software development outsourcing company. It happened to be based in India, but I am sure that there are plenty of home-grown culprits too. The company described several cost estimation techniques used by their ISO 9001 certified process which was assessed at SEI CMM Level 5 and used Six Sigma methodology. None of the cost estimation techniques addressed the cost of long term code maintenance. Code maintenance after delivery of the product was charged per hour.
I almost leaped from my chair, not because I was angry, but to go found a software development company based on this very business model. The idea of low-balling the initial estimate and then making a killing on the 67% of the software life-cycle cost pie was a compelling one. Only two things stopped me. First, I had already founded a software development company. And second, I had read a similar suggestion made by Dogbert in a recent Dilbert cartoon strip, so I knew that everyone else already had the same idea. It is as if once we deliver a line of code to the code base, we think that the investment in that code is over. In fact, Schach tells us it has just begun. Every single line of code added to a code base contributes to an ever increasing total cost of code ownership.
There are more dimensions to system scalability than just balancing the performance of the various components. You must take into account how the relative performance of those components change over time. You must apply scalable processes to the development of the system. And you must consider the long-term maintenance of the code. Failure to take any of these issues into account limits the scalability of your system just as surely as if you had designed it around obsolete technology.
(The author would like to acknowledge a debt to Andrew Marvell, whose poem “To His Coy Mistress” inspired this article, and which is probably the best pick-up line of the 17th or any other century.)