For most technology companies, products are their lifeblood. A lot goes into creating them, but if one steps back and looks at the process, the mathematics perhaps most applicable to a company as a product-generation-machine is queueing theory. In my mind, queueing theory holds the key to much of what we do right, and much of what we do wrong, as managers and workers.
Queueing theory is often said to have been born when Napoleon asked Poisson how many cannons he would have to send to the front to have 80% confidence in getting 100 cannons to where he wanted them, when he wanted them. The most readable book on the production impact of queueing is The Goal, by Eliyahu Goldratt. One of the key points Goldratt makes is that to get control of, and minimize, the cycle time of a process, one should put the bottlenecks at the beginning, if possible. A bottleneck is a task that takes a long time, uses significant scarce resources, and has potentially large fluctuations in latency. In semiconductor processing, one finds most of the bottlenecks in lithography—for instance, resolution enhancement technology-based decoration, reticle writing, reticle inspection, and wafer exposure. The bottleneck processes in any system are usually obvious from the large queues held up in front of them.

For engineering processes, the most unpredictable part of the process is developing the science or basic technology. If we want our development times to be fast and predictable we must, quite simply, get the basic technology done before the engineering is started. If one ends up solving basic technological issues during product development, then development times will stretch unpredictably. One simply must confront all of the hard technical questions early in the process. Sometimes we are tempted to do the engineering “in parallel” with the science. Even if one were smart enough to plan to line up the expected durations of these activities, the huge variance in the time required to do the science would make this an extremely inefficient process. There is no shortage of examples of this across industries. Sometimes we act as if we fear the data and don’t, with single-mindedness of purpose, drive to prove or disprove our mastery of the basic technology over the range of expected use (and then some). That is a mistake.
The astute product manager or technology manager will drive to get these basic questions answered early and, hopefully, inexpensively. It is relatively easy to make and report impressive-looking progress on peripheral issues, but the question is, are we resolutely solving the core risk issues? Paranoia is good.
From a factory perspective, one reason for wanting the bottleneck at the beginning of the line is that when material inevitably queues up before the bottleneck step, it is lower-value inventory: we have not yet gone to the expense of adding value to it through the rest of the manufacturing process. For an engineering program the situation is analogous. If there is going to be an uncontrolled delay, one wants it at the beginning (before the phase gate that marks the start of product development), before large engineering teams have invested in detailing a design, even if that design does not ultimately change as a result of the new understanding.
The other reason for the factory to have the bottleneck at the beginning is that subsequent steps can be carried out with low latency (fast cycle time). Variation is the curse of manufacturing. In engineering we want to strive for a similar situation. We want to set the “final” market requirements and speedily and predictably execute the design before the market changes. Predictability also creates a better experience for our customers.
Little’s Law is a delightfully obvious and non-obvious theorem. (The non-obvious part is that it is independent of probability distributions and queueing service algorithms. It was first proved in 1961 by John Little.)
Little’s Law says that, in the steady state, the average number of customers or items or tasks or whatever in a stable system (over some time interval) is equal to their average arrival rate multiplied by their average time in the system. The only ways to lower the number of items in the system are to slow their average arrival rate or, on average, to get them out of the system faster. As more people per hour enter the store, the number of people in the store increases.
Instead of arrival rate, it is equally valid to talk about completion rate. So, for instance, one can think of how many widgets a machine can produce per hour or how many tasks an engineer can complete per week. If you pile on widget WIP (work in progress) or engineering tasks, Little’s Law says that the average time in the system (the cycle time) is the amount of WIP/tasks/etc., divided by the average completion rate. We see this all the time, too. We pile someone up with tasks and they become unresponsive. It’s math. It’s the math behind one of my favorite expressions, “Part-time people round to zero.” More people enter the store and the cash register lines get longer. This is why “focus” is one of the magic elixirs of business. If you want your operation to become more responsive, you either have to work faster or find a way to get people to pile fewer things in your queue. In another example, the more products an engineering team (with a fixed completion capacity) works on, the longer the projects take on average. Sound familiar?

This is one of the reasons that one of the most important strategic decisions a general manager or company CEO can make is what NOT to work on. It is perfectly legitimate and appropriate to ask the business question, “What projects should I kill (redeploying the resources) in order to speed my time-to-market?” Because competitive differentiation is a function of time, speeding the time to market will increase profitability even if you never save a dollar by killing the other project, though one can usually optimize better than that. Another version of this is not moving resources off the last project and onto the next (adding another project to the queue) until the last one is really done (out of the queue—not being worked on anymore).
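Little’s Law is just arithmetic, and a few lines of Python make both forms concrete. The numbers below (customers per hour, tasks per week) are made up for illustration:

```python
# Little's Law in both directions, with made-up numbers.

def items_in_system(arrival_rate, avg_time_in_system):
    """L = lambda * W: average number of items in the system."""
    return arrival_rate * avg_time_in_system

def avg_cycle_time(wip, completion_rate):
    """Rearranged for a work queue: W = WIP / throughput."""
    return wip / completion_rate

# 30 customers/hour enter a store and each stays half an hour:
print(items_in_system(30, 0.5))   # 15.0 people in the store on average

# An engineer who completes 5 tasks/week is carrying 20 queued tasks:
print(avg_cycle_time(20, 5))      # 4.0 weeks average wait per task
```

Pile 20 tasks on a 5-task-per-week engineer and the month-long average response time falls straight out of the division; no probability distributions are needed.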
Fluctuations, uncertainty, and variation add another dimension to the problem. Inevitably, whatever the plan, s–t happens. There is a very old saying, “If you want to make God laugh, make a plan.” So there will be deviations from the ideal. The clearest explanation I have seen of the effects of fluctuation on engineering is in Chapter 3 and Appendix 1 of Fast Innovation, by George, Works, and Watson-Hemphill. In Appendix 1, the authors work through the equations on the effects of variations, of cross training, and of re-use on engineering project cycle time.
Let us define C as the normalized variation (the coefficient of variation), that is, the standard deviation divided by the mean (average). Typical values of C are <0.1 for manufacturing tasks, 0.5 for (from scratch) engineering tasks, and <0.2 for engineering tasks with significant re-use. Note how large the variation is for engineering tasks; there are a lot of data behind this. One can only speculate on how much larger the normalized variation is for basic technology development tasks.
Let R be the average utilization between 0 and 100% and N be the number of cross-trained resources, where N=0 is no cross-training. In this case the number of items or tasks in the queue is given by
Queue size = [R² / (1 − R)] × [1 / (N + 1)] × C²
Using Little’s Law we have
Cycle time = [R² / (1 − R)] × [1 / (N + 1)] × C² × [1 / (completion rate)]
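As a sketch, the two formulas can be wrapped in a few lines of Python. The inputs below are purely illustrative, not data:

```python
# A sketch of the queue-size and cycle-time formulas above.

def queue_size(R, N, C):
    """Average tasks queued: [R^2/(1-R)] * [1/(N+1)] * C^2.

    R: average utilization (0..1), N: number of cross-trained resources,
    C: coefficient of variation of task times.
    """
    return (R**2 / (1 - R)) * (1 / (N + 1)) * C**2

def cycle_time(R, N, C, completion_rate):
    """Little's Law: average time in system = queue size / completion rate."""
    return queue_size(R, N, C) / completion_rate

# From-scratch engineering (C = 0.5), no cross-training, 80% utilization,
# 5 tasks completed per week:
print(round(queue_size(0.80, 0, 0.5), 2))     # 0.8 tasks waiting on average
print(round(cycle_time(0.80, 0, 0.5, 5), 2))  # 0.16 weeks of queueing delay
```

Plugging in different values of R, N, and C makes the term-by-term discussion that follows easy to verify for oneself.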
Let’s go through the terms in order. The first one is a killer. It says that as we drive utilization, R, toward 100%, the number of tasks in the queue and the cycle time both explode. We see this in everyday life on the highway: we try to run the system close to capacity at rush hour, and traffic grinds to a near standstill. The same thing is true of engineering projects. Schedule a critical-path resource at 95% capacity in the name of efficiency and all projects going through that bottleneck grind to a halt. Schedule at 65% of capacity and the inevitable variations can be accommodated by the extra slack, so tasks do not pile up. This is why many organizations do not try to schedule critical-path engineers at more than four days per week; when the crunch comes, there are another three days per week to catch up without a pile-up. At many companies it is often macho to over-schedule and over-suffer. As businesses mature, they become more and more dependent on deep competencies held by a few people (the bottlenecks) in the company. How does one schedule these folks? How does one avoid piling on tasks and scheduling them to 100%, or more? Without the right discipline, the math is clear about the inevitable result.
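A quick tabulation of the first term shows the explosion directly; the utilizations chosen are illustrative:

```python
# The first term, R^2/(1-R), explodes as utilization R approaches 100%.
for R in (0.65, 0.80, 0.95, 0.99):
    print(f"R = {R:.0%}: R^2/(1-R) = {R**2 / (1 - R):.1f}")
# Going from 65% to 95% utilization multiplies the term roughly fifteen-fold.
```

The last few points of "efficiency" are where nearly all of the queueing damage is done.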
The second term depends on the amount of cross-training. As every director of Operations knows, one of the more effective ways to lower cycle time is to increase cross-training. In engineering, even someone trained on only a subset of the competencies of the person who is the bottleneck can make a big difference. This is also one of the reasons I am a great fan of standardizing design tools and processes across a company: standardization increases the number of people who can be drawn on to help in a crisis. Many managers, myself included, have felt the pain of running into a bottleneck in some software area and discovering that no other resources could be redeployed to help, because the offending group had decided to work in some obscure “high-level” programming language or paradigm. All the efficiency arguments had seemed persuasive in a world without variance, without crises. But not in the world I live in.
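The cross-training term is easy to tabulate on its own, holding the other terms fixed (illustrative values of N):

```python
# The cross-training term 1/(N+1): each cross-trained resource divides
# the queue, and hence the cycle time, all else being equal.
for N in (0, 1, 2, 3):
    print(f"N = {N}: cross-training factor = {1 / (N + 1):.2f}")
# Two cross-trained helpers (N = 2) cut the queue contribution to a third.
```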
The third term is the square of the normalized variance. One can make this much worse, as we have discussed, by doing science and engineering concurrently. One of the few ways to make it better is through re-use, which has fewer surprises than from-scratch design. A variant on this is to use a team that has done a very similar project before, i.e., re-use of expertise and experience.
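Because C enters squared, the payoff from re-use is quadratic. A one-line check using the typical values quoted earlier (C of roughly 0.5 from scratch versus 0.2 with significant re-use):

```python
# Variation enters the cycle-time formula as C^2, so re-use pays quadratically.
penalty = 0.5**2 / 0.2**2
print(round(penalty, 2))  # 6.25: from-scratch work carries >6x the variation term
```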
The last term is related to how fast one can work. The ways to improve this term are through the use of more powerful tools, through teams that have done it before, by having clear specifications and keeping them constant, and by having smart, motivated, hardworking, and talented people.
As you can see, good management involves a number of non-intuitive activities, including
- Killing projects and limiting the number that are started in order to improve competitive differentiation and time-to-market
- Putting the preponderance of the effort into the core risk items early in the project even though that will result in less visible progress
- Not tightly scheduling expensive and scarce resources so as to avoid cycle-time disasters
- Not doing some types of work in parallel, such as engineering and technology development, even though it might improve the time-to-market of that product
Great managers have the guts and discipline to do the right thing, and to do it today. Of course I have used idealized models and sweeping generalizations here, and there are always exceptions. But as companies try to ascend from good to great, it would help if we all had the discipline and insight to manage our product development with a well-developed understanding of the implications of our actions.