1. Every time an advertisement is shown on a website, this event is counted as an im‐
pression. The concept is important in the advertising industry, since advertisers often
buy a certain number of such impressions.
CHAPTER 1
Why Feedback? An Invitation
Workflow, order processing, ad delivery, supply chain management—
enterprise systems are often built to maintain the flow of certain items
through various processing steps. For instance, at a well-known online
retailer, one of our systems was responsible for managing the flow of
packages through the facilities. Our primary control mechanism was
the number of pending orders we would release to the warehouses at
any one time. Over time, these orders would turn into shipments and
be ready to be loaded onto trucks. The big problem was to throttle the
flow of pending orders just right so that the warehouses were never
idle, but without overflowing them (quite literally) either.
Later I encountered exactly the same problem, but in an entirely dif‐
ferent context, at a large publisher of Internet display ads. In this case,
the flow consisted of ad impressions. 1 Again, the primary “knob” that
we could adjust was the number of ads released to the web servers, but
the constraint was a different one. Overflowing the servers was not a
concern, but it was essential to achieve an even delivery of ads from
various campaigns over the course of the month. Because the intensity
of web traffic changes from hour to hour and from day to day, we were
constantly struggling to accomplish this goal.
As these two examples demonstrate, maintaining an even flow of items
or work units, while neither overwhelming nor starving downstream
processing steps, is a common objective when building enterprise
systems. However, the changes and uncertainties that are present in all
real-world processes frequently make it difficult, if not impossible, to
achieve this goal. Conveyors run slower than expected and web traffic
suddenly spikes, disrupting all carefully made plans. To succeed, we
therefore require systems that can detect changes in the environment
and respond to them.
In this book, we will study a particular strategy that has proven its
effectiveness many times in all forms of engineering, but that has rarely
been exploited in software development: feedback control. The essen‐
tial ingredient is that we base the operations of our system specifically
on the system’s output, rather than on other, more general environ‐
mental factors. (For example, instead of monitoring the ups and downs
of web traffic directly, we will base our delivery plan only on the actual
rate at which ads are being served.) By taking the actual output into
account (that’s what “feedback” means), we establish a firm and reliable
control over the system’s behavior. At the same time, feedback intro‐
duces complexity and the risk of instability, which occurs when inap‐
propriate control actions reinforce each other, and much of our at‐
tention will be devoted to techniques that prevent this problem. Once
properly implemented, however, feedback control leads to systems
that exhibit reliable behavior, even when subject to uncertainty and
change.
A Hands-On Example
As we have seen, flow control is a common objective in enterprise
systems. Unfortunately, things often seem rigged to make this objec‐
tive difficult to attain. Here is a typical scenario (see Figure 1-1).
1. We are in charge of a system that releases items to a downstream
processing step.
2. The downstream system maintains a buffer of items.
3. At each time step, the downstream system completes work on
some number of items from its buffer. Completed items are re‐
moved from the buffer (and presumably kicked down to the next
processing step).
4. We cannot put items directly into the downstream buffer. Instead,
we can only release items into a “ready pool,” from which they will
eventually transfer into the downstream buffer.
5. Once we have placed items into the ready pool, we can no longer
influence their fate: they will move into the downstream buffer
owing to factors beyond our control.
6. The number of items that are completed by the downstream sys‐
tem (step 3) or that move from the ready pool to the downstream
buffer (step 5) fluctuates randomly.
7. At each time step, we need to decide how many items to release
into the ready pool in order to keep the downstream buffer filled
without overflowing it. In fact, the owners of the downstream
system would like us to keep the number of items in their buffer
constant at all times.
Figure 1-1. Block diagram of a workflow system. Items are being re‐
leased into the “ready pool,” from which they are transferred to the
downstream buffer.
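To make the scenario concrete, here is a minimal Python sketch of the system described in the list above. The class name `WorkflowSystem`, the share of the ready pool that moves downstream (uniform between 0 and 100 percent), and the number of completions per step (uniform between 0 and 10) are all assumptions chosen for illustration. The text specifies only that both quantities fluctuate randomly.

```python
import random


class WorkflowSystem:
    """Toy model: ready pool -> downstream buffer -> completed items."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.ready_pool = 0   # items we released that have not moved yet
        self.buffer = 0       # items waiting in the downstream buffer
        self.completed = 0    # items finished by the downstream system

    def step(self, released):
        """Advance one time step and return the new buffer length."""
        # Item 4: we can only add items to the ready pool.
        self.ready_pool += released

        # Item 5: a random share of the pool moves downstream
        # (assumed distribution: uniform between 0 and 100 percent).
        moved = round(self.rng.uniform(0.0, 1.0) * self.ready_pool)
        self.ready_pool -= moved
        self.buffer += moved

        # Item 3: downstream completes a random number of items
        # (assumed: 0 to 10), but never more than the buffer holds.
        done = min(self.buffer, self.rng.randint(0, 10))
        self.buffer -= done
        self.completed += done
        return self.buffer
```

The only input we control is the `released` argument to `step`. Everything else happens inside the model, which mirrors the constraint that we cannot touch the downstream buffer directly.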
It is somewhat natural at this point to say: this is unfair! We are sup‐
posed to control a quantity (the number of units in the downstream
buffer) that we can’t even manipulate directly. How are we supposed
to do that—in particular, given that the downstream people can’t even
keep constant the number of items they complete at each time step?
Unfortunately, life isn’t always fair.
Hoping for the Best
What are we to do? One way of approaching this problem is to realize
that, in the steady state, the number of units flowing into the buffer
must equal the number of units flowing out. We can therefore measure
the average number of units leaving the buffer at each time step and
then make sure we release the same number of units into the ready
pool. In the long run, things should just work out. Right?
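The strategy just described can be tried out in a few lines of Python. This is a sketch under stated assumptions, not the simulation behind the figures: the function name `simulate_open_loop` and both random distributions (transfer share uniform between 0 and 100 percent, completions uniform between 0 and 10 per step) are hypothetical choices made for illustration.

```python
import random


def simulate_open_loop(release_per_step, steps=5000, seed=0):
    """Release a fixed number of items each step; return queue-length history."""
    rng = random.Random(seed)
    ready_pool, buffer = 0, 0
    history = []
    for _ in range(steps):
        # Our only control action: a constant release, ignoring the output.
        ready_pool += release_per_step

        # Random transfer from the ready pool to the buffer (assumed uniform).
        moved = round(rng.uniform(0.0, 1.0) * ready_pool)
        ready_pool -= moved
        buffer += moved

        # Random number of completions (assumed 0 to 10, capped by the buffer).
        buffer -= min(buffer, rng.randint(0, 10))
        history.append(buffer)
    return history


# The assumed completion capacity averages 5 items per step,
# so we release exactly 5 items per step.
history = simulate_open_loop(release_per_step=5)
print("min queue:", min(history), "max queue:", max(history))
```

Note that nothing in the loop ever looks at `buffer` when deciding how much to release. The release rate is fixed in advance, so random deviations are never corrected.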
Figure 1-2 (top) shows what happens when we do this. The number
of units in the buffer (the queue length) fluctuates wildly—sometimes
exceeding 100 units and other times dropping down to zero. If the
space in the buffer is limited (which may well be the case if we are
dealing with a physical processing plant), then we may frequently be
overflowing the buffer. Even so, we cannot always keep the
downstream guys busy, since at times we can’t prevent the buffer from
running empty. But things may turn out even worse. Recall that we
had to measure the rate at which the downstream system is completing
orders. In Figure 1-2 (bottom) we see what happens to the buffer length
if we overestimate the outflow rate by as little as 2 percent: We keep
pushing more items downstream than can be processed, and it doesn’t
take long before the queue length “explodes.” If you get paged every
time this happens, finding a better solution becomes a priority.
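A rough way to see why such a small mismatch matters is to compare two fixed-rate runs: one whose release rate matches the average completion rate, and one that releases 2 percent more. The sketch below rests on assumptions that are not from the text: the function name `final_queue_length`, an average completion capacity of 5 items per step (completions uniform between 0 and 10), and a transfer share uniform between 0 and 100 percent.

```python
import random


def final_queue_length(release_rate, steps=100_000, seed=0):
    """Fixed-rate run; release_rate is an average and may be fractional."""
    rng = random.Random(seed)
    ready_pool, buffer, credit = 0, 0, 0.0
    for _ in range(steps):
        # Release whole items only; carry the fractional remainder forward.
        credit += release_rate
        released = int(credit)
        credit -= released
        ready_pool += released

        # Random transfer from the ready pool to the buffer (assumed uniform).
        moved = round(rng.uniform(0.0, 1.0) * ready_pool)
        ready_pool -= moved
        buffer += moved

        # Random completions (assumed 0 to 10, capped by the buffer).
        buffer -= min(buffer, rng.randint(0, 10))
    return buffer


true_rate = 5.0                                  # assumed mean completions per step
matched = final_queue_length(true_rate)
excess = final_queue_length(true_rate * 1.02)    # release 2 percent too many
print("matched:", matched, "2% excess:", excess)
```

Under these assumptions, the surplus is 0.02 × 5 = 0.1 items per step on average. Because nothing ever removes that surplus, the backlog should grow roughly linearly, to something on the order of 10,000 items after 100,000 steps.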