Dec 14, 2021

When the Tail Wags the Dog

S3: "agile"

I first joined AWS in 2010, working on S3. These were still very early days at AWS. We had only a handful of services—S3, SQS, SimpleDB, EBS, EC2, EMR, CloudWatch, ELB, and SNS. CloudFront and Route 53 were on the horizon. The notion of a VPC was still just a sketch in a patent.

While these were still humble beginnings, AWS was growing. S3 was the first service launched and thus the most mature, with dedicated teams for ops and each of the many different layers of the service. At the time, my part of the org focused on two products: the Import/Export service (now Snowball), and the AWS Management Console for S3. Each had its own little team of ~3 people for product development, and the two teams shared an oncall rotation.

With just three people writing code, we didn't need a lot of ceremony. PMs and EMs helped us decide which projects to work on, based on what came out of the OP1 exercise.

We'd do some planning, figure out ownership areas, kick off, then meet for a few minutes every morning to make sure we were aligned.

We'd make sure to check for pending code reviews at least once per day, or as requested. (Our use of Perforce made branching/merging an expensive operation, and changelists weren't especially useful for managing multiple overlapping changes unless you managed multiple client roots, so code review ended up being our main bottleneck.)

If you had a question, you either looked over the pony wall of your cubicle and asked, or typed it out in IRC.

We'd typically release every couple of weeks after sending our changes through Beta and Gamma environments for testing and baking. We'd get direct customer feedback and requests for help from the AWS Forums.

We had a lot of the dressings of "agile development"—gathering info, defining and documenting projects, prioritizing a backlog, daily standup, periodic delivery—but we didn't really call it that. It's just how we worked.

Project D: "Agile"

When I joined the highly secretive Project D—later known as Doppler, and finally released as Echo and Alexa—I learned that not every org in Amazon built software the same way.

Project D involved hardware and firmware. Hardware and firmware meant hard drop-dates and slow iteration times. And wrapped around these artifacts were dozens of new systems—ASR, intent recognition, TTS, remote device management, V1 app integrations for things like music and weather, an activity feed service—all being built simultaneously to power that ultimate litmus test of enjoyability: "Alexa, play songs by Sting."

For Project D, PMs and TPMs ran the show. And they were Agile with a capital "A". They defined epics. They wrote stories. They groomed backlogs. They managed releases. Engineers showed up to sprint planning, pulled in so many points' worth of work, and did the needful. When I left, about a year before launch, we had over 200 people working on the project, all moving in a beautifully choreographed Agile dance.

And it worked. It wasn't always perfect, but it worked. It worked because even though we adopted a process, that process served us in our goal of achieving very specific desired outcomes that laddered up to a well-defined product definition.

Avoiding the Process Trap

So here we have two examples: one where a small team builds software in tight iterations, with little process and a minimal amount of upfront planning; and another where a much larger team formalizes these "agile" activities in such a way that they maintain the lightweight cycles but can scale and orchestrate them across many teams in a top-down manner.

Where teams face danger is transitioning between these two states. As a team grows from 3 to 5, 10, 20, 40 people, communication becomes harder and harder. We manage this chaos by breaking the larger team into smaller, autonomous units, each responsible for a slice of the overall product. The theory is that the smaller teams on the ground benefit from lower communication overhead for fast-turn work.

But there's a common trap that teams fall into: they defer and delegate the integration of their activities into the whole.

If there's no one to catch that delegation, the result is abdication. And even if someone catches it, over time the work of integrating outputs into a coherent whole becomes too much for any one person (or even team) to manage.

The feedback cycles are too slow. The work starts to drift in different directions. Teams let the process dictate the product—letting the tail wag the dog. The resulting product becomes incoherent. And then it all falls down.

Unfortunately, even when you see this happening within your team, the cause is not obvious, because it happened before any of the symptoms of dysfunction surfaced: the teams were not organized around intentional, focused outcomes that they could own end-to-end. Instead, teams end up focused around the outputs and artifacts—the features and systems that they built.

Sometimes there's a strong affinity between the system a team owns and the needs of a customer, but more frequently the needs of a customer span many systems. When this is true, and if you've oriented teams around systems, no single team understands the end-to-end journey of the customer, or owns the key outcomes the product is trying to achieve for the customer as they take that journey.

A large, mature organization can mitigate this problem at great expense by throwing more people at it to help organize the chaos: researchers who provide an integrated picture of the customer journey; PMs who drop into a more tactical mode of working, prescribing features instead of operationalizing a strategic direction through the definition of key metrics; longer development times to course-correct things late in the development cycle. Gradually, they pull together the pieces into something coherent.

But for small teams working on nascent products with limited resources, this lack of org-market integration is an existential risk. When your runway is limited, the only thing that matters is making sure you're continuously realigning the resources of your org against the most urgent needs of your customer.

When you're small, the best way to do this is to orient your teams around outcomes that they can own end-to-end. The tradeoff you make is giving up the natural tendency toward internal coherence (architecture, scalability) to achieve external coherence (a good product). But you can mitigate this in other ways (e.g., through shared architectural principles that all agree to; by staffing a team whose customer is other internal teams, and whose charter is to maintain that internal coherence; through architecture reviews and brown bags).

In fact, this is exactly what AWS did. My little team within S3 was part of a much larger whole, but had goals it could execute on end-to-end with just three people. This is how AWS scales in general: by building services staffed by service teams. "Service" in this context means the end-to-end value you provide your customer. Sometimes that customer is an actual paying customer, and other times it's another internal team. But that customer orientation is key. Goaling on outcomes is key. The careful division of teams to enable end-to-end execution against those goals is key.

Somewhere along the way, many engineers took the word "service" and changed it from something that describes the posture of a team toward a customer and their problem, to something that describes the runtime characteristics of a system. It changed from outward-facing to inward-facing. This is wrong. If you find yourself thinking of services in these terms, stop. It's a trap.

Look outward. Focus on outcomes. Let your desired outcomes define your process, and your process reinforce your desired outcomes. Don't let the tail wag the dog.