Scaling Engineering Teams With Distributed Ownership

After 20 years leading engineering and product teams and serving as CEO, and nearly 15 years working in the Bay Area with the best engineering leaders in the world, I’ve come to appreciate how quickly management culture has evolved in the past decade, and how little has been written about the operating details of practical modern engineering management. This post tries to shed some light on basic roles, responsibilities, and practices on a modern engineering team with distributed responsibilities, and I hope it is especially helpful to everyone now learning how to work in a more distributed and remote setting during the COVID-19 crisis. The content has benefited from review and input from more than ten of the best folks I’ve encountered over the years, who currently own large chunks or all of engineering at places like Google, Facebook, LinkedIn, Twitter, and fast growing startups, all thanked at the end of the post.

One of the most important insights I’ve gained is that teams are happy when they are winning, have a clear mission, and team members are growing individually. There’s no long term happiness without winning, and the most important role of an engineering leader is to enable their team to win. Distributed responsibility is one of the most powerful mechanisms to facilitate both winning and personal growth.

Engineering teams fall apart as they scale past 10 people without some form of organization, and distributed responsibility scales better than centralization — it creates space for many strong leaders on the team, attracts and retains higher caliber engineers who seek more ownership, maximizes personal growth, and guides each team member to the highest impact work.

Running teams with distributed responsibility is becoming better understood. The distinction between the Engineering Manager and Tech Lead is pretty well established now, and we have seen that traditional Agile and Scrum planning has aspects that are more centralized, such as central backlog grooming, which aren’t congruent with distributed ownership of larger chunks of work. Most high performing teams use some common patterns like an Approval Matrix with Round Robin rotation for responsibilities such as code reviews or release shepherd.

Three ingredients for happy teams

You can build happy teams with three ingredients: Mission, Winning, and Growth.

Mission: People can only truly be happy when they are working on a mission they believe in. It doesn’t need to be overtly humanitarian; you can be excited about building tools to help teams be more productive.

Winning: People need to understand the scoreboard for the company, their team, and themselves. In ‘So Good They Can’t Ignore You,’ author Cal Newport makes the point that people are happy when they’re successful, and they’re successful when they are applying their strongest skills, getting winning results, and being recognized for their impact. Great managers help each team member find their strongest skills, and apply those to the KPIs that matter in order for the team to win.

Growth: When a team is successful along a path towards a mission they believe in, people want to grow their role and have more impact. Great managers understand the career goals and growth path for every team member, and they are always jointly optimizing for the team’s success and the individual’s personal growth. In doing so, they ‘manage to the level’: for more junior and mid level folks, that’s helping each team member find the work that maximizes their impact, and for more senior folks that’s providing the framework for them to identify opportunities themselves based on context like team KPIs and the manager’s short term problems and priorities.

How teams fall apart as they scale

Teams can function well with fewer than 5-10 people. At that size there’s no need for subteams or team splits, or for clear roles and responsibilities like engineering manager and tech lead, and it’s easy to have team alignment on the right way to do things. Sometimes this breaks down at 5 engineers, sometimes at 10, but it rarely makes it past that point before descending into chaos without some kind of more formal roles and responsibilities. The more the team is remote and geographically distributed, the faster this breakdown will come.

Engineering organizations fall apart as they become bigger and more complex. As you go from 10-20, it gets harder to have just one big team — basic things like planning or standups become unwieldy. Centralized planning reduces both winning and personal growth — it disempowers the team, slows things down, detaches the team from how their work maps to business KPIs, stunts career growth, and wrecks the positive vibes. Full bottom-up decentralization usually slows things down, creates conflict, and results in chaos. Organized decentralization creates a few clear roles and responsibilities, and empowers the team to rotate ownership of everything important.

Engineering teams also often lower the bar as they scale instead of raising it. The antipattern I see a lot is hiring too fast based on headcount targets instead of productivity targets. Instead, be output based and embrace outliers. The narrative that ‘you need different people at different stages’ is used to justify making easier, weaker hires with the hidden thought that the earlier folks will have more power and prestige, and the army of minions will just take their brilliant direction. It’s better to focus on overall team output per person, and invest more in finding and recruiting special hyperproductive, highly autonomous and knowledgeable people for the team. They will bring the average up and push the early team, rather than looking for direction and cutting the early team’s bandwidth. A very important corollary here: only allow teams that are demonstrating output to hire. A frequent antipattern is teams that can’t deliver for various reasons blaming it on being shorthanded, and then hiring more people into a dysfunctional team, which can often decrease rather than increase output, the well established ‘mythical man-month’ effect. Teams should be bursting at the seams *and* have stellar performance before you begin hiring new folks into the team; don’t allow teams to ‘blame it on lack of headcount.’ Hold teams accountable to their output expectations at their current scale before allowing additional hiring. It’s possible there’s a misalignment of expectations, and then it’s on the team to push back and reset expectations based on solid reasoning and data. For example, maybe the manager doesn’t see that one of the reasons a backend service is a bit funky and moving slower is that the team needs a more senior engineer with specific skills, and a few other engineers are pulling double duty trying to design a complex part of the system they aren’t equipped to tackle yet.

Another common fail pattern is not applying technical aggressiveness in a practical way, either by not seeing and investing enough where the core IP really is, or by overengineering something that should have been a quick hack. It’s very common that teams burn a lot of calories overengineering things very badly, applying technical aggressiveness to trying new framework flavors of the day, or laying out a complex architecture with many layers of useless indirection, which of course is ‘future proof’ in some evil way. Catching this is one of the most important roles of design and code reviews, and the style guide should include clear rules about code simplicity and cutoff points where the code veers into over-engineering. It’s important not to be complacent when delegating: delegation needs to come with governance. When delegating ownership of subsystems to someone who may not have the experience to find the right balance, you can easily be surprised with output that’s overly naive or overly complex. Design reviews, especially for core systems and large projects, should go through a sanity check or second level review by someone trusted to have the right experience and insight. On the other hand, teams often miss investing enough time in being awesome at their core competency and core IP, usually because it’s treated as just another chunk of work, and the over-engineering projects draw time and focus away from improving things where the core value is and potentially even adding new possibilities for the product. That’s why it’s important to prioritize core IP engineering possibilities in the backlog, and always compare them to new functionality or design changes that might seem more digestible but are less valuable. This is like Geoffrey Moore’s idea of “Core vs Context” applied to technical work.

A culture of distributed ownership sets up winning and growth

Among the most important responsibilities of the Engineering Manager is the constant joint optimization for team success and individual team member personal growth. Managers seek to distribute work as widely as possible, aligned with individual skills for highest impact and focused on areas of growth, optimizing team results based on individual impact and growth. They empower teams with clear goals, and separate direction from execution to allow for ‘managing to the level’: enough direction, at the right level for each person, that they can own their own execution and as much of the direction as they can while delivering success. Teams might have a slow moving KPI, but managers might let their individual team members propose and operate on their own faster moving KPIs that they believe will be leading indicators of the slower moving team KPI.

There are some additional structural factors that will help the team distribute ownership for maximum career growth and positive vibes. First, you need the right mix of senior, mid-level, and junior engineers. Too bottom heavy and your senior engineers spend all their time mentoring instead of building. Too top heavy and there’s not enough space for leaders and people get frustrated with one another or check out. This is a tough balancing act based on people, domain, and stage of the company, and it shifts as your junior engineers progress. After all, we’re saying one of the pillars of happiness is growth, so what happens to team balance as we grow the juniors to mids and mids to seniors? Your maximum bottom-heavy mix should be something like 10% senior, 30% mid-level, and 60% junior, and your maximum top-heavy mix is probably the inverse. The simpler the work and the context, the more you can bias toward the bottom-heavy ratios and larger teams, and the more difficult, specialized, and complicated the context, the more you want to consider smaller and more senior teams. Many other factors come into play here, for instance the need for speedy hiring and growth of the organization. If the company and teams are growing fast, then engineers can grow with the company and find new opportunities, and they’ll tend not to outgrow the setting and get frustrated with each other fighting over space for ownership. Since we need to hire faster, a setup with somewhat larger teams and more juniors makes sense as long as we have enough seniors to mentor. We’ll be able to hire faster, train folks up in time, and have plenty of opportunities for them to grow into. On the other hand, if the organization isn’t growing fast, then we need to be careful of becoming too top heavy and leaving not enough oxygen in the room for leaders, so we may slow hiring way down and focus on smaller teams with more senior folks owning larger chunks.

It’s worth having basic levels, parallel tracks for engineering manager and tech lead, and investing in an annual or biannual growth plan and review. Lastly, the manager must know when to protect vs eject: protect a team member from some externalities to create space for clear focus, vs eject a team member into the messy outside world of a client interaction or production environment to drive ownership.

Responsibilities are distributed as widely as possible to the levels and skills that can manage them. For example, every engineer should be able to review code and maintain a component of the production stack, including deploying safely with confidence and owning an on-call rotation. This should help force out sufficient testing, process, automation, and monitoring such that the engineer doesn’t worry too much about blowing things up. I always recommend having engineers own production (or at least part of it) for quite a long time before separating out technical ops, because it enforces a lot of upstream sanity when the engineers need to own their own work in prod. When a new level one junior person joins the team, we should expect them to be able to review code and deploy within some number of weeks or months depending on the complexity of the project. On the other hand, we don’t necessarily need to expect every engineer to interact well with customers, or lead technical design for a major component of the system. There are specialist skills, and there are skills we don’t expect from people until they become more senior.

The person doing the work owns the execution, even though they may not own the direction. Rather than top down execution driven by meetings, we rely on bottom up execution based on top down direction and review. A more centralized culture has management meetings top-down, and managers build execution plans in meetings with their bosses, then break them down into execution plans with their teams. A more decentralized culture doesn’t necessarily mean bottom-up direction. Total top down is command and control, total bottom up is unmitigated chaos. It’s good to strike a balance with management meetings about direction, then supply the direction to the teams and empower them to build their execution plans themselves, which managers can review async or in meetings. Delegation of authority becomes more important as the team grows into the double digits; at that size you can’t just delegate task management, you need to spread the actual authority to make important decisions out to the broadest set of nodes that have the skills to do so.

https://youtube.com/watch?v=l_Z5Htvg99U

[ping for slides]

Modern engineering roles and responsibilities, and why Scrum is awful

In a modern setting, the engineering manager owns team execution results and career growth, and they are always simultaneously optimizing for both. They are always trying to get engineers to own bigger chunks of work that they can break down themselves, eventually even doing so for other engineers as they grow into team leaders. There’s no notion of a single groomed backlog that everyone can just take the next ticket out of, because unlike in Scrum, the engineering manager is embracing different kinds of work at different levels for different engineers: managing to the level, rather than assuming everyone is at a junior level and needs detailed tasks queued up for them. The downside of “just take the next ticket” is robbing engineers of the learning opportunity to structure and decompose the work themselves.

For the past decade, there’s been a lot of work on separating the engineering manager from the tech lead role as opposed to what has previously been called the ‘TLM’ or tech lead manager, which holds both responsibilities and usually does one or both of them poorly as a result. A TLM role is fine and makes sense when the team is very small and a separate tech lead and engineering manager just feels like too much leadership for a small team. If you have a senior engineer who wants to start managing and can take on a few reports, it can work well. However sometimes you have a TLM at scale on a larger team, or TLM as a transition-to-management role, which in both cases can leave the individual responsible for too much on an absolute basis or relative to their training as a manager.

As the team scales, it’s important that someone is responsible for organizing the team and process and ensuring people are happy, productive, and growing, and that someone else is responsible for the technical direction and systems of responsibility like code reviews, deploys, and so on. If the team gets too large without splitting these roles, one area or the other invariably dries out from lack of focus and causes management or technical issues on the team. It’s just too much work for one person on teams larger than 5-10, and dedicated tech leads become needed, even though the engineering manager may still be quite involved technically.

Connected to the engineering manager and tech lead split is the notion of growing engineers’ careers into tech lead and engineering manager roles. The tech industry sometimes tries to push great ICs into these roles where it may be better to leave them as great ICs; some folks have a multiplying impact without needing or wanting to lead a team more formally as a tech lead or manager. Managers are responsible for spotting rising mid-level talent and growing them into technical or management leaders. Both future tech leads and engineering managers show signs of wanting broader impact by shaping the work for people on the team rather than just doing their own work. Those headed for a tech lead path will want to impact the technical direction, build process, code reviews, deploy infrastructure, and so on. Those headed for the management path will tend to take on sub-team lead roles, taking on some of the engineering manager responsibilities for smaller teams such as owning the planning process, and usually starting to take on 1:1s for one or two direct reports. Progressing engineers through sub-team leads and more technical seniority into engineering managers and tech leads is really the lifeblood of a healthy engineering culture. If you’re able to grow your mid level team into leaders, you’re on the right track to build an organization with an awesome career growth engine, and that’s one of the most important keys to happiness (Mission, Winning, Growth), and the one that’s most under control of the engineering organization.

Lastly, here are my ideas about planning. You’ll see that a lot of them align with classical agile ideas, but are very different from Scrum. First, the team owns their own decomposition, estimation, and tickets. There isn’t an elaborate centralized ‘grooming’ process or ridiculous estimation games like ‘planning poker.’ The basic agile principles and planning ideas are pretty sane, and they were a valid reaction to the crazy long specs and rigid processes that came from the 80s and 90s. However, these practices evolved a lot in consulting shops that didn’t have ownership of results, and also had a staffing model for engagements with a large number of junior folks on projects relative to a small number of senior folks. When everyone is empowered to work on what has the highest impact on the team’s outcome, you can design your team, organization, and processes differently. Additionally, the ‘Scrum’ side of agile became about selling certifications and creating deep confusion about roles and responsibilities. To clear things up: the ‘product owner’ is just the PM and the ‘scrum master’ is just a set of planning responsibilities owned by the engineering manager. In all the best engineering cultures I’ve seen for the past ten years, we don’t really see separate project managers or folks outside engineering owning the work queues.

In what I consider to be a more modern and decentralized approach, Engineering Managers work with engineers, who take ownership of chunks of work, decompose them into tasks, estimate, and get the team and manager to review async or in a planning meeting. Everything is tied to the team KPIs, which are tied to the company KPIs, but we don’t need any fancy formats like user stories. Engineers are even more empowered to think about customers, product, and business value in this model: they are looking at how they can impact team KPIs, and are empowered to push managers for the work they think is most impactful on the KPIs. The tech lead owns the technical queue: tech debt, testing issues, infrastructure, etc.

The engineering manager is accountable for the team’s execution results, so they facilitate planning debates and own the negotiation with the team about final priorities, which helps avoid analysis paralysis and decision by committee. They also own dealing with the systemic stuff that impacts results. For example, engineers are notoriously bad at estimates, so engineering managers own pushing back on estimates and training the team to get better at looking at the full context of reality: testing, bugs, production, scope changes, learning a new part of the code, etc. As teams scale and you have managers-of-managers, it’s important that the more senior managers facilitate the complex cross-team coordination for major efforts such as paying down large swaths of tech debt, large migrations and re-architectures, and dependency management.

Scale through distributed responsibilities not centralized overlords

We’ve all seen or been the Code Review Overlord. Even when unwillingly stumbling into the role, the pattern is always self-reinforcing; one person knows too much or has too much initiative and it’s therefore faster to get them to review everything. The rest of the team always says they’ll try to do more code reviews, some people even try, but the bottleneck just gets worse until something blows up.

Similarly to the Code Review Overlord doing all the code reviews, The Bug Fixer is the one who fixes all the bugs. Usually the system is set up to ‘reward heroes’ and therefore positively reinforces The Bug Fixer to dive into action as if they are the hero of their own spy movie. They’ve accumulated the most knowledge on the team of how everything in the system actually works in production, and they have great debugging skills in both live systems and locally. Unfortunately this system understanding and debugging skill is critical but expensive to distribute around the team, because there’s a lot more context in finding and fixing what’s wrong in a real running system than there is in just reading and roughly understanding some code.

Teams should own components, not individual engineers. Sometimes team leaders think it’s best for a different person on the team to own different components of the system. In some cases it may make sense to own totally different parts of the system, such as the iOS app, the web services, a Python machine learning service, or the deployment infrastructure. However, it almost never makes sense to own different components within those parts of the system. Say, for example, you have a team of folks working on an iOS app. The entire team should understand the client and services. It shouldn’t be that one engineer owns a component like the user profile, and another engineer owns a bunch of analytics screens. That creates context silos and weak design (since it mostly just needs to be understood by one person), and it makes the team less flexible in moving to work on the most important problems this week. Like many things in management, this isn’t absolute: you might find ‘system cut points’ where you can have natural higher performing sub-teams by focusing context on part of the codebase, either by engineering function, app, service, or even team location.

The n00b Does the Thing Pattern

Like the Code Review Overlord, Deploy Tzar, or Bug Fixer, managers easily fall into the rut of having their go-to person. Whenever there’s a problem, the manager can count on Captain Go-To to provide a trustworthy and timely fix. That behavior pays off in the short term, but longer term it robs opportunities from the rest of the team and you end up with a weaker team overall. It pays to invest the time to make everyone on your team a go-to.

One simple way to address this head on is to pick the least qualified person to do things: The n00b Does the Thing. Just joined the team and the onboarding documentation wasn’t great? Perfect, because as the new engineer on the team your first task is updating it when you roll on. Don’t know anything about that part of the code and there’s a bug in production so you’re scared? Great, you own fixing the bug and finding someone who knows what they are doing to help and review as needed. Tech lead of the iOS app is onboarding a new engineer while running a big redesign? Tech lead sets direction and the new engineer migrates components.

The n00bs bring fresh eyes, and often help find better solutions. They enable a pull model to work around the expert’s curse — the experts don’t know all the knowledge locked up in their heads that they need to impart, so the n00bs pull it out of them by being thrown into the deep end and needing expert help.

The Approval Matrix with Round Robin Pattern

Many problems like code reviews, design reviews, interview training, bug fixes, release shepherd, on call rotation, or production migration can be solved with the Approval Matrix with Round Robin Pattern. It works great for distributing responsibilities that tend to get concentrated, and also for things that nobody really wants to do.

An Approval Matrix is just a table where rows are team members and columns are things they are approved to do. So let’s say that your team is growing fast and only two people can do the technical interviews but you have 15 on the team right now. The leads want to make sure the whole team is trained to do each step of the interview from the intro call, to technical phone screen, to the final multi-hour deep dive, and they don’t want to sacrifice quality and just have everyone start interviewing with no supervision. They’d start with four columns: intro, phone, deep dive, and interview lead. They’d begin having engineers shadow each step, then do each step with a lead on a call, then get a sign off and a check next to their name for that interview step. After they become experts, the leads will sign them off to join the group of interview leads, and now they can train new engineers on how to interview. Before long the matrix will be filled in, and the initial leads don’t even need to train interviewers anymore because they’ll have trained new leads.
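As a rough illustration, the interview example above could be kept as a small data structure rather than a spreadsheet. This is only a sketch: the class name ApprovalMatrix, the step names, and the rule that only an approved interview lead can grant a sign-off are assumptions made for the example, not a prescribed implementation.

```python
from dataclasses import dataclass, field

# Hypothetical column names, taken from the interview example in the text.
STEPS = ("intro", "phone", "deep_dive", "interview_lead")


@dataclass
class ApprovalMatrix:
    """Rows are team members, columns are things they are approved to do."""
    steps: tuple
    # Maps each member name to the set of steps they are signed off for.
    approvals: dict = field(default_factory=dict)

    def add_member(self, name):
        self.approvals.setdefault(name, set())

    def sign_off(self, name, step, approver):
        """Record a sign-off. Assumption: only an interview lead may approve."""
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        if "interview_lead" not in self.approvals.get(approver, set()):
            raise PermissionError(f"{approver} is not an interview lead")
        self.add_member(name)
        self.approvals[name].add(step)

    def approved_for(self, step):
        """Everyone currently approved for a step, in alphabetical order."""
        return sorted(n for n, done in self.approvals.items() if step in done)

    def coverage(self):
        """Fraction of matrix cells that are filled in."""
        total = len(self.approvals) * len(self.steps)
        filled = sum(len(done) for done in self.approvals.values())
        return filled / total if total else 0.0


if __name__ == "__main__":
    matrix = ApprovalMatrix(steps=STEPS)
    matrix.approvals["ana"] = set(STEPS)  # seed one of the initial leads
    matrix.add_member("bo")
    matrix.sign_off("bo", "intro", approver="ana")
    print(matrix.approved_for("intro"))
    print(f"matrix coverage: {matrix.coverage():.0%}")
```

Tracking coverage gives the leads a simple number to watch as the matrix fills in over time.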

Round Robin Rotation is a pattern that goes with the Approval Matrix. I can’t tell you how many times I’ve heard — no we don’t need a system, that’s too much process, we’re just going to rotate the [code reviews | bug fixes | deployments] organically. Bullshit. This never happens. Every time a team has said this, it ends up with Code Review Overlords, Bug Fixers, and Deploy Tzars. Once you build an Approval Matrix, just run a round robin process — whoever is up next in the round robin does the next [code review | bug fix | production deploy | pager duty cycle]. If you want to get fancy you can even automate the round robin using git hooks or something. You want as many people on the team as possible approved for as many responsibilities as possible, and you want the responsibilities rotating as evenly as possible — if you get this right, the entire team is empowered with the context to do anything, feels balanced and fair, avoids hero worship, and is much more productive. It also reduces the emotional overhead to negotiate this and figure out who does it each day. Let go of trying to direct it manually and just schedule it. Automating coordination should appeal to engineering sensibilities, but for some reason teams are often happy to automate the build and tests, but resist automating this kind of human coordination, and as a result they burn a lot of calories doing it manually.
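Scheduling the rotation can be very little code. The sketch below is one possible way to do it, under assumptions of my own: the RoundRobin class, the member names, and the exclude parameter (used here so that an author never reviews their own change) are all hypothetical, and a real setup would need to persist the rotation position somewhere and hook into whatever review or paging tool the team uses.

```python
class RoundRobin:
    """Rotate a responsibility across the team in a fixed order."""

    def __init__(self, members):
        self._members = list(members)
        self._next = 0  # index of whoever is up next

    def next_assignee(self, approved, exclude=()):
        """Return the next member who is approved and not excluded.

        `approved` would come from the relevant Approval Matrix column.
        """
        n = len(self._members)
        for offset in range(n):
            idx = (self._next + offset) % n
            candidate = self._members[idx]
            if candidate in approved and candidate not in exclude:
                # Advance the rotation past whoever was just picked.
                self._next = (idx + 1) % n
                return candidate
        raise LookupError("no approved member available")


if __name__ == "__main__":
    rotation = RoundRobin(["ana", "bo", "cy", "dee"])
    code_reviewers = {"ana", "bo", "dee"}  # cy is not signed off yet

    # Each change is reviewed by whoever is up next, skipping the author.
    for author in ["bo", "ana", "cy", "dee"]:
        reviewer = rotation.next_assignee(code_reviewers, exclude={author})
        print(f"{author}'s change is reviewed by {reviewer}")
```

One simplification to be aware of: a member who is skipped because they are excluded for one assignment is not compensated later, so a stricter fairness rule would need a little more bookkeeping.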

Applying distributed ownership to design and code reviews

If an engineer is building something that’s going to be more than a week of work, that’s probably the size at which you want to do a quick design review.

Have a style guide for architecture / design, and use simple design review principles: i) limit to one-pagers most of the time, and only rarely longer for really big and complex things, which should still be ruthlessly as short as possible, ii) limit to one round of comments, iii) engineer owns final decisions, iv) tech lead has final sign off, v) consider archiving and indexing design reviews, to give everyone a snapshot of what other teams are doing, to reference past reviews as precedent for emerging patterns, and because too many documents or comments piling up from a team almost always signals that something has gone wrong with the team’s decision making process. Check out the Kafka Improvement Proposal (KIP) process as one good example of a sane design review process.

Doing design reviews for stuff that will take a week or longer stops misguided or hasty designs that introduce immediate debt. Applying these design review principles will allow a fast throughput of new designs while stopping the twin evils of fruitless debates and design by committee. The designer has maximum ownership subject to signoff from the tech lead. The tech lead should just be sanity checking important issues and ensuring standards here in order to propagate out maximum distributed ownership for technical decisions; they need to be careful to resist tweaking work based on immaterial preferences.

Code reviews are similar to design reviews. Everyone has their own personal opinions, so we want to focus the code review on common guidelines rather than idiosyncratic troll comments and flame wars. Tech leads own facilitating among the team to generate a proper style guide that is the prevailing right way to do things. Make sure the style guide and reviews cover not just design but also over-engineering, testing, non-functional, and production infra considerations.

Applying distributed ownership to other decisions and documentation

Design and code reviews are ultimately just documentation of two types of decisions: how to design something and how to implement it. We make lots of other decisions on engineering teams, and need to delegate authority to propose, review and make these decisions. It turns out the same sort of general approach works for everything: be clear about goals and decision owners, have people document what they propose (and keep that structured in a sane and searchable way like Google Docs, a wiki, etc), have other people review it, and sign off on the final. Roadmaps, plans, experiments, updates, and even external-facing proposals can all be done the same way. By centering around written documents and review, you’ll create a culture of clear reasoning and data-driven decision making, make better and more rigorous decisions, have good governance over more broadly distributed authority, and provide context on how previous decisions were made to inform current and future work.

Once you get a feel for the basic tools like Approval Matrix, Round Robin, and propose-review-decide cycles, you find that you can attack most management problems using various configurations of the same simple set of ideas.

I’d like to thank all the reviewers for their feedback and input, including: Ajit Singh, Jean Hsu, David Andrzejewski, Jason Warner, Michael Montano, Aaron Boodman, Emil A Eklund, Lindsey Simon, Jason Wolfe, and Dave Cohen.