Two Ways To Give A Team Autonomy

I worked at two companies recently. Both are a decent size. Both have real engineering teams that ship to production every week. They take very different approaches to one specific question: how does an engineer get access to the cloud to do their job?

I am not going to name them. For this post, it is Company A and Company B. The point is the approaches, not the brands. I also want to make clear that I am not here to say one is right and one is wrong. Both teams I worked with shipped good software. Both teams had engineers who cared. But the access model changes how the work feels, and what kinds of mistakes are easy to make. That is the part I want to write about.

The two setups

Company A: team owns the account, GitOps decides who has access

At Company A, each team has its own AWS accounts, one for production and one for staging. The accounts are genuinely owned by the team. Production is a bit more careful: nobody has write access by default. If I need to make a change, I have to ask a teammate to escalate me from read-only to admin, which gives me both read and write permission. Staging is simpler. I get admin on staging from day one. I can break staging however I want, as long as I do not break prod.

The access itself is managed through GitOps. There is a repo with a list of who has what access in which account. I can open a PR to add myself, or to add a teammate, or to remove someone who left the project. The PR has to be approved by someone on the team. That is it. No ticket, no Operations team in the loop, no waiting. The PR is the change. Once it merges, the change is live.
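To make the flow concrete, here is a minimal sketch of what a merged PR does to the access list. It is not Company A's actual tooling. The names AccessChange and apply_change, the role strings, and the people are all made up for illustration. The rule that the author cannot approve their own PR is also my assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccessChange:
    """One PR against the access repo (hypothetical shape)."""
    author: str
    approver: str
    account: str
    user: str
    role: Optional[str]  # None means "remove this user from the account"
    reason: str


def apply_change(access, team, change):
    """Return the access list as it would look after the PR merges."""
    if change.approver not in team:
        raise PermissionError("approver must be on the team")
    # Assumption: self-approval is not allowed.
    if change.approver == change.author:
        raise PermissionError("author cannot approve their own PR")

    # Copy so the previous state stays intact, like the previous commit.
    updated = {account: dict(users) for account, users in access.items()}
    users = updated.setdefault(change.account, {})
    if change.role is None:
        users.pop(change.user, None)
    else:
        users[change.user] = change.role
    return updated


team = {"alice", "bob"}
access = {
    "staging": {"alice": "admin", "bob": "admin"},
    "prod": {"alice": "read-only", "bob": "read-only"},
}
change = AccessChange(
    author="alice",
    approver="bob",
    account="prod",
    user="alice",
    role="admin",
    reason="apply hotfix to prod",
)
merged = apply_change(access, team, change)
```

After the merge, merged["prod"]["alice"] is "admin", and the old access dict is untouched, in the same way the previous commit still exists in the history.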

The cool part is that the audit log is the git log. Every access change is a commit, with a reason in the PR description, reviewed by a real person on the team. If you want to know who had admin on the prod account in March, you can find out in five minutes.
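The "who had admin in March" question is really just replaying history. Here is a rough sketch of that idea, assuming each commit has already been parsed into a small record. The Commit type, the admins_during function, the names, and the dates are all invented for the example. In real life you would get these records from the git log of the access repo.

```python
from datetime import date
from typing import NamedTuple, Optional


class Commit(NamedTuple):
    """One access change, as recovered from the repo history (hypothetical)."""
    when: date
    account: str
    user: str
    role: Optional[str]  # None means the user was removed


def admins_during(history, account, start, end):
    """Return everyone who held admin on `account` at any point in [start, end]."""
    relevant = sorted(
        (c for c in history if c.account == account),
        key=lambda c: c.when,
    )

    # Replay everything before the window to get the state on day one.
    current = {}
    for c in relevant:
        if c.when < start:
            current[c.user] = c.role
    holders = {user for user, role in current.items() if role == "admin"}

    # Anyone granted admin inside the window also counts,
    # even if it was revoked again the next day.
    for c in relevant:
        if start <= c.when <= end and c.role == "admin":
            holders.add(c.user)
    return holders


history = [
    Commit(date(2024, 1, 10), "prod", "alice", "admin"),
    Commit(date(2024, 2, 1), "prod", "carol", "admin"),
    Commit(date(2024, 2, 20), "prod", "alice", "read-only"),
    Commit(date(2024, 3, 5), "prod", "bob", "admin"),
    Commit(date(2024, 3, 6), "prod", "bob", "read-only"),
    Commit(date(2024, 3, 10), "staging", "erin", "admin"),
    Commit(date(2024, 4, 2), "prod", "dave", "admin"),
]
march_admins = admins_during(history, "prod", date(2024, 3, 1), date(2024, 3, 31))
```

Here march_admins contains carol, who still had admin from February, and bob, who had it for one day. Alice was downgraded before March and dave was added after, so neither shows up.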

Company B: locked accounts, tickets, least privilege

At Company B, the cloud accounts are locked down. I cannot just log in and do things. If I need to do something, I raise a ticket. The ticket has to say which specific service I need, and what I am going to do with it. The system gives me the lowest permission I can get away with for that service. Sometimes that is fine. Sometimes I find out halfway through my work that I need three more services, and I have to raise three more tickets.
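The "three more tickets" problem is easy to show as a set difference. This is only a sketch of the idea, not how Company B's system works. The service names, the action names, and the tickets_needed function are invented, and I am assuming one ticket per service.

```python
def tickets_needed(granted, required):
    """Return the services, and the missing actions on each, that need a new ticket.

    Both arguments map a service name to a set of allowed or required actions.
    Assumption: one ticket covers exactly one service.
    """
    missing = {}
    for service, actions in required.items():
        lacking = set(actions) - granted.get(service, set())
        if lacking:
            missing[service] = lacking
    return missing


# What the first ticket gave me.
granted = {"object-storage": {"read"}}

# What the task turned out to need once I was halfway through.
required = {
    "object-storage": {"read"},
    "queue": {"send"},
    "database": {"read"},
    "secrets": {"read"},
}

more_tickets = tickets_needed(granted, required)
```

In this example more_tickets has three entries, one each for queue, database, and secrets. The original grant covers object-storage, so it does not appear.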

There is no GitOps. All access changes go through the Operations team. They review the ticket, they apply the change, they close the ticket. In most cases, the engineer and the Operations team have never met. The audit log is the ticket system. It exists, it is searchable, but it is a different shape from a git log. You can find the change, but you cannot easily see why the change was made, or who on the engineering side argued for it.

The day-to-day difference

The most obvious difference is speed. At Company A, if I need admin on prod for an hour, I ask a teammate, they approve, and I am in. Total time: maybe 15 minutes, mostly the time it takes for a human to look at Slack. At Company B, the same request is a ticket, a wait, a review, an approval, and then access. Total time: hours, sometimes a day, sometimes more if the Operations team is busy or if the ticket is missing context they want.

Speed is the thing everyone notices first. But the deeper difference is what speed does to how you work.

At Company A, I can try things. I can spin up a new resource, test an idea, break something, and learn from it, all inside a 30-minute window. The cost of being wrong is small, and the cost of being curious is also small. That changes what I work on. I am more likely to try the slightly weird approach, the one that might not work, because trying it is cheap.

At Company B, I think twice before I try anything. Not because the Operations team is slow on purpose. They are not. It is just that asking for access has a cost, so I optimise for not asking. I pick the safe approach, the one I know will work, the one I can plan for in advance. The slightly weird approach is the one that needs a lot of permissions, and I do not have a lot of permissions, so I do not try it as often.

That is the part I did not expect. The access model changes what kind of engineering work is easy. Not just how fast I can do it, but what I choose to do in the first place.

Who pays when things go wrong

Both companies had outages. Both companies had security incidents. The shape of the response was different, and I think that is interesting.

At Company A, when something went wrong, the team owned the fix end to end. We had the access, we had the context, we had the logs, we could roll back our own change. The Operations team was not in the loop unless we wanted them to be. Sometimes we did, because they had cross-team context we did not have. But we did not need to wait for them.

At Company B, when something went wrong, the team that owned the service often had to coordinate with the Operations team to do the fix. The Operations team had the access. The engineering team had the context. The handoff was where time got lost. There were also cases where the engineering team could not see the logs they needed, because the service that produced those logs was not part of the access they had been granted. Those are the worst kind of incidents: the kind where the system is broken and the person who can fix it cannot see why.

I am not blaming the Operations team at Company B. They were good at their job. The model itself just makes the handoff a bottleneck during an incident, and bottlenecks during incidents are expensive.

What I think both models are trying to do

I think both companies are trying to solve the same problem, just with different priorities.

Company A is trying to give the team enough rope to move fast, and is trusting the team not to hang themselves. The safety net is the PR review, the git log, and the fact that staging is a safe place to break things before prod. The bet is that a team of adults, with a small amount of process, will make better decisions than a central team that does not know the context.

Company B is trying to make sure that nobody, including a careless engineer, can do too much damage. The safety net is the ticket, the Operations team, and the principle of least privilege. The bet is that the cost of moving a bit slower is worth it, because the cost of one bad change is high, and the central team is a good filter for bad changes.

Both bets can be right. It depends on what kind of company you are, what kind of data you hold, and what kind of engineers you hire. I do not think there is a universal answer here.

What I would want in a perfect world

If I am honest, the access model I want is closer to Company A, with a few specific things from Company B mixed in. But I have not worked out the exact shape yet. I will write more about this when I have a clearer picture. For now, I just want to put both models on the table, side by side, and let the tradeoffs speak for themselves.