《开源日报》(Open Source Daily) recommends one high-quality GitHub open source project and one selected English tech or programming article every day. Keep reading Open Source Daily and make daily learning a habit.
Today's recommended open source project: "Rocket Launch Info: SpaceX-API"
Today's recommended English article: "So You Broke Everything. Now What?"
Open Source Daily, Issue 792: "Rocket Launch Info: SpaceX-API"
Today's recommended open source project: "Rocket Launch Info: SpaceX-API" (portal: GitHub link)
Why we recommend it: r/SpaceX is a community organized by SpaceX fans, and this project is the API they put together. It covers essentially all of the rocket data since SpaceX was founded (though it does not include instructions for building a rocket).
Today's recommended English article: "So You Broke Everything. Now What?" by Steven Popovich
Original link: https://medium.com/better-programming/so-you-broke-everything-now-what-2461c34b0a97
Why we recommend it: how to face a sudden bug and an unhappy client.

So You Broke Everything. Now What?

How to handle on-call. Don’t worry, everything will be fine

(Exactly. Photo by Jasmin Sessler on Unsplash)
So shit hit the fan and you are out of toilet paper. And the fan is still blowing. And the plug is welded into the wall. And everyone knows it was you. Now what?

Well, as the roll suggests, the first thing you need to do is not panic. These things happen. Your job as an ethical, professional engineer is not to be perfect, it’s to try. This is how.


Let’s set up a scenario. You built a new system to send messages between your users. But, uhhh, it’s not doing that. It’s doing nothing, actually. People send messages to it and then they don’t get anywhere. Oh boy.

It was working for a week, but all of a sudden, it stopped. You were paged, and you tested the system to confirm that messages are not being sent between users. At least your monitoring system works…

So, what to do?

Seriously, Remain Calm

Do what you have to do to remain calm. Go for a walk, listen to music, do some deep breathing. I wouldn’t suggest taking a shot, but hey, no judgment. This is partly about appearances, but staying calm will also genuinely help you resolve the problem.

I know it sounds like a cliche, but we all think better when we’re calm. And others around you will be calm, too. I know, it’s scary to break something customer-facing, but it’s not the end of the world, and it won’t be the end of your career if you handle it right.

Blame Should Not Exist

Well, when it is your fault, it’s okay to say “my bad.” But blame should not exist. Certainly, do not blame others.

When addressing the issue or incident at hand, say “we made this decision” as opposed to “they made this decision.” Use us and I instead of him, her, or you.

This language is very important to keep your team focused on fixing the problem at hand. Think about it — what does saying “so-and-so did xyz” do to get customers back to a usable state?

Blame doesn’t even make sense

Yeah, really. Think about that. When you assign fault to someone else, all you are doing is making yourself feel better. That’s it. I’ll explain.

You built a messaging system and now it is broken. Was this your intention? Did you do everything in your power to make sure the system would work? Did you tick all the boxes to deliver the best possible system you could?

If yes, and let’s assume everyone does that, then how can you be blamed? Sure, you built the system, and it’s broken. Shit happens! That’s how the cookie crumbles.

Cars break down. Lawnmowers stop working. Sometimes your poop doesn’t flush. Should we hang the people who built and designed these systems? I think not.

So don’t blame. It only serves to create tension, slow down the restoration of your system, and swell egos.

The software will always work itself out. Human relationships don’t always.

Yes, I acknowledge that there are truly bad actors in the world. Sometimes people are lazy. Sometimes they don’t do the best job. But I refuse to operate on that premise. Trust is a two-way street.

Don’t Stop Until the Problem Goes Away

Okay, time to practice what you preach. Remember when I said that everyone tries their best? Now is the time to demonstrate it. Persistently investigate the problem and work towards a solution.

Don’t be afraid to get help. You really are going to have to swallow your pride on this. There is nothing wrong with getting help when you need it. Trust me, your job is in much greater jeopardy if you try to quietly fix a customer-facing issue in the corner by yourself than if you just announce the issue and get it fixed faster.

Remember, doing the best job means getting the issue fixed as soon as possible. Getting the issue fixed as soon as possible means getting help when you need it.

And get the right help

Now even though we aren’t blaming people, that doesn’t mean we can’t get help from people with context.

Let’s say your co-worker Sammy actually built the messaging system, but you are on call so you are the one who gets paged for it. It is totally okay to ask Sammy for help. They have the most context on how to help.

Don’t say “Hey, uh Sammy, your shit broke-ed.” Ugh, I cringe even writing that. Say, “Hey Sammy, it looks like user messaging is broken. I have no clue what’s going wrong — you have a second to help me look?”

This is you doing your best job. Bringing in Sammy is the smart thing to do to mitigate customer impact as soon as possible.

Keep Your Focus

At all stages when dealing with a problem, keep in mind your number one job: Mitigate customer impact.

Say it with me: Mitigate customer impact. As fast as possible. Without regard to a root cause.

This makes a lot of developers cringe. It is at the core of our nature to try to understand the root cause of things. It’s literally what we do every day and it’s human nature.

Fight this urge! It is so important. During the time of an incident or broken system, you need to stay focused on getting the customers back to an operational state.

So your messaging is not working. You go to your cloud provider’s (AWS, DigitalOcean, the like) boxes that are supposed to have containers on them. The containers that are supposed to be doing the work of sending messages back and forth between users are dead.

Why, though?

What happened to cause these containers to die? I guess the processes on them could stop, or maybe they ran out of memory, or I guess—Stop.

If you are supposed to have running containers for your system to work, then shouldn’t restarting or recreating new containers get your system to a functional state? You can and should figure out what caused the containers to die—later. Right now, spin up new containers and get your system back up and functional.
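In shell terms, the "restore first, diagnose later" step might look like this. This is a minimal sketch assuming (hypothetically) that the messaging workers run as a Kubernetes Deployment named messaging-worker; substitute whatever your system actually uses.

```shell
# Hypothetical Deployment and label names; substitute your own.
# See which messaging pods are dead.
kubectl get pods -l app=messaging-worker

# Mitigate customer impact first: recreate the containers.
kubectl rollout restart deployment/messaging-worker
kubectl rollout status deployment/messaging-worker

# Root-cause later: logs from the crashed containers are still there.
kubectl logs -l app=messaging-worker --previous
```

The point is the ordering: the restart goes out, and only once customers are back to a working state does anyone open the crash logs.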

Have a Backup Plan

Finally, I want to talk about something people seem to miss in dev ops. Part of doing the best job you can is realizing that things break and preparing for that eventuality.

When doing something new, have a way to quickly roll it back. At my company, we have various in-house tools for managing our boxes and deployment images so if we roll out something that doesn’t work, we can roll back quickly.

Or maybe you are making a configuration change. I hope you use something like Terraform so you can put configuration changes into something with history — Git, in the case of Terraform. Then you can easily roll back configuration changes and keep track of when and why you make certain changes.
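As a sketch of that workflow, assuming the configuration lives in a Git-tracked Terraform repo (file and plan names hypothetical), the rollback is a revert rather than a hand-edit under pressure:

```shell
# Undo the bad configuration change as a new commit,
# so history records both the change and its reversal.
git revert --no-edit HEAD

# Preview exactly what rolling back will touch, then apply it.
terraform plan -out rollback.tfplan
terraform apply rollback.tfplan
```

Because the revert is itself a commit, the repo keeps a record of when the change went out, when it came back, and (via the commit message) why.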

This isn’t always possible, though. With our new messaging system, we couldn’t roll back to an old, functioning system. So in cases like this, just have your system heavily documented and communicated. Maybe roll out the system in phases, and hopefully, you use staging environments.


Alright, that’s it for today. Don’t freak out when something you wrote breaks. If you write code for any period of time in any capacity, it will happen. There’s a 100% chance of that.

What determines what kind of developer and teammate you are is how you handle it.

Thanks for reading.


Download the Open Source Daily app: https://openingsource.org/2579/
Join us: https://openingsource.org/about/join/
Follow us: https://openingsource.org/about/love/