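Open Source Daily recommends one high-quality open-source GitHub project and one selected English-language tech or programming article every day. Keep reading Open Source Daily to maintain a good habit of daily learning.
Open Source Daily, issue 792: "Rocket launch data: SpaceX-API"
Today's recommended open-source project: "Rocket launch data: SpaceX-API" (GitHub link)
Why we recommend it: r/SpaceX is a community organized by SpaceX fans on their own initiative, and this project is the API they have compiled. It covers essentially all rocket data since SpaceX was founded (though it does not include instructions for building a rocket).
Today's recommended English article: "So You Broke Everything. Now What?" by Steven Popovich
Original link: https://medium.com/better-programming/so-you-broke-everything-now-what-2461c34b0a97
Why we recommend it: how to deal with sudden bugs and sudden clients.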

So You Broke Everything. Now What?

How to handle on-call. Don't worry, everything will be fine

(Exactly. Photo by Jasmin Sessler on Unsplash)
So shit hit the fan and you are out of toilet paper. And the fan is still blowing. And the plug welded into the wall. And everyone knows it was you. Now what?

Well, as the roll suggests, the first thing you need to do is not panic. These things happen. Your job as an ethical, professional engineer is not to be perfect, it's to try. This is how.


Let's set up a scenario. You built a new system to send messages between your users. But, uhhh, it's not doing that. It's doing nothing, actually. People send messages to it and then they don't get anywhere. Oh boy.

It was working for a week, but all of a sudden, it stopped. You were paged, and when you tested the system you saw that messages were not being sent between users. At least your monitoring system works…

So, what to do?

Seriously, Remain Calm

Do what you have to do to remain calm. Go for a walk, listen to music, do some deep breathing. I wouldn't suggest taking a shot, but hey, no judgment. This is partly for appearances, and it will also literally help you resolve the problem.

I know it sounds like a cliche, but we all think better when we're calm. And others around you will be calm, too. I know, it's scary to break something customer-facing, but it's not the end of the world, and it won't be the end of your career if you handle it right.

Blame Should Not Exist

Well, when it is your fault, it's okay to say "my bad." But blame should not exist. Certainly, do not blame others.

When addressing the issue or incident at hand, say "we made this decision" as opposed to "they made this decision." Use "us" and "I" instead of "him," "her," or "you."

This language is very important to keep your team focused on fixing the problem at hand. Think about it: what does saying "so-and-so did xyz" do to get customers back to a usable state?

Blame doesn't even make sense

Yeah, really. Think about that. When you assign fault to someone else, all you are doing is making yourself feel better. That's it. I'll explain.

You built a messaging system and now it is broken. Was this your intention? Did you do everything in your power to make sure the system would work? Did you tick all the boxes to deliver the best possible system you could?

If yes, and let's assume everyone does that, then how can you be blamed? Sure, you built the system, and it's broken. Shit happens! That's how the cookie crumbles.

Cars break down. Lawnmowers stop working. Sometimes your poop doesn’t flush. Should we hang the people that built and designed these systems? I think not.

So don't blame. It only serves to create bad tensions, slow down the restoration of your system, and swell egos.

The software will always work itself out. Human relationships don't always.

Yes, I acknowledge that there are truly bad actors in the world. Sometimes people are lazy. Sometimes they don't do the best job. But I refuse to operate on that premise. Trust is a two-way street.

Don't Stop Until the Problem Goes Away

Okay, time to practice what you preach. Remember when I said that everyone tries their best? Now is the time to demonstrate it. Persistently investigate the problem and work towards a solution.

Don't be afraid to get help. You really are going to have to swallow your pride on this. There is nothing wrong with getting help when you need it. Trust me, your job is in much greater jeopardy if you try to quietly fix a customer-facing issue in the corner by yourself than if you just announce the issue and get it fixed faster.

Remember, doing the best job means getting the issue fixed as soon as possible. Getting the issue fixed as soon as possible means getting help when you need it.

And get the right help

Now even though we aren't blaming people, that doesn't mean we can't get help from people with context.

Let's say your co-worker Sammy actually built the messaging system, but you are on call so you are the one who gets paged for it. It is totally okay to ask Sammy for help. They have the most context on how to help.

Don't say "Hey, uh Sammy, your shit broke-ed." Ugh, I cringe even writing that. Say, "Hey Sammy, it looks like user messaging is broken. I have no clue what's going wrong. You have a second to help me look?"

This is you doing your best job. Bringing in Sammy is the smart thing to do to mitigate customer impact as soon as possible.

Keep Your Focus

At all stages when dealing with a problem, keep in mind your number one job: Mitigate customer impact.

Say it with me: Mitigate customer impact. As fast as possible. Without regard to a root cause.

This makes a lot of developers cringe. It is at the core of our nature to try to understand the root cause of things. It's literally what we do every day and it's human nature.

Fight this urge! It is so important. During the time of an incident or broken system, you need to stay focused on getting the customers back to an operational state.

So your messaging is not working. You go to the cloud provider's (AWS, DigitalOcean, and the like) boxes that are supposed to have containers on them. The containers that are supposed to be doing the work of sending messages back and forth between users are dead.

Why, though?

What happened to cause these containers to die? I guess the processes on them could stop, or maybe they ran out of memory, or I guess—Stop.

If you are supposed to have running containers for your system to work, then shouldn't restarting or recreating new containers get your system to a functional state? You can and should figure out what caused the containers to die, but do it later. Right now, spin up new containers and get your system back up and functional.
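
Here is a minimal Python sketch of that "mitigate first, investigate later" order. Everything in it is an assumption for illustration: the Container class, the mitigate function, the worker names, and the sample log line are all made up, and the "restart" is just a flag flip where a real system would call whatever orchestrator you actually use. The only point is the ordering: save the evidence, bring the workers back, and leave the "why" for after customers are unblocked.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Container:
    """Hypothetical stand-in for a worker container (not a real cloud API)."""
    name: str
    alive: bool
    last_log_lines: List[str] = field(default_factory=list)


def mitigate(containers: List[Container]) -> List[Dict[str, object]]:
    """Bring dead containers back and return evidence for a later review."""
    evidence: List[Dict[str, object]] = []
    for container in containers:
        if container.alive:
            continue
        # Keep a copy of what we know so the root cause can be studied later.
        evidence.append(
            {"name": container.name, "logs": list(container.last_log_lines)}
        )
        # Assumption: a real system would ask its orchestrator to recreate
        # the container here. This sketch only flips the flag.
        container.alive = True
        container.last_log_lines = []
    return evidence


# Made-up fleet: one healthy worker, one dead worker with an invented log line.
fleet = [
    Container("msg-worker-1", alive=True),
    Container("msg-worker-2", alive=False, last_log_lines=["out of memory"]),
]
saved_for_later = mitigate(fleet)
```

After this runs, every worker in fleet is marked alive again, and saved_for_later holds the dead worker's name and logs for whenever you get around to the root cause.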

Have a Backup Plan

Finally, I want to talk about something people seem to miss in DevOps. Part of doing the best job you can is realizing that things break and preparing for that eventuality.

When doing something new, have a way to quickly roll it back. At my company, we have various in-house tools for managing our boxes and deployment images so if we roll out something that doesn't work, we can roll back quickly.

Or maybe you are making a configuration change. I hope you use something like Terraform, so your configuration lives in files you can keep under version control with history (Git, for example). Then you can easily roll back configuration changes and keep track of when and why you made certain changes.
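
To make the "history" idea concrete, here is a toy Python sketch. It is not Terraform and it is not Git. The ConfigHistory class, the Revision record, and the worker_count setting are all hypothetical names invented for this example. It only models the two properties that matter: every change is recorded with a reason, and a rollback is itself a recorded change rather than a silent edit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Revision:
    """One recorded configuration state and the reason it was created."""
    config: Dict[str, str]
    reason: str


class ConfigHistory:
    """Hypothetical in-memory history; a real setup would use version control."""

    def __init__(self, initial: Dict[str, str]) -> None:
        self._revisions: List[Revision] = [Revision(dict(initial), "initial")]

    @property
    def current(self) -> Dict[str, str]:
        # Return a copy so callers cannot edit history by accident.
        return dict(self._revisions[-1].config)

    def apply(self, changes: Dict[str, str], reason: str) -> None:
        # Merge the changes over the latest config and record why.
        merged = {**self._revisions[-1].config, **changes}
        self._revisions.append(Revision(merged, reason))

    def rollback(self, reason: str) -> None:
        # Undo only the most recent change by re-recording the state before it.
        if len(self._revisions) < 2:
            raise RuntimeError("nothing to roll back to")
        previous = self._revisions[-2]
        self._revisions.append(
            Revision(dict(previous.config), f"rollback: {reason}")
        )

    def log(self) -> List[str]:
        return [revision.reason for revision in self._revisions]


history = ConfigHistory({"worker_count": "2"})
history.apply({"worker_count": "8"}, reason="scale up message workers")
history.rollback(reason="messages stopped flowing")
```

The rollback leaves the bad change in the log instead of erasing it, so later you can still see when it went in and why it came out.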

This isn't always possible, though. With our new messaging system, we couldn't roll back to an old, functioning system. So in cases like this, just have your system heavily documented and communicated. Maybe roll out the system in phases, and hopefully, you use staging environments.
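
One common way to roll out in phases is to put each user in a stable bucket and only switch on the new system for buckets below a percentage you control. The Python sketch below shows that idea under stated assumptions: the in_rollout function and the user IDs are hypothetical, and this is one possible approach, not how the messaging system in this story was rolled out.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of buckets."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash the ID so each user lands in the same bucket (0-99) on every call.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


# Made-up user IDs, purely for illustration.
users = [f"user-{n}" for n in range(1000)]
phase_one = [u for u in users if in_rollout(u, 10)]
phase_two = [u for u in users if in_rollout(u, 50)]
```

Because the bucket depends only on the user ID, raising the percentage only adds users. Nobody who already has the new system drops out of it between phases, and setting the percentage to 0 turns it off for everyone.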


Alright, that's it for today. Don't freak out when something you wrote breaks. If you write code for any period of time in any capacity, it will happen. There's a 100% chance of that.

What determines what kind of developer and teammate you are is how you handle it.

Thanks for reading.


Download the Open Source Daily app: https://openingsource.org/2579/
Join us: https://openingsource.org/about/join/
Follow us: https://openingsource.org/about/love/