Today’s recommended English original, by Bogdan Litescu:
Debugging — A Philosophical Approach
A Definition
You might ask what philosophy has to do with debugging. I feel that’s a good clarification to start with. In common usage, philosophy has come to mean talking about the nature of things in general and, in particular, about the existence and purpose of life. This is a part of philosophy that most people face at some point or another in life.
But there’s another big part of it which is the glue that transcends all sciences. It’s about principles, methodologies, reasoning and understanding. The great philosophers in human history were, at the same time, mathematicians, physicists, logicians, historians and so on. The highest education degree one can get is a PhD, which stands for Doctor of Philosophy. It basically means that someone has reached a level where he or she can extrapolate principles and methods to apply to ideas in order to produce new research.
I derived the philosophical approach to debugging from being a programmer for more than a decade and from exploring human psychology, mostly in relation to myself. I believe these are two fields where debugging is vital to success. Now that’s a big word. For the sake of precision, in this article I assume that success consists of understanding the nature of things as a means to implement systems that produce desired effects.
Another distinction I’d like to make from the start is related to diagnosis. In the medical sciences, in aviation and in the car industry, among others, people have long been investigating why something doesn’t work the way it should and have put systems in place to easily identify when a failure occurs. What differentiates these fields from computer programming or psychology is that the systems they deal with change at a much slower rate and have a finite number of interactions between components. Fully describing everything at a certain point and keeping the description relevant over time is feasible.
In computer science and psychology, it’s much more important to learn methods and principles, as the facts will vary wildly from individual to individual and from group to group. Both deal with systems that change rapidly over time. What is true today could easily be obsolete in a few weeks.
From this perspective, I call debugging the set of principles and methodologies we can use to understand a specific behavior or state of an individual or system and describe it. Only when the debugging activity is complete can solutions and corrections be coherently considered and applied.
Unwanted Behaviors
What triggers the debugging process is unwanted behavior. We notice these behaviors all around and within us. Sometimes we have systems in place that tell us that an unwanted behavior has occurred. This was the case with one of our websites at some point. It was both programmatically tested and observed through personal experience that the performance of the website was sometimes really poor. After struggling to figure it out for a few months, my team called upon me to assist. Although it has been quite a few years since I stopped coding, the universal principles and methods of debugging that I’ve been using are still valid. I will come back to this example as I explain various principles and methods I use.
In Pursuit of the Truth
Before going any further, let’s talk about motivation. Debugging is about finding the truth. I’ve seen programmers randomly applying fixes until the behavior was corrected and stopping there. They failed to find the truth. There is a byproduct of the debugging process that is just as important as identifying and describing the root cause. That is the process of constantly improving ourselves. If we don’t understand what the problem was and why our change fixed it, we haven’t learnt anything. We are likely to produce the same wrong behavior over and over again and fix it every time. That’s one way to keep ourselves busy.
I find behavior to be the number one source of truths. By reading code or literature, or by listening to what people say, one might get deceived. We all make claims about how we’d act in a given situation, only to realize that we act completely differently when we actually face that situation. Though our behavior is a source of truths, it is not the truth. In order to find it, we must start debugging.
Hypotheses and Assumptions
Once a behavior has been found to be unwanted, people will start giving opinions as to why it occurs. This is a healthy discussion, but one thing to watch for is people expressing personal opinions as convictions, without verifiable arguments to sustain them. Now is the time to clarify whether the opinion is in fact a hypothesis based on assumptions that haven’t been verified yet. That’s very easy to clarify. Just ask for arguments and what sustains them.
Going back to the example of the slow website, the general conviction in my team was that we needed to increase the number of CPUs on the server. When I asked why, they didn’t have any verifiable arguments. Increasing the CPUs might have solved the effects of the problem, at least for a while. But since our software is used by thousands of companies, I encouraged the team to find out what was causing the behavior. In the worst case, we would have found something that genuinely requires more CPU, and even then it would still be useful to inform clients.
Causality
So now we had two opinions:
- The server had too few resources.
- There was a defect or a resource-intensive feature in our software.
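Establishing causality between the various hypotheses avoids wasting time investigating effects and makes it much clearer where the debugging process should start. Here, a defect in our software could be the very thing that made the server look short on resources, so the second hypothesis was the one to investigate first.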
Observation Tools
Once the starting point has been decided, it’s time to get insights. Despite all of society’s advances, observation remains one of the most important instruments we can use to understand how things work. But in a complex system, observing everything is not feasible. Therefore, observation works like Google Maps. You first observe the big picture and then zoom in on areas of interest for more specific observations.
Luckily, there’s a multitude of tools that we can use to make the observation process much faster and more precise. When I started public speaking, I used to record videos of myself and then identify areas I could act on to improve. That’s an easy example. In real life, it’s a bit more difficult to put tools in place to observe our personal behavior. There are various gadgets that help to some extent. And there’s always a generic tool we can use, which is being conscious of one’s self. This is a skill that can be practiced through meditation and other techniques.
Validating Assumptions
Going back to the slow website, we deployed a generic tool on the live site that recorded everything that was happening. We found that there were around 3 million database queries happening daily. That’s quite a lot more than we expected. However, we now had a metric that could validate other assumptions later.
During the debugging process we will find things that are objectively true. It could be that we have a tool to accurately measure something directly related to the behavior. Or, on a psychological level, it could be an external evaluation such as asking our life partners whether our unwanted behavior has changed in the past week.
We can then use these truths to validate assumptions. As we go deeper into the debugging process, a lot of what-ifs will arise. We can then modify variables and observe how they affect the truths.
Asking the Right Questions
Observations collect data. However, data itself is useless without understanding it. Another great principle that’s universally available is to question the data. There is a very popular framework called the 5 Whys. I find this works, but it’s not as easy as it sounds. Why were there 3 million database queries per day? Before we can answer this question we have to answer several others, such as: what are those queries, and how do we measure them?
On our slow website, looking at 3 million queries one by one was not feasible. So we went in and deployed additional tools that counted how many times different queries had run. We found that most of them were supposed to be cached in memory. Only now can we get to the next why. So, why is the cache not working sometimes?
But, just to point out, this last question is based on the assumption that the cache was the issue. This assumption is not validated yet. So, in the 5 Whys framework, one might sometimes work down a branch of Whys that are based on assumptions. If at some point an assumption is invalidated, one has to go back one or more levels to previous Whys and find different assumptions.
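The query-counting step described above can be sketched in a few lines. This is only an illustration, not the tool we actually deployed: the sample log, the `normalize` helper and the idea of replacing literal values with `?` are all assumptions made for the example.

```python
import re
from collections import Counter


def normalize(query: str) -> str:
    """Collapse literal values so queries that differ only in parameters group together."""
    q = re.sub(r"'[^']*'", "?", query)   # replace quoted string literals
    q = re.sub(r"\b\d+\b", "?", q)       # replace standalone numeric literals
    return re.sub(r"\s+", " ", q).strip().lower()


def count_queries(queries):
    """Count how many times each query shape appears in the log."""
    return Counter(normalize(q) for q in queries)


# Hypothetical log lines, standing in for what a recording tool might capture.
sample_log = [
    "SELECT * FROM settings WHERE id = 1",
    "SELECT * FROM settings WHERE id = 2",
    "SELECT name FROM users WHERE email = 'a@example.com'",
    "SELECT * FROM settings  WHERE id = 3",
]

counts = count_queries(sample_log)
for shape, n in counts.most_common():
    print(n, shape)
```

Sorting by frequency turns millions of rows into a short list of query shapes, which is what makes the next why possible to ask.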
Build Your Own Tools
As you get deep into debugging, you might find the need to create your own tools to accurately observe or measure a component of the behavior. These tools don’t have to be revolutionary. In fact, people usually build specific tools from generic methods. For example, you can compose a questionnaire and ask your colleagues to fill it in. Or you can note down how many times a day you perform a habit that you want to get rid of.
In my slow website example, we built a simple tool that displayed everything stored in cache.
Reproducible Behavior
Never underestimate the power of chance. During the debugging process, an assumption may fail to validate simply because a certain condition was not met during the experiment.
Therefore, in our pursuit of truth, we have to identify all components that cause the behavior. Only when we can reproduce it exactly can we act repeatedly on it to advance our understanding. Otherwise, we’re left to chance to validate or invalidate assumptions. This leads to another important method that’s been universally available since the dawn of humankind, the experiment.
In our slow website example, what made the problem so difficult to track down was that it appeared to happen randomly. We didn’t know an essential condition that triggered it. To make it even worse, it didn’t happen when we tried to reproduce it on a clone of the website. Therefore, we decided to use the tool to generate reports continuously to catch the moments when it did happen. This increased our chances to observe the faulty behavior from the caching perspective.
There are two facts we found when using the custom-built tools:
- There were some caches that had names we didn’t expect.
- The cache indeed seemed to randomly clear itself.
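A continuous report like this can be as simple as comparing successive snapshots of the cache key names. The sketch below is a hypothetical reconstruction, not our actual tool: the key names, the `expected_prefixes` convention and the rule "every previous key is gone" as the sign of a clear are assumptions chosen for illustration.

```python
def compare_snapshots(previous, current, expected_prefixes):
    """Return findings from two successive snapshots of cache key names."""
    findings = []

    # Keys whose names do not start with any prefix we expect to see.
    unexpected = sorted(
        key for key in current if not key.startswith(tuple(expected_prefixes))
    )
    if unexpected:
        findings.append(("unexpected_names", unexpected))

    # If every key from the previous snapshot is gone, treat it as a full clear.
    vanished = sorted(set(previous) - set(current))
    if previous and len(vanished) == len(set(previous)):
        findings.append(("cache_cleared", vanished))

    return findings


# Hypothetical snapshots taken a few minutes apart.
snapshot_before = {"page:/home", "page:/pricing", "menu:main"}
snapshot_after = {"tmp_7f3a"}

report = compare_snapshots(snapshot_before, snapshot_after, ["page:", "menu:"])
for kind, keys in report:
    print(kind, keys)
```

Running such a comparison on a schedule records when the behavior happened, even if nobody was watching at that moment.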
Reducing Complexity
Once an unwanted behavior is reproducible, one can eliminate parts until the minimum number of components that still produce the problem is found. Reducing complexity removes noise that could hinder our investigation or slow down experiments. I remember a time when the tools for computer debugging were rudimentary. It was popular at that time for software engineers to remove parts of the code until the issue no longer reproduced. Then, they knew that the problem had to be somewhere in the part that was last removed.
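The elimination idea above can be sketched as a greedy loop: drop one component at a time and keep it dropped only if the problem still reproduces. This is a minimal sketch that assumes the failure is deterministic and that adding components never hides it. The component names and the `still_fails` check are hypothetical, with the check standing in for a real reproduction experiment.

```python
def minimize(components, still_fails):
    """Greedily drop components while the failure still reproduces."""
    kept = list(components)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if still_fails(candidate):
            kept = candidate   # component i was noise, leave it out
        else:
            i += 1             # component i is needed to reproduce, keep it
    return kept


# Hypothetical parts of a system under investigation.
components = ["url_rewriter", "theme", "cache", "analytics", "internal_site"]


def still_fails(subset):
    """Pretend experiment: the problem needs these three parts together."""
    return {"url_rewriter", "cache", "internal_site"} <= set(subset)


minimal = minimize(components, still_fails)
print(minimal)
```

Each call to `still_fails` is one experiment, so the loop costs at most one experiment per component, which matters when reproducing the behavior is slow.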
Going back to the slow website, fact 1 was easier to address. So, we decided to act in that direction, as it would both fix a defect and answer the question of whether fact 2 was related. In the end, it was not.
Tracing Last Steps
The missing essential condition hides in something that happened before the behavior was observed through its effects. Perhaps the most familiar example is forgetting where the car keys are. After looking in all the places I’d expect them to be, I try to remember and even reproduce the last steps I took after entering the house earlier that day, only to remember that I was in a hurry to get to the bathroom. So, I find them on the sink.
In computer programming, the record of the last steps is called a stack trace. So we built ourselves another little tool that would monitor the cache and collect the stack trace at the exact moment the cache was cleared. Going back through the steps, we finally found the root cause. That website actually hosted 3 websites. The other two were internal to us. Only when someone accessed the internal websites would the URL rewriter wrongly look at the cache of the main website. Because it didn’t match what was expected, the URL rewriter would clear the cache for all 3 websites. This fully explained the randomness of the behavior, as the internal websites were used only a few times a day.
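Capturing a stack trace at the moment a cache is cleared can be sketched as follows. This is an illustration only and not the tool we built: `TracedCache`, `rewrite_url` and the host names are hypothetical, and a plain Python dictionary stands in for the real cache.

```python
import traceback


class TracedCache(dict):
    """A dict-backed cache that remembers who cleared it."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.clear_events = []

    def clear(self):
        # Record the call stack before the data disappears.
        # The last frame is this method itself, so it is dropped.
        self.clear_events.append({
            "entries_lost": len(self),
            "stack": traceback.format_stack()[:-1],
        })
        super().clear()


def rewrite_url(cache, host):
    """Hypothetical stand-in for a caller that clears the cache on a mismatch."""
    if host not in cache.get("known_hosts", ()):
        cache.clear()


cache = TracedCache(known_hosts=("www.example.com",), page="<html>")
rewrite_url(cache, "www.example.com")       # matches, cache untouched
rewrite_url(cache, "internal.example.com")  # mismatch, cache cleared

for event in cache.clear_events:
    print("entries lost:", event["entries_lost"])
    print("".join(event["stack"]))
```

The last frame of the recorded stack names the caller responsible for the clear, which is the kind of "last step" that is otherwise invisible once the effect is observed.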
Getting Unstuck
Some unwanted behaviors will have you pulling your hair out trying to get to the bottom of them. At times, you might find yourself stuck with no other assumption to validate. Do not despair. What I found to work is to get someone to help and put everything on the table. Debugging also involves a great deal of brainstorming, building logical arguments and thinking creatively about approaching the behavior from other perspectives.
One thing I found to be mostly true is that it’s usually small things that cause the behaviors that are difficult to debug. And it makes sense. If it were something big, it would have been much easier to make an assumption and validate it.
In this regard, there’s another piece of advice that can work wonders: sleep on it.
Validated Knowledge
Once you get to the bottom of the unwanted behavior you can move towards addressing it. That might be a major effort as well, but it’s beyond the scope of this article. What I’d like to emphasize here is the byproduct of the debugging process: new methods and tools you can use in the future to approach a similar situation. And it’s not something you’ve read about in a book, but something that you lived first hand. You went in pursuit of the truth and found it. It creates emotions and confidence.
This kind of knowledge, being validated by your own experience, rewrites connections in the brain. It develops our intuition and makes us better at debugging.