On Processes for Solving Problems Big and Small
Your first problem is how to solve a problem.
We are awash in people / organizations sharing their ideas to solve big societal conundrums. But for each case, is the problem itself even the real problem? And does the proposed solution have evidence to support it, and if it does—is that data accurate, is it cherry-picked, will it work in this context? Should the solution be multi-pronged?
If only there were processes to solve problems, or in terms of politicians’ favorite metaphor: fight the problem.
Well, there are. In fact, you’ve probably used such processes yourself on occasion without even realizing it.
Most technology workers have used problem-solving processes, be they dictated by the company or just whatever they came up with that works repeatedly. Another clear-cut example is vehicle mechanics: you receive some possibly unreliable data about a problem (“my car is making a funny noise”), you investigate, maybe just use a scanner and it tells you the cause, and you choose the fix that usually works for that cause.
I’m going to describe a lean manufacturing process called 8D (8 Disciplines). It’s not the only way, but 8D is a useful one to go over because it’s kind of a superset of what goes into any problem solving approach. I learned it in a training course when I worked for iRobot.
Here are the eight steps of 8D:
- Assemble the team
- Define the problem
- Contain the problem
- Root cause analysis
- Choose the permanent solution
- Implement the solution and verify it
- Prevent recurrence
- Congratulate the team
1. Assemble the Team
With an initial, rough concept of the problem, a team should be assembled to execute the 8D steps. They don’t have a real problem definition yet; that will happen in step 2. The size of an 8D team (at least in companies) is typically 5 to 7 people.
The team must have a leader; this leader makes agendas, synchronizes actions and communications, resolves conflicts, etc.
The rest of the team is assembled as appropriate—some general rules for a candidate:
- Unique point of view.
- Logistically able to coordinate with the rest of the team.
- Not committed to preconceived notions of “the answer.”
- Can actually accomplish change that they might be responsible for.
The team should be justified—somebody is paying for the team, be it a company or tax money. How will the team defend themselves when people ask, “Why should we care?”
2. Define the Problem
“The fledgling problem solver invariably rushes in with solutions before taking time to define the problem being solved.” (D.C. Gause & G.M. Weinberg, Are Your Lights On?)
The team will make an initial problem statement without presupposing a solution. They should attempt to define the “gap” (or error)—the big difference between the current problematic situation and the potential fixed situation. The team members should all be interested in closing this gap.
The initial concept or report of a problem could be wildly incorrect.
Let’s say somebody throws my robot out of an airplane, and it immediately falls to the ground and breaks into several pieces. This customer then informs me that this robot has a major problem when flying after being dropped from a plane and that I should improve the flying software to fix it.
Here is the mistake: The problem has not been properly defined. The robot is a ground robot and was not intended to fly or be dropped out of a plane. The real problem is that a customer has been misinformed as to the purpose and use of the product.
When thinking about how to improve something, you should consider: Have you made an assumption about the issue that might be obscuring the true problem? Did the problem emerge from a process that was working fine before? What processes will be impacted? If this is an improvement, can it be measured, and what is the expected goal?
The team should attempt to grok the issues and their magnitude. Ideally, they will be informed with data, not just opinions.
Just as with medical diagnosis, the symptoms alone are probably not enough input. There are various ways to collect more data, and which methods you use depends on the nature of the problem. For example, one method is the 5 W’s and 2 H’s:
- Who is affected?
- What is happening?
- When does it occur?
- Where does it happen?
- Why is it happening (initial understanding)?
- How is it happening?
- How many are affected?
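One way to keep a team honest about these questions is to write the answers down in a fixed structure and see which ones are still blank. The Python sketch below is only an illustration: the `ProblemStatement` class, its field names, and the sample answers (loosely based on the robot story above) are all made up for this example and are not part of 8D itself.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ProblemStatement:
    """Hypothetical record of the 5 W's and 2 H's for one problem."""
    who: Optional[str] = None        # Who is affected?
    what: Optional[str] = None       # What is happening?
    when: Optional[str] = None       # When does it occur?
    where: Optional[str] = None      # Where does it happen?
    why: Optional[str] = None        # Why is it happening (initial understanding)?
    how: Optional[str] = None        # How is it happening?
    how_many: Optional[str] = None   # How many are affected?

    def unanswered(self) -> list[str]:
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Made-up answers; a real team would fill these in from collected data.
statement = ProblemStatement(
    who="one customer",
    what="robot broke apart after being dropped from an airplane",
    where="outside the intended ground environment",
)
print("Still to investigate:", statement.unanswered())
```

A list of unanswered questions is a cheap signal that the team is not yet ready to argue about solutions.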
The stakes can be high in a corporation and depend on the definition and context of the problem. Now imagine grand humanity-affecting problems and how much is at stake there. Stakeholders, customers, maybe even politicians and taxpayers, set-in-their-ways problem-solvers—all of these and more may come in blazing with their preferred solution prematurely.
“Don’t mistake a solution method for a problem definition—especially if it’s your own solution method.” (D.C. Gause & G.M. Weinberg, Are Your Lights On?)
3. Contain the Problem
Some problems are urgent, and a stopgap must be put in place while the problem is being analyzed. This is particularly relevant for problems such as product defects which affect customers.
Some brainstorming questions are:
- Can anything be done to mitigate the negative impact (if any) that is happening?
- Who would have to be involved with that mitigation?
- How will the team know that the containment action worked?
Before deploying an interim expedient, the team should have asked and answered these questions (they essentially define the containment action):
- Who will do it?
- What is the task?
- When will it be accomplished?
A canonical example: You have a leaky roof (the problem). The containment action is to put a pail underneath the hole to capture the leaking water. This is a temporary fix that mitigates damage to the floor until the roof is properly repaired.
Don’t let the bucket of water example fool you—containment can be massive, e.g. government bailouts. Of course, the team must choose carefully: Is the cost of containment worth it?
4. Root Cause Analysis
Whenever you think you have an answer to a problem, ask yourself: Have you gone deep enough? Or is there another layer below? If you implement a fix, will the problem grow back?
Generally in the real world events are causal. The point of root cause analysis is to trace the causes all the way back for your problem. If you don’t find the origin of the causes, then the problem will probably rear its ugly head again.
Root cause analysis is one of the most overlooked, yet important, steps of problem solving. Even engineers often lose their way when solving a problem and jump right into a fix for a red herring.
Typically, driving to root cause follows one of these two routes:
- Start with data—develop theories from that data.
- Start with a theory—search for data to support or refute it.
Either way, team members must always keep in mind that correlation is not necessarily causation.
One tool to use is the 5 Why’s, in which you move down the “ladder of abstraction” by continually asking: “why?” Start with a cause and ask why this cause is responsible for the gap (or error). Then ask again until you’ve bottomed out with something that may be a true root cause.
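To make the ladder concrete, here is a small Python sketch that walks a chain of recorded "why?" answers. The `five_whys` function and the chain of causes are invented for illustration (they extend the leaky roof example with made-up details); in real use each answer comes from investigation, not from a lookup table.

```python
def five_whys(gap, why, max_depth=5):
    """Follow the 'why?' chain from the gap until no deeper cause is recorded.

    `why` maps each effect to the cause the team identified for it.
    `max_depth` caps the number of questions, which also guards against
    circular explanations.
    """
    chain = [gap]
    for _ in range(max_depth):
        cause = why.get(chain[-1])
        if cause is None:
            break  # Nothing deeper is known: a candidate root cause.
        chain.append(cause)
    return chain


# Invented answers; none of this is real data.
why = {
    "water on the floor": "the roof is leaking",
    "the roof is leaking": "shingles are missing",
    "shingles are missing": "a storm blew them off",
    "a storm blew them off": "the shingles were nailed incorrectly",
}

chain = five_whys("water on the floor", why)
for depth, step in enumerate(chain):
    print("  " * depth + step)
```

The last entry in the chain is only a candidate: the team still has to check that fixing it would actually close the gap.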
There are many other general purpose methods and tools to assist in this stage; I will list some of them here, but please look them up for detailed explanations:
- Process flow analysis: Flowchart a process then attempt to narrow down what element in the flow chart is causing the problem.
- Fishikawa: Use an Ishikawa (aka fishbone) diagram to navigate causes of the problem-as-an-effect.
- Pareto analysis: Generate a Pareto chart, which may indicate which cause (of many) should be fixed first.
- Data analysis: Use trend charts, scatter plots, etc. to assist in finding correlations and trends.
- Brainstorming: Generate as many ideas as possible, and elaborate on the best ideas. Also can be used for solution generation.
And that is just the beginning—a problem may need a specific new experiment or data collection method devised.
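As one concrete instance of the tools listed above, a Pareto analysis boils down to counting how often each cause shows up, sorting, and tracking the cumulative share. The sketch below assumes you already have a log with one assigned cause per reported failure; the `pareto_table` helper and the defect log are made up for illustration.

```python
from collections import Counter


def pareto_table(causes):
    """Return (cause, count, cumulative_percent) rows, most frequent cause first."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


# Invented defect log: each entry is the cause assigned to one reported failure.
defect_log = (
    ["loose connector"] * 12
    + ["firmware crash"] * 5
    + ["cracked housing"] * 2
    + ["other"] * 1
)

for cause, count, cumulative in pareto_table(defect_log):
    print(f"{cause:16} {count:3d} {cumulative:6.1f}%")
```

In this invented log, the top cause alone accounts for 60% of the failures, which is the kind of result that suggests where to dig first.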
In software, reproducing a bug is what we ideally hope to achieve as the first step to finding a cause. That can be tricky sometimes. Once I had to make a test script to reboot a computer brick thousands of times, recording data each time, in order to narrow down an intermittent startup issue. Similarly, a Red Hat engineer recently solved a rare bug in Linux kernel 6.4 by booting Linux 292,612 times.
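A repeat-until-it-breaks harness can be tiny. Here is a minimal Python sketch of the idea, not the actual script from either story: the `run_trials` helper and the simulated boot with a roughly 1-in-200 failure rate are assumptions made up for illustration. A real harness would power-cycle the hardware and log whatever data you need from each run.

```python
import random


def run_trials(trial, n):
    """Run `trial` n times and return the indices of the runs that failed.

    `trial` is any zero-argument function that returns True on success.
    """
    failures = []
    for i in range(n):
        if not trial():
            failures.append(i)
    return failures


# Stand-in for a real boot test: a simulated startup that fails about
# 1 time in 200. The fixed seed only makes the simulation repeatable.
rng = random.Random(0)


def simulated_boot():
    return rng.random() >= 0.005  # True means the boot succeeded


n = 10_000
failed_runs = run_trials(simulated_boot, n)
print(f"{len(failed_runs)} failures in {n} boots "
      f"(observed rate {len(failed_runs) / n:.4f})")
```

Recording which runs failed, and not just how many, is what lets you go back and compare the data from failing runs against passing ones.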
Ideally you would have a single root cause, but that is not always the case.
5. Choose the Permanent Solution
The solution must be one or more corrective actions that solve the cause(s) of the problem. Corrective action selection is additionally guided by criteria such as time constraints, money constraints, efficiency, etc.
This is a great time to simulate/test the solution, if possible. There very well might be side effects either in the system you’re trying to fix or in related systems.
Some of the solution options can naturally come out of the previous step. How to come up with solutions beyond that (hopefully given the root cause) is way beyond the scope of this article. Creative problem solving can be extremely important in this step. For something like a software bug, there are usually a few obvious options and then it’s a matter of deploying one. For major complex problems, there’s probably going to be a much bigger space of potential solutions to navigate.
You must verify that the corrective actions will in fact fix the root cause.
Ideally the corrective actions don’t cause bad side effects either. However, when dealing with complex and/or chaotic systems, there may always be side effects to a “solution,” and possibly catastrophic ones. So we’ll need more tools than are in this little article to deal with that. Context will be needed to decide whether a side effect is “good” or “bad.” And does the side effect cause only a temporary disturbance in local feedback control systems? Is it ok, for instance, to lose some percentage of one type of wild organism if that increases another type? Do things naturally go back to some kind of equilibrium after the human intervention? Will improving a variable for humans in one part of the world actually cause devastation for humans in some other part?
6. Implement the Solution and Verify It
This is the stage when the team actually sets into motion the corrective action(s). But doing it isn’t enough: the team also has to check to see if the solution is really working.
For some issues the verification is clear-cut. Other corrective actions have to be evaluated for effectiveness, for instance against some benchmark. Depending on the time scale of the corrective action, the team might need to add various monitors and/or controls to continually make sure the root cause stays squashed.
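For an intermittent fault, "is it really working?" can be turned into a number. Suppose the fault used to show up with probability p on each independent trial. If the fix did nothing, the chance of seeing n clean trials in a row is (1 - p)^n, so you can ask how many clean trials are needed to push that chance below some threshold. The sketch below is a back-of-the-envelope aid under those assumptions (independent trials, a known prior failure rate), and the `clean_runs_needed` helper and the 5% default threshold are my own choices for illustration.

```python
import math


def clean_runs_needed(failure_rate, alpha=0.05):
    """Smallest n such that (1 - failure_rate) ** n <= alpha.

    failure_rate: per-trial failure probability before the fix (assumed known).
    alpha: acceptable chance that n clean runs happened by luck alone.
    """
    if not 0 < failure_rate < 1:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1 - failure_rate))


# Example: a fault that used to appear on about 1 trial in 200.
n = clean_runs_needed(0.005)
print(f"Need {n} consecutive clean runs; "
      f"chance of that by luck is {(1 - 0.005) ** n:.3f}")
```

The rarer the original fault, the more clean runs it takes before a quiet period means anything, which is one reason long-running monitors are sometimes the only practical verification.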
7. Prevent Recurrence
It’s possible that a process will revert back to its old ways after the problem has been solved, resulting in the same type of problem happening again. So the team should provide the organization or environment with improvements to processes, procedures, practices, etc. so that this type of problem does not resurface.
8. Congratulate the Team
Party time! The team should share and publicize the knowledge gained from the process as it will help future efforts and teams.
“Fish discover water last.” (Ethiopian proverb)
The 8D approach is great for certain classes of problems, especially expensive large scale issues, and those that involve an organization (or many orgs). I think elements of this should be more rigorously applied to the big problems of our society and our environment.
For smaller or more local problems, like run-of-the-mill computer code bugs, some of this is still very useful. For instance defining the problem, root cause analysis (debugging) and updating the code with the fix.
Even computer problems may require a temporary stop-gap (contain the problem) for the users / customers while you figure out the real solution and are able to get it deployed properly. As an example, low-level vulnerabilities have once again been found recently in Intel and AMD processors, so AMD and Intel will release mitigations for some processors in the form of microcode patches and/or BIOS updates. But they still need a bigger final resolution: updating manufacturing if they’re still making those chips, and presumably making sure they don’t design the same flaw into newer CPUs.
Massive problems that involve government politics often seem to me like macrocosms of anti-patterns you see in corporations. For instance, the loudest person in the room getting their solution as priority. Or tackling the low-hanging fruit even if that fruit is not the root cause or is not the major blocker. Another classic I see is not even defining the problem properly or people with different problem definitions that can’t agree.
And of course, not agreeing on the potential solutions for big politically-entrenched issues such as climate change results in a lot of cross-talk and arguments, without people necessarily being aware that they’re not even on the same page. For instance, if a group is against your policy to ban something, that doesn’t mean they don’t want the same end result; they may just not want to lose the money or deal with the side effects of your policy, and/or they may not think your solution will work properly at all. So much of that is going on, not just with climate change. If you don’t support my particular “Policy X to Save the Widgets,” then you must hate Widgets and be an evil person. Any other proposed solution to Save the Widgets is obviously a scam or a conspiracy theory or something so dangerous that obviously there would be a consensus amongst experts and politicians to promote that. And so on.
This is not a political article. It’s meta to that: these are literally some generic approaches to solving problems. This will annoy people stuck in the mode of interaction that I see so much these days, which assumes and interprets all texts / speech acts as inherent declarations of in-groups. Even when pretending the equally silly falsehood that most speech acts are statements of fact, there’s this childish mode of operation really running the show: did you say the right words? And if you didn’t say the right words, are you a bad person, are you in another tribe, etc.? This may be leading to a very backwards and ungrounded approach to reality, if you have to start first with the good words and work from there into problem defining and solution generation…and if a solution didn’t work or is incomplete, should you double down on it to avoid other solution spaces that might involve the bad words? And so on…