
The best troubleshooter doesn't look for the fault first


When a system starts acting up, the first reflex is often: open the logs.

I understand that. Logs feel concrete. Something is there. An error message. A timestamp. A stack trace. A code. Something you can sink your teeth into.

But good troubleshooting rarely begins with looking for the fault.

Good troubleshooting begins with making the problem smaller.

Because in complex IT and OT environments, almost everything is connected to everything else. An application talks to a database. That database runs on a server. That server is connected to a network. That network depends on firewalls, switches, VLANs, DNS, certificates, permissions, scripts, interfaces, users, machines, and processes. And sometimes on an Excel file whose purpose nobody quite remembers.

If you start searching without direction, you can spend hours without getting closer to the cause.

You read logs. You see warnings. You find old errors. You discover deviations that may have always been there. Someone shouts that it’s “definitely the firewall.” Another says “the application was already acting strange yesterday.” Meanwhile the pressure grows, more and more people get pulled in, and a technical fog sets in.

Everyone is looking at something.

But nobody knows exactly which problem is being solved.

Not: where is the fault? But: where does the problem stop?

That is why I prefer to start somewhere else.

Is it not working on one workstation, or on all of them? Does it work locally but not from outside? Is it failing for one user, one role, one machine, one network segment, one point in time, one type of file, one production line, one batch, one API call?

What worked before? What has changed? What still works?

Those questions seem simple. Sometimes almost too simple. But they are often more important than the first technical analysis.

A failure is behavior within a context

A system is doing something that is not expected. But that does not automatically mean the system itself is broken.

I was once at an installation where a production line failed after a routine Windows update. Everyone was looking at the PLC software. Logical, because that is where the error message was. But the PLC code had not changed. The environment had.

The update had replaced a driver that handled communication with a serial port. The system was doing exactly what it was supposed to do — only the layer beneath it had been pulled out from under it.

I keep seeing that pattern.

The error message points to the application, but the cause is in the environment. The code hasn’t changed, but the driver has. Or the SSL certificate has expired while the entire team is searching in the application logic. Or data is being routed to the wrong gateway after a network change, and the application gets a timeout that looks like a bug.

One of the most difficult variants: a firewall with deep packet inspection that silently adds something to an HTTP header. The application fails, the request looks normal in the logs, but somewhere along the way the packet is slightly different from what the server expects.

You can spend days in the code without finding it, because the problem is not in the code.

The error message is not the cause. It is only the place where the system began to complain.

A thermometer does not cause a fever. A log line does not cause an outage. An alarm on an HMI is not automatically the problem in the machine. It is a signal. And signals must be read within the whole.

Look for the contrast

That is why I first look for contrast.

This works. That doesn’t. Yesterday it worked. Today it doesn’t. This machine responds normally. That machine doesn’t.

That contrast is the entry point.

From there you can form hypotheses. Not as a wild guess, but as a workable route. If the problem only exists in one network segment, you don’t need to go through all the application code first. If the problem only occurs with new records, you look first at data, validation, or a process change. If the problem only occurs after a certain time, you look at batch processes, certificates, scheduled tasks, resource usage, or external interfaces.
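The search for contrast can be sketched in a few lines of Python. This is a minimal illustration, not a tool: the observation records and their field names (`segment`, `role`, `client_version`) are made up for the example. The function looks for attributes whose values fully separate the failing cases from the working ones.

```python
from collections import defaultdict


def find_contrasts(observations):
    """Return attributes whose values never overlap between working and failing cases.

    Each observation is a dict with a boolean "ok" key plus arbitrary
    descriptive attributes (all hypothetical in this sketch).
    """
    seen = defaultdict(lambda: {"ok": set(), "failed": set()})
    for obs in observations:
        bucket = "ok" if obs["ok"] else "failed"
        for key, value in obs.items():
            if key != "ok":
                seen[key][bucket].add(value)

    contrasts = {}
    for key, values in seen.items():
        # An attribute is a useful contrast when both groups have values
        # for it and those values never coincide.
        if values["ok"] and values["failed"] and not (values["ok"] & values["failed"]):
            contrasts[key] = {
                "ok": sorted(values["ok"]),
                "failed": sorted(values["failed"]),
            }
    return contrasts


# Invented observations: who was affected, and from where.
observations = [
    {"ok": True, "segment": "office", "role": "operator", "client_version": "2.1"},
    {"ok": True, "segment": "office", "role": "admin", "client_version": "2.1"},
    {"ok": False, "segment": "plant-b", "role": "operator", "client_version": "2.1"},
    {"ok": False, "segment": "plant-b", "role": "admin", "client_version": "2.1"},
]

print(find_contrasts(observations))
```

In this invented data set only `segment` separates the two groups, so the network segment is the place to look first. Role and client version appear on both sides and can be set aside for now.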

That way, troubleshooting is not a search through a haystack but a controlled narrowing.

Experience can also mislead

That requires discipline. Because it is tempting to dive directly into the technology. Especially if you are technically strong.

You see an error message and you think: that’s where it is.

You recognize a pattern. You have seen something like this before. You want to solve it.

But sometimes a fault looks like something you know, while the cause is somewhere else. A database error caused by permissions. A network error caused by DNS. An application error caused by expired certificates. A performance problem caused by logging, storage, locking, or a process that runs slightly differently than before.

You see the same in software. A service fails because a downstream dependency returns a different response format after an update. Everyone debugs the service. Nobody looks at the dependency.

Or an API works perfectly in staging but breaks in production. Not because of the code, but because of an environment variable that is set differently, a different certificate, or a stricter network policy.
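A staging-versus-production difference like this is often quickest to find by comparing the two environments directly rather than reading code. The sketch below assumes the settings of each environment have already been exported into flat dictionaries; the keys and values shown are invented for illustration.

```python
def diff_settings(reference, suspect):
    """Compare two flat settings dicts and report every key that differs."""
    differences = {}
    for key in sorted(set(reference) | set(suspect)):
        ref_value = reference.get(key, "<missing>")
        sus_value = suspect.get(key, "<missing>")
        if ref_value != sus_value:
            differences[key] = (ref_value, sus_value)
    return differences


# Hypothetical settings for the environment that works and the one that breaks.
staging = {
    "API_TIMEOUT": "30",
    "TLS_VERIFY": "false",
    "DB_HOST": "db.staging.example",
}
production = {
    "API_TIMEOUT": "5",
    "TLS_VERIFY": "true",
    "DB_HOST": "db.prod.example",
    "HTTP_PROXY": "proxy.example:3128",
}

for key, (stg, prod) in diff_settings(staging, production).items():
    print(f"{key}: staging={stg!r} production={prod!r}")
```

Some differences are expected, such as the database host. The ones nobody can explain are the candidates worth a closer look.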

The system is complaining in the wrong place.

And if you only look where it complains, you are digging in the wrong place.

In OT, behavior is sometimes more important than documentation

In OT environments this becomes even more pronounced.

There, behavior is sometimes more important than what is on paper. An installation has been running for years. People know from experience what is normal. A small delay can have meaning. An unusual sound, a timing difference, a manual action, or an old bypass may have become part of the real system.

Software has its own version of this.

That stored procedure from 2011 that nobody dares touch. That try-catch block that was “temporarily” put in and has now been running in production for three years. That legacy service that should have been replaced, but that twenty other systems depend on.

On paper it is probably not supposed to work that way.

In practice it does work that way.

That is why during outages I like to ask people on the floor — whether they are operators or developers:

What is normal behavior?

Not because they always know the technical cause. But because they often see more quickly what deviates. They know the rhythm of the process. They know what was different yesterday. They know which workaround has been used for months. They know which notification is ignored because it “is always there.”

That kind of information is rarely written down neatly in the documentation. But it is often essential.

Diagnosis, not a meeting

Troubleshooting is therefore not only technology. It is also listening to behavior. Of systems, processes, and people.

First determine the boundary. Then find the change. Then trace the dependencies. And only then go deep in a targeted way.

That also prevents teams from unnecessarily blaming each other.

In complex environments, everyone quickly points to someone else’s domain. Development looks at infrastructure. Infrastructure looks at application management. Application management looks at security. Security looks at policy. Operations looks at “that new update.”

But a good diagnosis takes the emotion out of it.

Not: who caused this? But: under what circumstances does this behavior occur?

That makes the conversation more matter-of-fact. Calmer. And often faster too.

Context makes the difference

In troubleshooting I rely less on heroic searching and more on systematically narrowing the problem down.

Not because logs are unimportant. Logs are important. Monitoring is important. Metrics, packet captures, traces, event viewers, audit logs, and dashboards can be enormously valuable. But only when you know what you are looking for.

Without context, a log file is mainly a collection of technical noise.

With context, it becomes evidence.
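Once you know when something changed, that knowledge can be applied to the log itself. The sketch below keeps only the lines in a window around a known change. It assumes each line starts with an ISO 8601 timestamp followed by a space; the log lines and the change time are invented for the example.

```python
from datetime import datetime, timedelta


def lines_in_window(lines, start, end):
    """Yield log lines whose leading ISO 8601 timestamp falls within [start, end]."""
    for line in lines:
        stamp, _, _rest = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if start <= when <= end:
            yield line


# Invented log excerpt: an old recurring warning plus events around the change.
log = [
    "2024-03-04T02:10:00 WARN cache miss ratio high",
    "2024-03-11T08:59:40 INFO driver loaded",
    "2024-03-11T09:00:12 ERROR serial port COM3 timeout",
    "2024-03-11T13:30:00 WARN cache miss ratio high",
]

change = datetime(2024, 3, 11, 9, 0, 0)  # hypothetical time of the update
window = timedelta(minutes=5)

for line in lines_in_window(log, change - window, change + window):
    print(line)
```

The recurring cache warning, which was there a week earlier as well, drops out. What remains are the two lines that sit right around the change.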

You don’t start by digging. You start by determining where to dig. And sometimes the best first step is not a command, query, or dashboard.

But a few simple questions:

What worked before? What has changed? Where does the problem stop? What is still stable?