A working engineer's argument that the observability stack has been solving the wrong problem for fifteen years.
It's 7pm on a Friday and production is on fire. Your weekend is off to a great start. Obviously.
Your payment service is throwing errors and your company is losing revenue. You open your log tool, filter by service and error level, and the interface returns 847 matching log lines from the last ten minutes.
Now what?
You scroll. Some of the errors are timeouts. Some are null pointers. The log tool has done its job — it found the logs, filtered them down, rendered them quickly. The part it hasn't done, the part still sitting on your shoulders while your friends are chilling, is everything that happens after you find the logs.
Which of these 847 errors are the same issue in different clothes? Which ones caused which? Is it the previous deployment from six hours ago, or the provider that's been flaky all week?
This is the real work of debugging a production system. It's not search. It's reasoning over what search returned.
I've been thinking about this gap while building a log analysis platform of my own, and I've come to a conclusion that sounds obvious once you say it out loud: engineers don't actually want to search logs. They want to know what broke, why, and what to check next.
Log tools have spent years getting really good at the first part and barely touching the second.
Logs Used to Be Hard to Find
I started my career grepping through log files on production servers. For most of the history of software, the hardest part of working with logs was finding them.
You had SSH access to a server, a tail -f command, and a folder full of rotating log files. If production broke, you started typing grep commands and hoped the right service was the one you were currently looking at.
Then came Splunk, Elasticsearch, Datadog and a dozen others. Each of them solved the same problem at scale: get all your logs in one place, index them, and give engineers a fast way to query across them.
If you've ever had to debug a production incident by SSHing into three servers in sequence, you understand why developers have come to love these tools.
This was the right optimization for its time. When the bottleneck is "I physically cannot find the relevant logs in reasonable time," the answer is a better index and a better query interface. Search was the product.
But the bottleneck has moved. Modern backend systems have structured logs, trace IDs, correlation tokens, and distributed tracing. You don't spend 40 minutes finding the relevant logs anymore — you spend 40 seconds. The tooling works. The search layer is solved well enough that it's no longer the thing standing between you and understanding what happened.
The pain point is everything after you have found the search results.
What Debugging Actually Looks Like
Here's what debugging a production incident actually looks like, once search has done its part.
You have your 847 log lines on the screen. The first thing your brain does — after mourning the Friday party you're missing — is group them.
You notice that 600 of the errors say TimeoutException and come from the same endpoint. You mentally file those as one incident. Another 150 are null pointers from a different service. You file those as a second incident. The remaining 97 are a mix you can't immediately classify, so you set them aside.
You've just done pattern detection. No tool helped you. You did it in your head, in about fifteen seconds, using muscle memory you've built up over years of reading logs. It doesn't always go that smoothly, though. I remember staring at a log file for more than thirty minutes with no idea where to start.
Next, you start asking about causality. The timeouts started at 6:42pm. The null pointers started at 6:44pm. You already suspect the timeouts caused the null pointers — some downstream service couldn't get data it needed, and its null check was sloppy. You don't know this yet, but you're operating on that assumption. You're building a mental graph of "this caused that" using timestamps and your own prior knowledge of how the services connect.
Then you start asking about state. How many times has this happened in the last hour? Is it still happening right now, or did it stop? Has this error pattern shown up before? If it has, what did you do about it last time?
Then you start asking about context. Was there a deploy today? (This used to be my go-to question.) Is the database healthy? Is the upstream provider having issues? You open three other tabs — your deploy log, your database dashboard, your provider's status page — and start cross-referencing.
Eventually, you form a hypothesis. Something like: the upstream payment provider started failing at 6:42pm, which caused our payment service to time out, which caused the order service to receive null responses and crash. You don't know if you're right. You start checking.
None of this is search. Search gave you the 847 log lines. Everything above — the grouping, the causality reasoning, the state tracking, the context gathering, the hypothesis forming — is work your log tool didn't do and was never designed to do.
This is the real pain point. Not "find the logs," but "reason over the logs after you find them." And this is the part automation has never reached, because it requires judgment.
The reason this matters now is that judgment-shaped problems are exactly what the current generation of language models is usable for. Not perfect at. Usable for. Grouping similar errors, reconstructing a rough timeline, drafting a plausible cause, suggesting what to check next — these are tasks that were speculative three years ago and are now tractable. Not perfectly, but workably.
The observability stack hasn't absorbed this yet. That's the gap.
Why This Didn't Get Fixed Earlier
Fair question at this point: if the gap is so obvious, why hasn't it been filled already?
It has been attempted. Three times, broadly speaking.
The first direction was anomaly detection. Starting around 2015, several tools attempted to train models on historical logs to automatically flag unusual patterns. If a model has seen a million logs from your system behaving normally, it should be able to tell you when something looks off. The trouble is that one company's authentication logs don't look like another's. A model trained on Company A's logs generalizes badly to Company B's, and there's no clean way to fix that without labeled training data — "this is an incident, this is not" — which nobody had.
The second direction was rules-based alerting. Define thresholds, define patterns, get a notification when they're exceeded. This works, up to a point — and it’s how most production systems are still monitored today.
The problem is that rules are fragile. Your system evolves. Your traffic shifts. Your dependencies change. And suddenly your alerts are firing on noise or missing real incidents. The rules require constant maintenance, and that maintenance is high-skill but low-status work, so it doesn't get done. Most alerting systems I've worked with end up muted. On one team, we'd silenced four out of every five Slack alerts because they'd stopped meaning anything.
The third direction was generative summarization, which is where you might expect this to have been solved years ago. Language models that summarize text have existed in various forms for a long time. The honest answer is the models weren't good enough. GPT-2 could summarize a news article. It could not coherently reason across two hundred structured log entries and produce something an engineer would trust. The summaries were confident, vague, and frequently wrong — which is the worst possible combination for a debugging tool. I tried pasting logs into GPT-3.5 back when it launched. The summaries read like they knew what they were talking about. They didn't.
That changed. The current generation of models, given structured input and a targeted prompt, can produce summaries that are at least useful as a first draft of an engineer's reasoning. They're not replacing the engineer. They're saving the first twenty minutes of the engineer's work. That's enough of a shift to make the category viable.
None of the earlier approaches were wrong. Anomaly detection and rules-based alerting remain the backbone of every observability stack. What's new is that the summarization layer on top of them is finally manageable.
What the New Stack Looks Like
So what does a log tool that takes summarization seriously actually look like?
Four layers, in order of execution.
First, structured log ingestion and search. Still essential. You need the raw logs indexed somewhere you can query them fast, because you always need to fall back to primary evidence. OpenSearch, Elasticsearch, ClickHouse, or similar — the specific tool matters less than the fact that the layer exists. The reasoning built on top of this is only as good as the search underneath it.
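To make that layer concrete, here is roughly the shape of query it has to answer fast, written as an Elasticsearch/OpenSearch-style bool filter built in Python. The field names ("service", "level", "@timestamp") are assumptions about the log schema, not anything prescribed by a particular tool:

```python
# Sketch: the faceted query behind "filter by service and error level,
# last ten minutes". Field names are hypothetical schema choices.
def build_error_query(service: str, minutes: int = 10) -> dict:
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
        "sort": [{"@timestamp": "desc"}],
    }

query = build_error_query("payment")
```

Nothing clever here, which is the point: this layer is a solved problem, and everything above it depends on it staying fast.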
Second, deterministic pattern detection. Before any LLM runs, you group related errors using rules. Same service, same level, same exception type, same endpoint — that's one pattern. Three matches within five minutes is an incident. These are rules, not AI.
They're boring, they're predictable, and they're the foundation that lets the AI layer be useful. If you skip this step and let an LLM group errors directly, you get garbage, because the LLM has no stable way of deciding what "similar" means. In TraceRoot, the platform I'm building, this grouping is deterministic and repeatable — not probabilistic — so the same logs always produce the same incident.
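As a sketch, the whole rule fits in a few lines. The fingerprint fields and the three-matches-in-five-minutes threshold come straight from the description above; the function names and event schema are illustrative:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Fingerprint: same service, same level, same exception type, same endpoint.
def fingerprint(event: dict) -> tuple:
    return (event["service"], event["level"],
            event["exception"], event["endpoint"])

# A pattern becomes an incident when `threshold` matching events
# land inside a single `window`. No model involved: pure rules.
def detect_incidents(events: list[dict], threshold: int = 3,
                     window: timedelta = timedelta(minutes=5)) -> set[tuple]:
    buckets = defaultdict(list)
    for e in events:
        buckets[fingerprint(e)].append(e["ts"])
    incidents = set()
    for fp, times in buckets.items():
        times.sort()
        # Slide a window of `threshold` events; if any such window
        # spans five minutes or less, the pattern qualifies.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                incidents.add(fp)
                break
    return incidents
```

Because the fingerprint is a plain tuple comparison, the same input always produces the same grouping, which is exactly the property the LLM layer needs from its input.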
Third, incident lifecycle. An incident is not an event. It's a stateful object with a first-seen timestamp, a last-seen timestamp, an event count, and a status (active or resolved). When a resolved pattern reappears, it reopens instead of creating a new incident. This sounds obvious, but most log tools don't model it — they just show you the events and leave the state in your head. You have to figure it out yourself.
On a Friday evening, this is how weekends die: I've watched a resolved bug come back five hours after the fix shipped, while the team was still celebrating. Modeling the lifecycle also keeps the AI layer cheap, because incident summaries can be cached and regenerated only when the underlying event count or last-seen timestamp changes.
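A minimal sketch of that stateful object (class and field names are mine; the point is the reopen-instead-of-duplicate behavior):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Incident:
    fingerprint: tuple          # (service, level, exception, endpoint)
    first_seen: datetime
    last_seen: datetime
    count: int = 1
    status: str = "active"      # "active" or "resolved"

    def record(self, ts: datetime) -> None:
        # A resolved pattern that reappears reopens the same incident
        # rather than spawning a fresh one.
        if self.status == "resolved":
            self.status = "active"
        self.count += 1
        # count and last_seen changing is what invalidates a cached summary.
        self.last_seen = max(self.last_seen, ts)
```

The bug-came-back-five-hours-later case is just record() on a resolved incident: same object, same history, status flipped back to active.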
Fourth, the reasoning layer. The LLM reads the grouped, structured incident context — metadata plus a sample of the matching logs — and produces a summary, a probable cause, and recommended checks. It is not reading raw logs. It is reasoning over structured input that has already been filtered, grouped, and annotated by the three layers beneath it. This is the difference between "pasting logs into ChatGPT" (which produces garbage) and "asking a model to reason over a specific incident pattern" (which works).
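Here is roughly what that input shaping looks like in Python. The prompt wording, field names, and sampling policy are illustrative assumptions, not any real system's implementation; the point is that the model receives grouped metadata plus a bounded sample, never the raw stream:

```python
import json

# Build the structured context the reasoning layer sees. The incident
# dict and log schema here are hypothetical.
def build_incident_prompt(incident: dict, logs: list[dict],
                          sample: int = 20) -> str:
    context = {
        "service": incident["service"],
        "exception": incident["exception"],
        "endpoint": incident["endpoint"],
        "first_seen": incident["first_seen"],
        "last_seen": incident["last_seen"],
        "event_count": incident["event_count"],
        # Bounded sample: the model reasons over representatives,
        # not over every one of the 847 matching lines.
        "sample_logs": logs[:sample],
    }
    return (
        "You are assisting with production incident triage.\n"
        "Given this grouped incident, produce: (1) a one-paragraph "
        "summary, (2) a probable cause, (3) recommended checks.\n\n"
        + json.dumps(context, indent=2)
    )
```

Everything inside the context dict was produced deterministically by the layers below, which is what separates this from pasting a log dump into a chat window.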
The order matters. Determinism before intelligence. The LLM sits at the top because it works dramatically better when the input has been shaped. Reverse the order and you get 2023-era demos that break on real data.
This is the architecture I'm building in TraceRoot — an incident-first log reasoning system. It's not the only way to do it.
What This Post Is Not Arguing
It's not arguing that search is obsolete. You still need fast, faceted search over raw logs — for ad hoc investigation, for debugging patterns the system hasn't seen before, for verifying what the AI layer tells you. The reasoning layer sits on top of search. It doesn't replace it. If the search layer is broken or slow, the reasoning layer is useless.
It's not arguing that LLMs should make decisions. The summary, probable cause, and recommended checks are a first draft of an engineer's reasoning — a starting point to accept, reject, or refine. The engineer is still the one paging the on-call, reverting the deploy, or opening the incident bridge. The LLM is triage assistance, not triage replacement.
And it's not arguing that this architecture is finished. The summarization layer works, but "works" is a low bar. The evaluations are still young, the prompt engineering is still fragile, and there are failure modes I haven't seen yet. I'll write about those as I run into them.
Back to Friday, 7pm
The payment service is throwing errors. Your weekend hangs in the balance.
In the old stack: pattern detection in your head. Causality in your head. State tracking across browser tabs. A hypothesis from experience. Twenty minutes before you know what you're actually dealing with.
In the new stack, you open the incident — not the logs, the incident — and you see it already grouped. Same service, same exception type, same endpoint. Event count: 847. First seen at 6:42pm.
The summary says the upstream payment provider looks degraded; the recommended checks point you at the provider's status page and the last deploy. You don't take the summary on faith. You verify it against the raw logs, which are one click away. The twenty minutes of reconstruction becomes one minute of reading and two of validation.
The engineer is still doing the work. The stack is doing less of the wrong kind of work, so the engineer can do more of the right kind.
That's the shift. Not "AI replaces debugging." Not "observability gets automated." Just: understanding gets faster, and the weekend starts on time.
I'm building TraceRoot to explore this. Over the next few months I'll be writing about the specific design decisions — how to define an incident, why LLM summarization over raw logs fails until you fix the input shape, what storage architecture supports this workload, what the failure modes look like.
This post was the frame. The rest is the work.
If you've tried to solve the "reasoning over logs" problem — with LLMs, with rules, with anything — I'd genuinely like to hear what worked and what didn't. This is an area where everyone's still figuring it out, and I don't think there's a settled answer yet.