Measuring What Matters
Hello there 👋, I recently gave a talk at React Mumbai about frontend observability - specifically about how the way most teams monitor their frontends today leaves a massive blind spot in understanding what users are actually experiencing. I thought I would turn that talk into a written post for those who prefer reading over watching.
This post is about moving beyond error counts and server-side metrics to something that actually tells you whether your users can get things done. We'll go from the basics of SLIs and SLOs, all the way to building tooling and automation on top of them.
If you'd prefer to watch the talk instead, here's the recording.
Everything Looked Green… but Users Were Stuck
Let me start with a story.
One fine afternoon, we noticed a sudden spike in support tickets. At the same time, there was noise on social media about something not working. So we did what any good engineering team does - we pulled up all our dashboards.
Error rates? Normal. Latency charts? Green. CPU and memory? All fine. API success rates? Looking healthy. Every single dashboard we had was telling us that everything was absolutely fine.
After some digging, we figured out what was going on. There was a particular scenario where a specific account type, combined with a certain feature flag being enabled, caused the checkout flow to break. When users clicked the "Continue" button, it would just keep spinning. Forever.
The thing is, no request was ever sent to the server. The button was stuck in a loading state on the frontend itself. No request meant no error. No error meant nothing showed up on any of our dashboards.
That's when it hit us - our monitoring was only listening to what the server was telling us. It had no idea what our users were actually experiencing.
Why Monitoring Matters (Especially for Frontends)
This isn't just a "we had a bad day" story. The problem is more fundamental than that.
Think about how we build software today. We ship fast. We run multiple feature flags, experiments, and deploys - sometimes multiple times a day. That's great for iteration speed, but it also means the surface area for regression is constantly growing. Every time you push something to production, there's a chance something breaks in a way you didn't anticipate.
And here's the thing about frontends - they are really good at hiding failure. You've got error boundaries catching crashes. Retry logic silently retrying failed requests. Loading spinners masking slow responses. Fallback UIs gracefully degrading broken components.
All of that is great for user experience. But it also means that things can go very wrong without showing up as a hard error anywhere - not on the frontend, and not on the backend.
When you combine frequent changes with frontends that hide failure well, you get a gap. A gap between what your system thinks is happening and what your users are actually going through. What we want is fast, accurate signals of real user pain.
What Most Teams Monitor Today
If you look at how most teams monitor their frontends, the picture is pretty standard:
| Signal | What it's good for |
|---|---|
| Error counts (Sentry, logs), JS exceptions | Debugging, deep investigation, identifying noisy errors, keeping an eye on error rates |
| Backend 5xx, latency, CPU, memory | System health, overload detection, service failure alerting ("is this node overloaded?") |
These tools are genuinely useful. Sentry gives you stack traces, groups similar errors together, and tells you exactly which line of code threw. Infrastructure metrics tell you when a service is under pressure.
But here's the problem - none of these tools tell you what the user was trying to do when something went wrong. They can tell you that a function threw an error, but not that the user was trying to complete a checkout. They tell you about system behaviour, not user experience.
Why Error Counts Don't Tell You the Whole Story
Let's say you have 100 errors in the past 5 minutes. Is that bad?
Well, if you had 10 lakh (1 million) requests in those 5 minutes, 100 errors is actually pretty good. But if you only had 120 requests and 100 of them errored out - that's catastrophic.
Raw error counts are completely dependent on how much traffic your application is receiving. When traffic doubles because of a marketing campaign, your error counts will almost always go up even if nothing in the system has changed. On the flip side, a serious issue that's silently failing on an important flow might generate so few errors that you don't even notice it.
And that's the core issue:
A spike in errors ≠ users are blocked. A low error rate ≠ users are happy.
Error tools don't tell you what percentage of users are blocked, which flows are broken, or which customers and regions are affected. They're focused on system internals, not on what users are actually trying to accomplish.
SLIs and SLOs: From Error Counts to Success Rates
This is where SLIs and SLOs come in. You might have heard of these in the context of backend services and SRE, but they apply just as well to frontend monitoring.
The two core concepts are simple:
SLI (Service Level Indicator) - A measurement of something you care about. Example: The percentage of successful responses for POST /rest/api/checkout.
SLO (Service Level Objective) - A target for that measurement over time. Example: The checkout success rate should be ≥ 99.9% over 30 days.
The formula is straightforward:
success_rate = successful_events / total_events * 100
So for a REST API, you might define it like this:
SLI_checkout_api = (2xx + 3xx) / (2xx + 3xx + 4xx + 5xx) for POST /rest/api/checkout
SLO: SLI_checkout_api ≥ 99.9% over 12 hours
The difference between this and raw error counts is significant. You're no longer staring at error spikes in isolation. You're asking: are we keeping our promise to our users about how often this operation works?
This shift from raw counts to percentages of successful events out of total events is important because it normalises for traffic. It turns a vague feeling of "things don't seem right" into a concrete statement like "our checkout SLI has dropped to 99.5%, something is wrong."
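As a rough sketch, that normalisation is just a ratio. The function below is illustrative rather than taken from any particular tool:

```javascript
// Compute an SLI as a success percentage, normalised for traffic.
// `successes` and `failures` are event counts over the SLO window.
function computeSli(successes, failures) {
  const total = successes + failures;
  if (total === 0) return 100; // no traffic: treat the target as met
  return (successes / total) * 100;
}

// The same 100 errors mean very different things at different traffic levels:
computeSli(999_900, 100); // ≈ 99.99 - healthy
computeSli(20, 100);      // ≈ 16.67 - catastrophic
```

That traffic-dependence is exactly the property raw error counts lack: the count is identical in both cases, but the SLI tells you which one is an incident.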
But Generic Endpoint SLIs Still Miss the Point
Now, if you define SLIs on your REST API endpoints like we just did, you're already in a much better place than just tracking error counts. You'll know when something is going wrong with your checkout backend, and you'll see the success rate dropping in a meaningful way.
But there's still a missing piece.
You're measuring the health of a technical endpoint, not the health of a user journey. In most cases, a single endpoint is going to be shared across multiple different UIs. The same checkout API might be used by:
- Your desktop web checkout
- Your mobile web checkout
- Your mobile app
- Internal admin workflows
- Maybe some bulk or wholesale flows
So when your SLI_checkout_api drops, you know something about checkout is unhappy. But you don't actually know which user task is broken or who exactly is impacted.
The problem is that endpoint health doesn't map one-to-one to user intent. When the SLI dips, you have to do a lot of extra work to figure out which specific user experience is failing and how bad it is.
Endpoint SLO dip → "Something with checkout is wrong" (vague)
Task SLO dip → "Users can't successfully checkout" (actionable)
Model Your SLOs After User Tasks
This is the key idea: monitor what users actually do, not what your APIs do.
Users don't care about your microservices. They don't care about your endpoint latency or your CPU usage. They care about whether they can complete the task they came to your platform to do - checkout, sign up, upload a file, search for something.
Think in terms of tasks: "checkout", "browse product", "add to cart", "submit form". These are real units of work that your users care about. They need to be reliable.
The trick is in how you define your measurements. Instead of measuring API responses, you define SLIs around these key user tasks. Whenever someone attempts to complete a checkout, that's a checkout task attempt. It either succeeds or it fails.
Each attempt becomes a single semantic event that you can classify as succeeded or failed. Under the hood, that single task might involve five different services and three different API endpoints. But from the user's perspective, it's one simple question: did my checkout work or not?
The benefits of this approach:
- User intent is baked into your metrics. You know exactly which task is in trouble when something starts failing.
- SLOs become meaningful to everyone. An SLO that says "at least 99.95% of checkout attempts should succeed" is a promise you can explain to anyone - product managers, support engineers, leadership. Everyone understands "checkout is failing."
- They're directly tied to business outcomes. A checkout failure is lost revenue. A sign-up failure is a lost user. These aren't abstract numbers anymore.
Here's what a task-based SLI and SLO looks like:
SLI_checkout_task = success("user is able to checkout") /
(success("user is able to checkout") + fail("user is able to checkout"))
SLO: SLI_checkout_task ≥ 99.95% over 30 days
What If the Flow Isn't Broken⦠Just Slow?
So far we've only talked about hard success and failure - checkout calls that don't go through, sign-ups that error out. But in practice, a lot of user pain doesn't come from things being completely broken. It comes from things being too slow.
A checkout that takes 15 seconds to complete is still technically a success. But from the user's perspective, it might as well be down. People will abandon the flow, refresh the page, or go to your competitor.
This is a blind spot. If your task SLI only measures "did it eventually succeed or not?", you could proudly report that 99.99% of checkouts are succeeding while users are sitting there staring at a spinner for 10 seconds on every attempt. The SLI looks great. The user experience is terrible.
Those slow successes don't increment any failure counter, so they won't show up in your error-based monitoring at all. That's why you need to think about performance as a part of reliability, not as a separate nice-to-have optimisation.
When we talk about the health of a task, we need to ask two questions:
- Did it succeed?
- Was it fast enough?
If you only measure the first one, you miss a whole class of incidents where the system is technically working but practically unusable.
Performance SLOs: Speed as Part of Reliability
The simplest way to do this is to measure the duration of each task. Start a timer when the user initiates the action (clicks the submit button, starts the upload) and stop it when the operation completes - whether it succeeds or fails.
Then you build a distribution on top of that. What's the median time? What's the 95th percentile? The 99th? And you set SLOs around those percentiles:
- 95% of all checkout tasks should complete in under 1 second
- 99% of all checkout tasks should complete in under 2 seconds
The key idea here is that a task that is too slow is effectively a failure. Speed is a critical part of reliability. When you say "checkout is reliable", you actually need two things: it works, and it works fast enough that users don't abandon the flow.
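As an illustrative sketch, here's how a percentile could be computed from raw duration samples. In practice your observability backend builds these distributions from histograms, so you'd rarely write this yourself:

```javascript
// Return the p-th percentile (0 < p <= 100) of task durations in ms,
// using the nearest-rank method on a sorted copy of the samples.
function percentile(durations, p) {
  const sorted = [...durations].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Example: eight checkout attempts, one slow outlier.
const checkoutDurationsMs = [320, 410, 380, 900, 450, 2100, 390, 430];
percentile(checkoutDurationsMs, 50); // 410 - the median looks fine
percentile(checkoutDurationsMs, 95); // 2100 - the tail blows the 1s target
```

Note how the median hides the outlier entirely; that's why the SLO targets above are set on p95 and p99 rather than on averages.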
What If the Issue Is Cohort-Specific?
Quick quiz: what's special about these dates?
- October 19, 2025
- July 30, 2024
- June 13, 2023
These are dates when AWS US-East had significant issues. If your systems relied on infrastructure in that region, you felt it. A lot of engineers, at a lot of different companies, had very interesting days.
This brings up an important problem. Even if you've modelled your SLOs around user tasks - which gets you much closer to what users actually feel - there's another dimension you need to handle. Sometimes the problem isn't global. It's specific to a subset of your users.
Imagine you're tracking a task-based SLI for checkout, and globally it's at 99.95%. Looks healthy, right? Your SLO of 99.9% is comfortably met. But if you slice that same metric by region, you might find that the EU region is sitting at 97% - way below your target.
The reason your global SLO looks fine is because other regions like US and APAC have very high success rates that are pulling the average up. On aggregate, everything looks green. But for every user in the EU, checkout is effectively broken.
This can happen for all sorts of reasons - a misconfigured CDN edge, a cloud provider having issues in a specific region, a third-party payment provider being down in one geography.
Slicing SLOs: By Region and By Tier
Once you have good task-based SLIs, you can ask a better question: who exactly is impacted when things go wrong?
By Region:
SLI_checkout_task{region="US"}
SLI_checkout_task{region="EU"}
SLI_checkout_task{region="APAC"}
Global SLO might still be green while one region is red. This helps you spot: "Checkout is failing mostly in EU."
By Tier:
SLI_checkout_task{tier="premium"}
SLI_checkout_task{tier="standard"}
Premium customers may have stricter SLOs. This lets you say: "Premium checkout is breaching, free tier is fine."
You can combine these dimensions however makes sense for your system. Maybe you care about enterprise vs. free-tier users. Maybe you segment by device type. The point is that a single global number can hide a lot of pain for specific cohorts of users.
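A sketch of what that slicing looks like over raw task events - the event shape here is an assumption for illustration, not any specific tool's schema:

```javascript
// Compute a per-cohort SLI by slicing task events on a label such as
// `region` or `tier`.
function sliByLabel(events, label) {
  const buckets = {};
  for (const e of events) {
    const key = e[label] ?? 'unknown';
    if (!buckets[key]) buckets[key] = { success: 0, total: 0 };
    buckets[key].total += 1;
    if (e.status === 'succeeded') buckets[key].success += 1;
  }
  const slis = {};
  for (const [key, { success, total }] of Object.entries(buckets)) {
    slis[key] = (success / total) * 100;
  }
  return slis;
}

const checkoutEvents = [
  { task: 'checkout', status: 'succeeded', region: 'US' },
  { task: 'checkout', status: 'succeeded', region: 'US' },
  { task: 'checkout', status: 'succeeded', region: 'EU' },
  { task: 'checkout', status: 'failed',    region: 'EU' },
];
sliByLabel(checkoutEvents, 'region'); // { US: 100, EU: 50 }
```

The aggregate here is 75%, but the per-region view is what tells you the problem is confined to EU.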
For example, here's what a system health dashboard might look like when you slice by both task and cohort:
| | Checkout | Sign-up | Search | Upload |
|---|---|---|---|---|
| Global | ✅ 99.98% | ✅ 99.99% | ✅ 99.97% | ✅ 99.96% |
| US | ✅ 99.99% | ✅ 99.98% | ✅ 99.99% | ✅ 99.95% |
| EU | ⚠️ 99.5% | ✅ 99.96% | ⚠️ 99.2% | ✅ 99.98% |
| APAC | ✅ 99.97% | ✅ 99.95% | ✅ 99.98% | ⚠️ 99.4% |
| Enterprise | ✅ 99.99% | ✅ 99.99% | ✅ 99.99% | ✅ 99.99% |
| Standard | ✅ 99.96% | ✅ 99.95% | ✅ 99.97% | ✅ 99.94% |
If a non-engineer can look at this table and immediately understand that Checkout and Search are having issues in EU, you're doing your job well.
UI Task Events: The Frontend Implementation
This might sound like a lot of complex infrastructure, but the frontend implementation is actually quite simple. It's basically an analytics event.
Every important UI task has a lifecycle: started → succeeded / failed. You start a timer when the task begins, and then fire a success or failure event when it completes. Here's what the code looks like:
```javascript
const handleCheckoutFormSubmit = async () => {
  startTask({ task: 'checkout' })
  try {
    await api.checkout(payload)
    succeedTask({ task: 'checkout' })
  } catch (error) {
    failTask({ task: 'checkout', errorType: classifyError(error) })
  }
}
```
That's it. For every important user task in your UI - checkout, file upload, form submission, whatever matters for your product - the frontend fires a small lifecycle event: start, then succeed or fail.
What you end up with is not a jungle of low-level logs, but one clean semantic event per user attempt, telling you:
- What task was attempted (checkout, upload, sign-up)
- Whether it succeeded or failed
- How long it took (duration in milliseconds)
- Extra context (error messages, status codes, region, user tier)
This keeps frontend instrumentation lightweight and systematic. You can send these events to whatever observability tool you use - SignalFx, Datadog, Grafana (if you want open source), or anything else that supports custom metrics.
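If you're curious what those helpers might look like under the hood, here's a minimal sketch. The event shape and the `emitEvent` sink are my own assumptions - you'd wire the sink into whatever analytics or metrics client you actually use:

```javascript
// Minimal task lifecycle helpers: one in-memory timer per task, one
// semantic event per attempt. Names and event shape are illustrative.
const inflightTasks = new Map();
const emittedEvents = []; // placeholder sink - replace with your real pipeline

function emitEvent(payload) {
  // e.g. analytics.track('ui_task', payload) or a custom-metrics client
  emittedEvents.push(payload);
}

function startTask({ task }) {
  inflightTasks.set(task, Date.now());
}

function endTask(task, status, extra) {
  const startedAt = inflightTasks.get(task);
  inflightTasks.delete(task);
  emitEvent({
    event: 'ui_task',
    task,
    status,
    durationMs: startedAt != null ? Date.now() - startedAt : null,
    ...extra, // errorType, region, tier, and so on
  });
}

const succeedTask = ({ task, ...extra }) => endTask(task, 'succeeded', extra);
const failTask = ({ task, ...extra }) => endTask(task, 'failed', extra);
```

A real implementation would also need to handle concurrent attempts of the same task and abandoned attempts (the user navigates away mid-task), but the core shape stays this small.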
Building Tooling on Top of SLOs
Up to this point, we've talked about how task-based SLOs are better signals - they're user-centric, region-aware, tier-aware, and performance-aware. But here's where it gets really interesting.
Once you have these SLIs and SLOs set up in your monitoring system, they become more than just pretty graphs you look at occasionally. They become a platform. And at that point, you can start automating what is currently manual.
Safer Rollouts with Feature Flags
If you're building something new, you probably don't want to ship it to 100% of your users all at once. That's what feature flags are for - you roll out to 1%, check monitoring, then 5%, 50%, 100%.
But that checking step is usually manual. Someone looks at the dashboards and decides "looks good, let's increase the rollout."
Many feature flagging tools support a pattern where you can define SLOs as guardrail metrics. When you roll out new code behind a feature flag, the tool monitors your SLOs. If the SLO dips below the threshold, the rollout is automatically paused or rolled back.
This gives you a very tight feedback loop. You can safely expose a risky change to a small set of users and let the SLOs act as an automatic safety net. No more staring at dashboards manually during every rollout.
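To make that loop concrete, here's a sketch of what such a guardrail could look like. `getCurrentSli` and `setRolloutPercent` are hypothetical hooks - in a real system both would be async calls into your flag and metrics systems, and you'd wait for enough traffic at each stage before reading the SLI:

```javascript
// Sketch of an SLO guardrail for a staged rollout. Synchronous hooks are
// used here only to keep the sketch short; real hooks would be async.
function advanceRollout(stages, sloTarget, { getCurrentSli, setRolloutPercent }) {
  for (const percent of stages) {
    setRolloutPercent(percent);
    const sli = getCurrentSli();
    if (sli < sloTarget) {
      setRolloutPercent(0); // automatic rollback: the SLO is the safety net
      return { status: 'rolled_back', at: percent, sli };
    }
  }
  return { status: 'complete' };
}

// Usage: stage through 1% → 5% → 50% → 100%, guarded by the checkout SLO.
// advanceRollout([1, 5, 50, 100], 99.9, { getCurrentSli, setRolloutPercent })
```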
SLO-Based Alerting
Traditional alerting typically relies on threshold-based signals - "alert me when the error rate is above X" or "page me when there are more than 100 errors in 5 minutes."
This is incredibly noisy. You get paged for flaky, low-impact issues. You get woken up at 3 AM for spikes that resolve themselves. Eventually, the on-call engineer starts ignoring alerts because most of them aren't actionable. And then one day, a real incident happens and nobody's looking.
With SLOs, you can flip this around. Instead of alerting on raw error thresholds, you alert when the SLI has dipped by a meaningful amount. And you can do this at different granularities:
- Alert when the global SLI dips by 1%
- Alert when a specific region's SLI dips by 1%
- Alert when enterprise-tier users' SLI starts breaching
This way, you get paged for real user-facing problems. Every alert is actionable - it tells you which task is in trouble, where, and roughly how bad it is.
Alert Priority Based on Rate of Decline
When you start setting up alerts, it's tempting to treat every SLO dip as equally urgent. But in practice, that'll burn you out.
What you really want is to prioritise alerts based on how fast the SLO is declining, not just whether it dipped:
- Steep, sudden drop (e.g., checkout going from 99.9% to 95% in 15 minutes) → Critical. Page now. Even if it's 3 AM, you're going to want to deal with this before the morning.
- Slow, gradual decline (e.g., SLI has been slowly degrading over the past few weeks) → Lower priority. Investigate during business hours. This is a P2, not a P0.
The idea is to look at both the current level of the SLI and the rate of change. A fast burn demands immediate attention. A slow burn is a signal that something needs investigation, but it's not an emergency.
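As a sketch, a prioritisation rule can combine the current level with the rate of change. The thresholds below are illustrative - production systems typically use multi-window burn-rate alerts instead, but the shape of the decision is the same:

```javascript
// Classify an SLO alert by how fast the SLI is declining, not just
// whether it dipped. Thresholds are illustrative - tune for your budgets.
function classifyAlert(previousSli, currentSli, windowMinutes, sloTarget) {
  const dropPerHour = ((previousSli - currentSli) / windowMinutes) * 60;
  if (currentSli >= sloTarget) return 'ok';
  if (dropPerHour >= 1) return 'critical'; // fast burn: page now
  return 'investigate';                    // slow burn: business hours
}

classifyAlert(99.9, 95.0, 15, 99.9);      // steep 15-minute drop → 'critical'
classifyAlert(99.92, 99.85, 10080, 99.9); // week-long slow decline → 'investigate'
```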
Automatic Incidents from SLO Breaches
If you have an incident management system, creating incidents from SLO breaches is a great thing to automate. Instead of relying on a human staring at a dashboard and deciding whether things are bad enough to fire an incident, you set up a rule: when a task SLO breaches, automatically open an incident.
You can prefill the incident with:
- The task name ("checkout")
- Impacted cohorts ("EU region, enterprise tier")
- Links to the relevant monitoring dashboards
This shortens the time from detection to mitigation significantly. You're no longer spending 10 minutes deciding "is this bad enough to create an incident?" - the system does it for you.
Using AI to Guess "What Broke?"
This is one of the more interesting use cases I've seen. When something goes wrong, you typically have a pattern of signals - maybe the checkout success SLI is dropping, and the performance SLO for a related task is also dipping, and it's only happening in one region.
Humans can look at all of that and reason about what could have happened, but that takes time. And during an incident, time is exactly what you don't have.
With well-labelled SLOs, you can feed that pattern into an AI system and ask it to come up with hypotheses. Given signals like "checkout failing in APAC region" and "only affecting free-tier users", AI can propose likely causes:
- "Multiple tasks failing in APAC → maybe a widespread regional outage"
- "Only free users affected → maybe a feature flag for that tier is misconfigured"
The key insight is that AI becomes effective here because SLOs are already modelling detailed user tasks rather than generic error counts. The structured, semantic nature of task-based SLOs gives AI much better signal to work with, especially when combined with historical incident data and recent deployment information.
How to Get Started
This whole thing might sound like a massive infrastructure project, but it doesn't have to be. You don't need a perfect SLO program to get value from this. Here's how I'd suggest getting started:
1. Pick 1–3 critical user tasks. Choose the ones where a failure would hurt the most - checkout, sign-up, whatever is core to your product.
2. Instrument them with task events in your frontend. Add the startTask/succeedTask/failTask pattern for those tasks. This is just analytics events - you probably already have the infrastructure for this.
3. Turn those into simple success/fail metrics. Push the events to your observability tool and create a basic success rate metric.
4. Define a basic SLO and alert on obvious breaches. Start with something simple like "checkout success rate ≥ 99.9% over 30 days" and set up an alert for when it dips.
5. Iterate. Once the basics are working, add performance SLOs (latency), then add region and tier segmentation, then start automating (feature flag guardrails, automatic incidents, etc.).
Even one well-instrumented task with a clear SLO and a simple dashboard will catch issues earlier than a wall of error count charts. And it will catch them in a way that aligns with what your users actually experience.
Wrapping Up
The journey looks like this:
Raw error counts → Endpoint SLIs → Task-based SLOs → Latency SLOs → Cohort-aware SLOs → Automation and tooling
You don't have to do all of this at once. Start small. Pick one task. Instrument it. Define an SLO. Build a dashboard. You'll be surprised how much visibility even that one signal gives you.
The fundamental shift is simple: stop asking "is the system healthy?" and start asking "can users get things done?"
If you'd like to watch the full talk, here's the Loom recording.
Did you like what you read?
You can follow me on Twitter or LinkedIn to get notified when I publish new content.
