Splunk — Case Study

They weren't lacking data. They were lacking clarity.

How I helped engineers find answers — not data — during live outages.

RUM · Dashboard Design · Enterprise · Splunk

~30%

Faster diagnosis

~$1.4M

Estimated annual savings

3 min

Time to first action

Context

Every minute costs money.

Imagine: United.com crashes on Thanksgiving. Engineers open Splunk.

50+

Dashboards to check

10+

Tabs open at once

$108K

Lost per minute

When I inherited this dashboard, I saw the same panic every time. Engineers weren't slow because they were bad at their jobs. They were slow because the system made them hunt for answers.

They weren't missing data. They were drowning in it.

Before

The Scavenger Hunt

“Check Tab 1... not here. Tab 2... maybe? Tab 3... wait, go back to Tab 1 and cross-reference...”

Diagram showing the scavenger hunt workflow with multiple tabs and cross-referencing
Engineers had to jump between tabs, mentally correlating disconnected metrics

Average time to first action: 12 minutes

  • 15 metrics — engineers had to already know what to look for
  • 8 panels — nothing said “start here”
  • And this was just one tab.
Screenshot of the old RUM dashboard with 15 metrics across 8 panels
The old RUM was a library — everything was there, but nothing said 'start here.'

Process

3 months. No clean slate. Ship anyway.

I led design on this project, partnering with a principal designer from another team — we shared a layout system across products. No time for large-scale research. So I mined 2 years of support tickets and talked to engineers who'd lived through real outages.

Engineers don't need more data. They need less noise.

Insight

From library to control tower.

The pattern was clear: they weren't slow because they lacked information. They were slow because the system didn't tell them where to start.

The question wasn't “how do we show more?” It was “how do we show less — but the right things first?”

After

The Control Tower

Diagram showing the redesigned control tower approach
From scattered data to structured decision flow

Time to first action: 3 minutes

  • Health check at the top — Page views and Duration tell you immediately if something's wrong
  • URLs ranked by impact — see which pages are affected
  • Browser/OS breakdown grouped together — no more tab switching
Screenshot of the redesigned RUM dashboard with health check, impact ranking, and grouped breakdowns
I cut the default view to what drives decisions under pressure. Clarify first. Expand later.

The top of the screen answers: is something wrong? The middle answers: where? The bottom lets you drill in if you need to.

Dropdown

I changed the starting question.

The dashboard wasn't the only problem. Even finding the right metric was a guessing game.

Before: flat list of 15 technical metrics. After: organized by role (UX, Frontend, Backend, Network)
Before: 'Which of these 15 is right?' → After: 'What's your role? Start there.'

Before, the dropdown was a flat list of 15 technical metrics. Engineers had to already know the answer to pick the right one. I reorganized it by role: UX, Frontend, Backend, Network. Now engineers start from what they know — their job — and narrow from there.
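The regrouping can be pictured as a small data transformation: a flat list becomes a role-first menu. The sketch below is illustrative only. The metric names are hypothetical placeholders (the case study does not list the actual 15 metrics), and `group_by_role` is an invented helper, not a Splunk API.

```python
from collections import defaultdict

# Hypothetical metric names, each tagged with the role most likely to reach for it.
# The real dropdown held 15 metrics; these six are placeholders for illustration.
METRIC_ROLES = {
    "largest_contentful_paint": "UX",
    "cumulative_layout_shift": "UX",
    "js_errors": "Frontend",
    "resource_load_time": "Frontend",
    "server_response_time": "Backend",
    "dns_lookup_time": "Network",
}

# Roles in the order named in the case study.
ROLE_ORDER = ["UX", "Frontend", "Backend", "Network"]


def group_by_role(metric_roles):
    """Turn a flat metric -> role mapping into an ordered role -> metrics menu."""
    grouped = defaultdict(list)
    for metric, role in metric_roles.items():
        grouped[role].append(metric)
    # Keep roles in a fixed order and sort metrics within each role.
    return {role: sorted(grouped[role]) for role in ROLE_ORDER if role in grouped}


menu = group_by_role(METRIC_ROLES)
for role, metrics in menu.items():
    print(role)
    for metric in metrics:
        print("  " + metric)
```

The point of the structure is the first level: an engineer picks from four familiar roles before ever seeing a metric name, so the first choice needs no prior knowledge of the answer.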

Impact

~30% faster. ~$1.4M saved.

~30%

Faster diagnosis time

~$1.4M

Estimated annual savings from shorter outages

Engineers found problems about 30% faster. For companies losing $100K+ per minute during an outage, that translates to an estimated ~$1.4M in annual savings from shorter outages.
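As a rough sanity check on scale, the quoted figures can be related with back-of-envelope arithmetic. This is a sketch under stated assumptions, not the method used to produce the estimate. It assumes the $108K per-minute loss from the Context section applies, that the savings come entirely from shorter diagnosis, and that "30% faster" means diagnosis time dropped by 30%.

```python
# Figures quoted in the case study.
LOSS_PER_MINUTE_USD = 108_000    # "Lost per minute" (Context section)
ANNUAL_SAVINGS_USD = 1_400_000   # "~$1.4M" estimated annual savings
DIAGNOSIS_REDUCTION = 0.30       # "~30%" faster diagnosis, read as 30% less time


def implied_minutes_saved(annual_savings, loss_per_minute):
    """Outage minutes per year that would need to be avoided to save this much."""
    return annual_savings / loss_per_minute


def implied_baseline_minutes(minutes_saved, reduction):
    """Yearly diagnosis minutes before the redesign, if all savings come from the reduction."""
    return minutes_saved / reduction


minutes_saved = implied_minutes_saved(ANNUAL_SAVINGS_USD, LOSS_PER_MINUTE_USD)
baseline = implied_baseline_minutes(minutes_saved, DIAGNOSIS_REDUCTION)

print(f"Implied outage minutes avoided per year: {minutes_saved:.1f}")
print(f"Implied baseline diagnosis minutes per year: {baseline:.1f}")
```

Under these assumptions, the estimate corresponds to roughly 13 fewer outage minutes per year for a single customer at that loss rate, which shows how little recovered time it takes to reach seven figures.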

But the real win wasn't the number. It was the confidence. Engineers stopped opening five tabs and guessing. They opened one screen and acted.

More data wasn't the answer. Clarity was.

Reflection

What I learned

Engineers are rarely slow because they lack data. They're slow because systems fail to surface the right signal first.

I didn't add features. I removed everything that didn't help someone act in the first 30 seconds.