Splunk — Case Study
They weren't lacking data. They were lacking clarity.
How I helped engineers find answers — not data — during live outages.
~30%
Faster diagnosis
~$1.4M
Estimated annual savings
3 min
Time to first action
Context
Every minute costs money.
Imagine: United.com crashes on Thanksgiving. Engineers open Splunk.
50+
Dashboards to check
10+
Tabs open at once
$108K
Lost per minute
When I inherited this dashboard, I saw the same panic in every outage. Engineers weren't slow because they were bad at their jobs. They were slow because the system made them hunt for answers.
They weren't missing data. They were drowning in it.
Before
The Scavenger Hunt
“Check Tab 1... not here. Tab 2... maybe? Tab 3... wait, go back to Tab 1 and cross-reference...”

Average time to start fixing: 12 minutes
- 15 metrics — engineers had to already know what to look for
- 8 panels — nothing said “start here”
- And this was just one tab.

Process
3 months. No clean slate. Ship anyway.
I led design on this project, partnering with a principal designer from another team — we shared a layout system across products. No time for large-scale research. So I mined 2 years of support tickets and talked to engineers who'd lived through real outages.
Engineers don't need more data. They need less noise.
Insight
From library to control tower.
The pattern in the tickets and interviews was clear: engineers weren't slow because they lacked information. They were slow because the system didn't tell them where to start.
The question wasn't “how do we show more?” It was “how do we show less — but the right things first?”
After
The Control Tower

Time to first action: 3 minutes
- Health check at the top — Page views and Duration tell you immediately if something's wrong
- URLs ranked by impact — see which pages are affected
- Browser/OS breakdown grouped together — no more tab switching

The top of the screen answers: is something wrong? The middle answers: where? The bottom lets you drill in if you need to.
Dropdown
I changed the starting question.
The dashboard wasn't the only problem. Even finding the right metric was a guessing game.

Before, the dropdown was a flat list of 15 technical metrics. Engineers had to already know the answer to pick the right one. I reorganized it by role: UX, Frontend, Backend, Network. Now engineers start from what they know — their job — and narrow from there.
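As a rough illustration of the regrouping, here is a minimal Python sketch of a role-first menu next to the old flat list. The four role names come from the case study. The metrics under each role, their assignment to roles, and the function names are hypothetical placeholders, not the real dropdown contents (the real list had 15 metrics).

```python
from typing import Dict, List

# Role groups named in the case study. The metrics listed under each role
# are hypothetical placeholders, not the actual dropdown contents.
METRICS_BY_ROLE: Dict[str, List[str]] = {
    "UX": ["Page views", "Duration"],
    "Frontend": ["JavaScript errors", "Resource load time"],
    "Backend": ["Server response time", "API error rate"],
    "Network": ["DNS lookup time", "Connection time"],
}


def options_for(role: str) -> List[str]:
    """Return the metrics shown after an engineer picks their role."""
    try:
        return METRICS_BY_ROLE[role]
    except KeyError:
        raise ValueError(f"Unknown role: {role!r}") from None


def flat_list() -> List[str]:
    """The old-style menu: every metric in one undifferentiated list."""
    return [m for metrics in METRICS_BY_ROLE.values() for m in metrics]


if __name__ == "__main__":
    # Role-first: the engineer scans a handful of options instead of all of them.
    print(options_for("Backend"))
    print(len(flat_list()))
```

The point of the structure is that the first choice is something the engineer already knows (their role), so the second choice is made from a short, relevant list.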
Impact
~30% faster. ~$1.4M saved.
~30%
Faster diagnosis time
~$1.4M
Estimated annual savings from shorter outages
Engineers diagnosed problems about 30% faster. For companies losing $100K+ per minute of downtime, that translates to an estimated ~$1.4M in annual savings from shorter outages.
But the real win wasn't the number. It was the confidence. Engineers stopped opening five tabs and guessing. They opened one screen and acted.
More data wasn't the answer. Clarity was.
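The case study does not say how the ~$1.4M estimate was calculated. One way to sanity-check it against the per-minute cost quoted earlier is to ask how many minutes of outage per year it would correspond to. The sketch below is only that back-of-the-envelope division; the function name is made up for illustration, and the two inputs are the figures quoted in this case study.

```python
def implied_minutes_saved(annual_savings: float, cost_per_minute: float) -> float:
    """Outage minutes per year that a savings estimate corresponds to."""
    if cost_per_minute <= 0:
        raise ValueError("cost_per_minute must be positive")
    return annual_savings / cost_per_minute


# Figures quoted in the case study; both are estimates.
ANNUAL_SAVINGS = 1_400_000   # ~$1.4M estimated annual savings
COST_PER_MINUTE = 108_000    # $108K lost per minute

if __name__ == "__main__":
    minutes = implied_minutes_saved(ANNUAL_SAVINGS, COST_PER_MINUTE)
    print(f"Implied outage time avoided: {minutes:.1f} minutes per year")
```

At $108K per minute, the estimate corresponds to roughly 13 minutes of outage avoided per year. For comparison, cutting time to first action from 12 minutes to 3 saves 9 minutes in a single incident.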
Reflection
What I learned
Engineers are rarely slow because they lack data. They're slow because systems fail to surface the right signal first.
I didn't add features. I removed everything that didn't help someone act in the first 30 seconds.
