Radhika Sawhney Portfolio

Project

ThousandEyes App Insights

My Role

Principal Product Designer

Year

2020

Summary

Launched a new product line to detect application outages.

Overview

Application Insights is a major evolution of Cisco ThousandEyes' Internet Insights platform, expanding its scope from network-level observability to application-layer outage intelligence. It introduced real-time visibility into SaaS application health (e.g., Salesforce, M365, Slack), leveraging collective telemetry from global vantage points. This initiative wasn't just a product enhancement; it was a strategic launch that redefined how enterprises monitor and respond to third-party application outages.

As the lead designer, I drove the product experience from concept through final execution, including a new Sankey-style topology visualization, a fully redesigned global outage map, and modular embed points throughout the platform.

Impact

Introduced application-layer intelligence into Internet Insights: HTTP, DNS, SSL outages for global SaaS apps
Launched a new Sankey visualization to depict agent-to-application flow and outage root cause
Redesigned the global outage map, resulting in a 125% increase in website traffic
Launched an award-winning product line, delivered end-to-end from research to release
Embedded App Insights across the ThousandEyes platform: Dashboards, Alerts, Reports, Integrations, etc.
Public-facing outage map became a go-to reference for app outages, similar to DownDetector but powered by telemetry (see: ThousandEyes Outage Map)

Problem

Organizations previously faced challenges such as:

Delayed detection of SaaS disruptions, relying on vendor status pages or social media
No way to correlate network and application layer data
Trouble isolating root cause, especially when outages originate outside the corporate network perimeter
Limited data for vendor escalation, SLA enforcement, or vendor selection

New capabilities were required to:

Provide credible, near real-time SaaS outage detection
Visualize scale, scope, and error type (DNS, SSL, HTTP, etc.)
Enable cross-layer correlation to avoid unnecessary troubleshooting
Empower communication with stakeholders and providers via trustworthy, third‑party data

Persona

Global NOC Manager / Enterprise IT Operations Lead

Objectives:
- Oversee global availability of critical SaaS applications for distributed users
- Need fast insight into whether an issue is internal or external
- Prefer data-driven escalation tools for vendors or internal stakeholders
- Rely on dashboards and maps for situational awareness and decision‑making

Pain Points:
- Slow resolution cycles due to lack of real-time confirmation
- Complexity in coordinating response across locations and IT teams
- Skepticism about vendor reports that only surface after social sentiment rises
- Lack of triangulated data to pinpoint root cause

User Journey

Scenario: A Network Engineer observes unusual support tickets reflecting degraded app functionality.

Instead of manually verifying vendor status pages, they open the Internet Insights Overview. There, a purple node on the outage map and the application-only timeline indicate a SaaS service disruption.
Clicking through into Views → Application Outages reveals a topology (Sankey) mapping the impact from region → AS/provider → app → error type.
Armed with this insight, they communicate internally and externally with confidence, triggering alerts and sharing validated outage snapshots via API or link.
Post‑incident dashboards help evaluate SLA compliance and vendor reliability trends, and inform future vendor choices.

Information Architecture

The new IA extends the previously network-centric Internet Insights module with application-layer visibility. The Overview dashboard was redesigned to show both outage types (application and network) in a unified timeline and map. Navigation was extended under Internet Insights → Views with an Application Outages view that includes timeline filters plus Topology (Sankey flow), Map, and Table tabs. Alerts are fully configurable, with filters by application category, geography, provider, or severity. Reports and dashboards also integrate application outage metrics, with shareable snapshots and direct links to Internet Insights dashboards.

IA Changes

Overview Tab:
- Outage counts (total, application, network).
- Timeline of outage alerts over the past 24 hours.
- Geographic map with purple (application) and red (network) markers.
Views Tab → Application Outages:
- Timeline: filters by app, domain, error type, and geography, with a toggle to show only your impacted tests.
- Topology tab: Sankey-style flows (Agent country → Application → AS/provider → Error type).
- Map tab: geographic distribution of outages by test location.
- Table tab: detailed error type breakdown, % of agents affected, and affected servers.
Alerts, Dashboards & Sharing:
- Alert rules for application or network outages by severity, geography, or app category, delivered via webhooks or email (e.g., Slack, ServiceNow).
- Dashboard and Report widgets show application outages.
- Saved snapshots of outages can be shared for internal and external communication.
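
To make the Topology tab's column order concrete, here is a minimal Python sketch of how affected tests could be rolled up into weighted Sankey links. It is an illustration only, not the shipped implementation: the `AffectedTest` record, its field names, and the sample values (including the application name and the documentation-range AS numbers) are all assumptions.

```python
from collections import Counter
from typing import NamedTuple


class AffectedTest(NamedTuple):
    """One affected test. All field names here are hypothetical."""
    agent_country: str
    application: str
    provider: str
    error_type: str


def build_sankey_links(records):
    """Count affected tests flowing between each adjacent pair of columns.

    Each node is a (column, label) tuple so that identical labels in
    different columns never get merged into one node.
    """
    links = Counter()
    for r in records:
        stages = [
            ("country", r.agent_country),
            ("application", r.application),
            ("provider", r.provider),
            ("error", r.error_type),
        ]
        # Pair each stage with the next one: country->app, app->provider, ...
        for source, target in zip(stages, stages[1:]):
            links[(source, target)] += 1
    return links


# Made-up sample data for illustration.
records = [
    AffectedTest("US", "ExampleCRM", "AS64500", "HTTP"),
    AffectedTest("US", "ExampleCRM", "AS64500", "HTTP"),
    AffectedTest("DE", "ExampleCRM", "AS64501", "DNS"),
]
links = build_sankey_links(records)
for (source, target), weight in sorted(links.items()):
    print(f"{source[1]} -> {target[1]}: {weight}")
```

Link weight here is simply the number of affected tests, which is one plausible way to size the flows; the real product may weight them differently.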

Design

Designing the Sankey visualization for Application Outages was an iterative, collaborative, and data-driven process. The goal was to create an intuitive way for users to see where an outage started, which nodes were affected, and where it ended, even when dealing with complex, large-scale datasets.

The process unfolded in multiple phases, from initial sketches to high-fidelity prototypes that were integrated with the ThousandEyes design system and tested against real production data.

Defining the Use Case

The Sankey diagram needed to show start and end servers for an outage, visualize intermediate network/application nodes affected, and scale from a handful of flows to hundreds of nodes without overwhelming the viewer.
Feedback from Network Engineers and NOC Managers revealed:
- They needed both summary and drill-down views.
- Color coding and grouping were critical for fast triage.

From Ideation to Final Design

Initial Sketches: Focused on a simple 3-column flow (start → affected node(s) → end).
Mid-fidelity prototypes: Added node grouping logic and clustering for heavy data sets.
High-fidelity final: Polished with the ThousandEyes design system’s typography, spacing, and interaction patterns.
Final integration: Shipped with Overview and Application Outages View, fully connected to real-time outage data.

Overview Map

Views: Sankey Visualization

Execution

Design System Specs & Alignment

I worked closely with our design system to ensure the Sankey visualization aligned with ThousandEyes’ pattern library. I extended the system with new patterns for network path visualizations, making sure colors met accessibility contrast standards. Interactive states (hover, click, and selection) were designed to match the rest of the platform, and I documented these new components so they could be reused in future projects.

Colors: Worked with the design system to select a palette that:
- Distinguishes application outages (purple) from network outages (red).
- Provides enough contrast for accessibility (WCAG AA).
Interactions:
- Hover: Show node details (latency, error type, affected agents).
- Click: Filter the entire diagram to just that node’s flows.
- Drag/Pan: Navigate larger datasets smoothly.
Performance:
- Optimized rendering for large datasets by using canvas-based drawing instead of SVG.
Labels & Tooltips:
- Added truncation + full text on hover for long provider or location names.
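
The WCAG AA check mentioned under Colors can be verified programmatically. The sketch below follows the WCAG 2.x definitions of relative luminance and contrast ratio; the hex values in the usage lines are placeholders, not the actual ThousandEyes palette.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light, per WCAG 2.x.

    WCAG 2.x states the threshold as 0.03928; for 8-bit channel values this
    gives the same result as the sRGB standard's 0.04045.
    """
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#RRGGBB' color (0.0 = black, 1.0 = white)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(color_a: str, color_b: str) -> float:
    """WCAG contrast ratio between two colors, from 1.0 up to 21.0."""
    lighter, darker = sorted(
        (relative_luminance(color_a), relative_luminance(color_b)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(foreground: str, background: str, large_or_graphic: bool = False) -> bool:
    """AA needs 4.5:1 for normal text, 3:1 for large text and graphical objects."""
    threshold = 3.0 if large_or_graphic else 4.5
    return contrast_ratio(foreground, background) >= threshold


# Placeholder colors for illustration only.
HYPOTHETICAL_PURPLE = "#6B2FA0"
WHITE = "#FFFFFF"
print(meets_aa(HYPOTHETICAL_PURPLE, WHITE))
print(meets_aa(HYPOTHETICAL_PURPLE, WHITE, large_or_graphic=True))
```

For Sankey nodes and map markers, the 3:1 graphical-object threshold is usually the relevant one, while labels and tooltips fall under the 4.5:1 text threshold.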

Building & Testing Visualizations

We partnered with engineering to build a high-performance Sankey diagram capable of handling large datasets. I collaborated on integrating dynamic filtering, node grouping, and responsive layouts, then tested with simulated and real outage data. These tests confirmed the diagram scaled from a handful to hundreds of nodes while remaining clear and responsive.

Iterating with Data

Collaboration with Engineering & Data Science:

Partnered with data science to define data aggregation rules for different outage volumes.
Engineering created mock APIs to simulate varying node and path counts.
Tested with low volume (5–10 paths), medium volume, and high volume (100+ paths).
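
The actual aggregation rules are not described here, so the following is only a hypothetical example of the kind of rule that keeps a high-volume Sankey readable: keep the heaviest nodes in a column and fold the rest into a single "Other" node. The function name, the `max_nodes` parameter, and the sample data are all assumptions.

```python
def group_small_nodes(weights: dict, max_nodes: int, other_label: str = "Other") -> dict:
    """Keep the heaviest nodes and merge the remainder into one bucket.

    `weights` maps a node label to its flow weight. At most `max_nodes`
    entries are returned, and the total weight is preserved.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    if len(weights) <= max_nodes:
        return dict(weights)  # Low volume: show every node as-is.

    # Sort by weight (descending), then label, for a stable ordering.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    kept = ranked[: max_nodes - 1]   # Leave one slot for the bucket.
    merged = ranked[max_nodes - 1:]

    grouped = dict(kept)
    grouped[other_label] = sum(weight for _, weight in merged)
    return grouped


# Made-up provider weights (documentation-range AS numbers).
providers = {"AS64500": 40, "AS64501": 25, "AS64502": 3, "AS64503": 2, "AS64504": 1}
print(group_small_nodes(providers, max_nodes=3))
print(group_small_nodes(providers, max_nodes=10))
```

A rule like this trades detail for legibility, which is why a drill-down path back to the merged nodes matters for the summary-plus-detail need described earlier.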

Testing with Users

We ran a closed beta with select enterprise customers and internal NOC teams. I observed sessions, conducted interviews, and reviewed telemetry to understand how users worked under time pressure. Based on feedback, we refined node spacing, reordered filters for quicker access, and enhanced tooltips to highlight the most critical details. This beta phase validated the design and built momentum for launch.

Ran iterations with internal users and beta customers:

Scenario 1: Isolating a regional issue
- User needs to quickly identify if a SaaS outage impacts all regions or just one geography.
- Improvement: Added region filters for the Sankey
Scenario 2: Paths spanning multiple providers
- Paths cross multiple ASNs/providers.
- Improvement: Introduced provider grouping for better pattern recognition.
Scenario 3: SLA evidence gathering
- Need to export or snapshot for vendor escalation.
- Improvement: Added download/export with snapshot state saved.

Related Projects

Quid Apps

Launched a new 'lite version' of Quid based on highly repeatable use cases.

Quid Natural Language Search

Enabling users to search in natural language in the pre-GPT era.

Quid Faceted Search

Redesigning the search experience to help users find relevant documents.