ThousandEyes App Insights

Project

ThousandEyes App Insights

My Role

Principal Product Designer

Category

Data Visualization

Year

2020

Summary

Launched a new product line to detect application outages.

Overview

Application Insights is a major evolution of Cisco ThousandEyes' Internet Insights platform, expanding its scope from network-level observability to application-layer outage intelligence. It introduced real-time visibility into SaaS application health (e.g., Salesforce, M365, Slack), drawing on collective telemetry from global vantage points. This initiative wasn't just a product enhancement; it was a strategic launch that redefined how enterprises monitor and respond to third-party application outages.

As the lead designer, I drove the product experience from concept through final execution, including a new Sankey-style topology visualization, a fully redesigned global outage map, and modular embed points throughout the platform.

Impact

  • Introduced application-layer intelligence into Internet Insights: HTTP, DNS, SSL outages for global SaaS apps
  • Launched a new Sankey visualization to depict agent-to-application flow and outage root cause
  • Redesigned the global outage map, resulting in a 125% increase in website traffic
  • Launched an award-winning product line, delivered end-to-end from research to release
  • Embedded App Insights across the ThousandEyes platform: Dashboards, Alerts, Reports, Integrations, etc.
  • Public-facing outage map became a go-to reference for app outages: similar to DownDetector, but powered by telemetry (see: ThousandEyes Outage Map).

Problem

Organizations previously faced challenges such as:

  • Delayed detection of SaaS disruptions, relying on vendor status pages or social media
  • No way to correlate network and application layer data
  • Trouble isolating root cause, especially when outages originate outside the corporate network perimeter
  • Limited data for vendor escalation, SLA enforcement, or vendor selection

New capabilities were required to:

  • Provide credible, near real-time SaaS outage detection
  • Visualize scale, scope, and error type (DNS, SSL, HTTP, etc.)
  • Enable cross-layer correlation to avoid unnecessary troubleshooting
  • Empower communication with stakeholders and providers via trustworthy, third‑party data

Persona

Global NOC Manager / Enterprise IT Operations Lead

Objectives

  • Oversee global availability of critical SaaS applications for distributed users
  • Need fast insight: internal vs external issue
  • Prefer data-driven escalation tools for vendors or internal stakeholders
  • Rely on dashboards and maps for situational awareness and decision‑making

Pain Points

  • Slow resolution cycles due to lack of real-time confirmation
  • Complexity in coordinating response across locations and IT teams
  • Skepticism about vendor reports that only surface after social sentiment rises
  • Lack of triangulated data to pinpoint root cause

User Journey

Scenario: A Network Engineer observes unusual support tickets reflecting degraded app functionality.

  • Instead of manually verifying vendor status pages, they open Internet Insights Overview. There, a purple node on the outage map and the application-only timeline indicate a SaaS service disruption.
  • Clicking through into Views → Application Outages reveals a topology (Sankey) mapping the impact from region → AS/provider → app → error type.
  • Armed with this insight, they communicate internally and externally with confidence, triggering alerts and sharing validated outage snapshots via API or link.
  • Post‑incident dashboards help evaluate SLA compliance and vendor reliability trends, and inform future vendor choices.

Information Architecture

The new IA extends the previously network-centric Internet Insights module to cover both the network and application layers. The Overview dashboard was redesigned to show two outage types (application and network) in a unified timeline and map. Navigation was extended under Internet Insights → Views, adding an Application Outages view that includes timeline filters plus topology (Sankey flow), map, and table tabs. Alerts are fully configurable with filters by application category, geography, provider, or severity. Reports and dashboards also integrate application outage metrics, with shareable snapshots and direct links to Internet Insights dashboards.
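
To make the alert filtering concrete, here is a minimal sketch of how a rule with optional filters could be matched against an outage. It is an illustration only, not the ThousandEyes implementation: the `Outage` and `AlertRule` types, their field names, and the three severity levels are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical severity levels, ordered from least to most severe.
SEVERITY_ORDER = {"minor": 0, "major": 1, "critical": 2}


@dataclass(frozen=True)
class Outage:
    kind: str          # "application" or "network"
    app_category: str
    country: str
    provider: str
    severity: str


@dataclass(frozen=True)
class AlertRule:
    kind: str
    # A filter left as None matches any value.
    app_categories: Optional[FrozenSet[str]] = None
    countries: Optional[FrozenSet[str]] = None
    providers: Optional[FrozenSet[str]] = None
    min_severity: str = "minor"

    def matches(self, outage: Outage) -> bool:
        """Return True when the outage passes every configured filter."""
        if outage.kind != self.kind:
            return False
        for allowed, value in (
            (self.app_categories, outage.app_category),
            (self.countries, outage.country),
            (self.providers, outage.provider),
        ):
            if allowed is not None and value not in allowed:
                return False
        return SEVERITY_ORDER[outage.severity] >= SEVERITY_ORDER[self.min_severity]


# Example: alert only on major-or-worse outages of collaboration apps.
rule = AlertRule(
    kind="application",
    app_categories=frozenset({"collaboration"}),
    min_severity="major",
)
outage = Outage("application", "collaboration", "DE", "ExampleNet", "critical")
print(rule.matches(outage))
```

Treating an unset filter as "match anything" keeps a simple rule short while still allowing narrow rules for a specific geography or provider.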

IA Changes

  • Overview Tab:
    • Outage counts (total, application, network).
    • Timeline of outage alerts over past 24h.
    • Geographic map with purple (application) & red (network) markers
  • Views Tab → Application Outages:
    • Timeline: filters by app, domain, error type, geography, with toggle to show only your impacted tests
    • Topology tab: Sankey-style flows (Agent country → Application → AS/provider → Error type)
    • Map tab: geographic distribution of outages by test locations
    • Table tab: detailed error type breakdown, % agents affected, affected servers
  • Alerts, Dashboards & Sharing:
    • Allows creation of alert rules for application or network outages by severity, geography, app category, using webhooks/email (e.g., Slack, ServiceNow)
    • Dashboards and Reports widgets show application outages
    • Save and share snapshots of outages for internal and external communication
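
As a rough illustration of the data shape behind the Topology tab, the sketch below turns per-agent outage records into weighted Sankey links, using the stage order listed for that tab. The record field names, the sample application name, and the AS numbers (taken from the documentation-reserved range) are assumptions for the example, not the product's schema.

```python
from collections import Counter

# Stage order as listed for the Topology tab; field names are hypothetical.
STAGES = ("agent_country", "application", "provider", "error_type")


def build_sankey_links(records):
    """Count affected agents flowing between each pair of adjacent stages."""
    links = Counter()
    for record in records:
        for src_stage, dst_stage in zip(STAGES, STAGES[1:]):
            source = (src_stage, record[src_stage])
            target = (dst_stage, record[dst_stage])
            links[(source, target)] += 1
    return links


# Made-up records: one per affected agent.
records = [
    {"agent_country": "US", "application": "ExampleApp",
     "provider": "AS64500", "error_type": "HTTP"},
    {"agent_country": "US", "application": "ExampleApp",
     "provider": "AS64500", "error_type": "DNS"},
    {"agent_country": "DE", "application": "ExampleApp",
     "provider": "AS64501", "error_type": "HTTP"},
]

links = build_sankey_links(records)
for (source, target), weight in sorted(links.items()):
    print(source, "->", target, weight)
```

Keying each node by its stage as well as its label keeps a value that appears in two columns from being merged into a single node.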

Design

Designing the Sankey visualization for Application Outages was an iterative, collaborative, and data-driven process. The goal was to create an intuitive way for users to see where an outage started, which nodes were affected, and where it ended, even when dealing with complex, large-scale datasets.

The process unfolded in multiple phases, moving from initial sketches to high-fidelity prototypes integrated with the ThousandEyes design system and tested against real production data.

Defining the Use Case

  • The Sankey diagram needed to show start and end servers for an outage, visualize intermediate network/application nodes affected, and scale from a handful of flows to hundreds of nodes without overwhelming the viewer.
  • Research with Network Engineers and NOC Managers revealed:
    • They needed both summary and drill-down views.
    • Color coding and grouping were critical for fast triage.

From Ideation to Final Design

  • Initial Sketches: Focused on a simple 3-column flow (start → affected node(s) → end).
  • Mid-fidelity prototypes: Added node grouping logic and clustering for heavy data sets.
  • High-fidelity final: Polished with the ThousandEyes design system’s typography, spacing, and interaction patterns.
  • Final integration: Shipped with Overview and Application Outages View, fully connected to real-time outage data.

Overview Map

Figures: Overview map (three views) and Overview hover state.

Views: Sankey Visualization

Figures: Sankey visualization (three views) and PathViz.

Execution

Design System Specs & Alignment

I worked closely with our design system team to ensure the Sankey visualization aligned with ThousandEyes’ pattern library. I extended the system with new patterns for network path visualizations, making sure colors met accessibility contrast standards. Interactive states (hover, click, and selection) were designed to match the rest of the platform, and I documented these new components so they could be reused in future projects.

  • Colors: Worked with the design system team to select a palette that:
    • Distinguishes application outages (purple) from network outages (red).
    • Provides enough contrast for accessibility (WCAG AA).
  • Interactions:
    • Hover: Show node details (latency, error type, affected agents).
    • Click: Filter the entire diagram to just that node’s flows.
    • Drag/Pan: Navigate larger datasets smoothly.
  • Performance:
    • Optimized rendering for large datasets by using canvas-based drawing instead of SVG.
  • Labels & Tooltips:
    • Truncated long provider or location names, with the full text shown on hover.
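
The WCAG AA check mentioned above can be sketched as follows. The luminance and contrast formulas are the ones defined by WCAG 2.x; the hex value used for the purple is a made-up example, not the actual ThousandEyes palette.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#RRGGBB'."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def meets_aa(foreground: str, background: str, large_or_graphic: bool = False) -> bool:
    """AA requires 4.5:1 for normal text, 3:1 for large text and graphical objects."""
    return contrast_ratio(foreground, background) >= (3.0 if large_or_graphic else 4.5)


# Hypothetical application-outage purple on a white background.
print(round(contrast_ratio("#6B2FA0", "#FFFFFF"), 2))
print(meets_aa("#6B2FA0", "#FFFFFF"))
```

For Sankey nodes and map markers the 3:1 threshold for graphical objects is the relevant one, while labels and tooltips fall under the 4.5:1 text requirement.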

Building & Testing Visualizations

We partnered with engineering to build a high-performance Sankey diagram capable of handling large datasets. I collaborated on integrating dynamic filtering, node grouping, and responsive layouts, then tested with simulated and real outage data. These tests confirmed the diagram scaled from a handful to hundreds of nodes while remaining clear and responsive.

Iterating with Data

Collaboration with Engineering & Data Science:

  • Partnered with data science to define data aggregation rules for different outage volumes.
  • Engineering created mock APIs to simulate varying node and path counts.
  • Tested at low volume (5–10 paths), medium volume, and high volume (100+ paths).

Figures: Sankey at low volume and at high volume.
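
One simple aggregation rule for high-volume outages is to keep the largest flows and merge the remainder into a single bucket per source. The sketch below shows that idea only; the `max_visible` threshold and the "Other" label are assumptions, and the rules actually defined with data science may have differed.

```python
def collapse_small_flows(flows, max_visible=20, other_label="Other"):
    """Keep the heaviest flows and merge the rest into one bucket per source.

    flows maps (source, target) -> weight. Total weight is preserved, and
    low-volume inputs are returned unchanged. A real target that happens to
    be named like other_label is not handled in this sketch.
    """
    if len(flows) <= max_visible:
        return dict(flows)
    ranked = sorted(flows.items(), key=lambda item: item[1], reverse=True)
    collapsed = dict(ranked[:max_visible])
    for (source, _target), weight in ranked[max_visible:]:
        key = (source, other_label)
        collapsed[key] = collapsed.get(key, 0) + weight
    return collapsed


# Made-up data: ten flows from one country with weights 1 through 10.
flows = {("US", f"AS{64500 + i}"): i + 1 for i in range(10)}
result = collapse_small_flows(flows, max_visible=3)
for link, weight in sorted(result.items(), key=lambda item: -item[1]):
    print(link, weight)
```

Because the merged bucket carries the summed weight, the overall width of the diagram stays accurate even when individual small flows are hidden.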

Testing with Users

We ran a closed beta with select enterprise customers and internal NOC teams. I observed sessions, conducted interviews, and reviewed telemetry to understand how users worked under time pressure. Based on feedback, we refined node spacing, reordered filters for quicker access, and enhanced tooltips to highlight the most critical details. This beta phase validated the design and built momentum for launch.

We ran iterations with internal users and beta customers:

  • Scenario 1: Isolating a regional issue
    • User needs to quickly identify if a SaaS outage impacts all regions or just one geography.
    • Improvement: Added region filters for the Sankey
  • Scenario 2: Outages spanning multiple providers
    • Paths cross multiple ASNs/providers.
    • Improvement: Introduced provider grouping for better pattern recognition.
  • Scenario 3: SLA evidence gathering
    • Need to export or snapshot for vendor escalation.
    • Improvement: Added download/export with snapshot state saved.
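
The regional question in Scenario 1 comes down to comparing the share of affected agents per region. A minimal sketch, assuming one (region, affected) result per agent and a hypothetical 50% threshold for calling a region impacted:

```python
from collections import defaultdict


def affected_share_by_region(agent_results):
    """agent_results: iterable of (region, is_affected) pairs, one per agent."""
    totals = defaultdict(int)
    affected = defaultdict(int)
    for region, is_affected in agent_results:
        totals[region] += 1
        if is_affected:
            affected[region] += 1
    return {region: affected[region] / totals[region] for region in totals}


def outage_scope(agent_results, threshold=0.5):
    """Classify an outage as 'none', 'regional', or 'global'.

    A region counts as impacted when its affected share reaches the
    threshold; the 0.5 default is an assumption for illustration.
    """
    shares = affected_share_by_region(agent_results)
    impacted = sorted(region for region, share in shares.items() if share >= threshold)
    if not impacted:
        return "none", impacted
    if len(impacted) == len(shares):
        return "global", impacted
    return "regional", impacted


# Made-up results: EMEA agents fail, other regions are mostly healthy.
results = [
    ("EMEA", True), ("EMEA", True),
    ("APAC", False),
    ("AMER", False), ("AMER", False), ("AMER", True),
]
print(outage_scope(results))
```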