Visualizing Data Timeliness at Airbnb

by Chris Williams, Ken Chen, Krist Wongsuphasawat, and Sylvia Tomiyama

Imagine you are a business leader ready to start your day, but you wake up to find that your daily business report is empty — the data is late, so now you are blind.

Over the last year, multiple teams came together to build SLA Tracker, a visual analytics tool to facilitate a culture of data timeliness at Airbnb. This data product enabled us to address and systematize the following challenges of data timeliness:

  1. When should a dataset be considered late?
  2. How frequently are datasets late?
  3. Why is a dataset late?

This project is a critical part of our efforts to achieve high data quality, and building it required overcoming many technical, product, and organizational challenges. In this article, we focus on the product design: the journey of how we designed and built data visualizations that could make sense of the deeply complex data behind data timeliness.

Yes, Data Can Be Late

To avoid blinding the business, it is critical to deliver data in a timely manner. However, this can be difficult to do because the journey from data collection to final data output typically requires many steps. At Airbnb — and anywhere with large-scale data processing pipelines — “raw” datasets are cleaned up, merged, and transformed into structured data. Structured data then powers product features and enables analytics to inform business decisions.

To ensure timeliness of the final output data, we aim to have owners of each intermediate step commit to Service Level Agreements (SLAs) for the availability of their data by a certain time. For example, the dataset owner promises that the “bookings” metric will have the latest data by 5 AM UTC, and if it is not available by this time, it is considered “late.”
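
As a minimal sketch of this definition (not the actual SLA Tracker implementation), lateness can be expressed as a comparison between a landing timestamp and a daily deadline. The `Sla` shape, the function names, and the dates below are assumptions made for illustration.

```typescript
// Hypothetical SLA shape: a daily deadline expressed as minutes after midnight UTC.
interface Sla {
  dataset: string;
  deadlineMinutesUtc: number; // e.g. 5 AM UTC = 300
}

// Deadline (epoch milliseconds) for the UTC day on which the data is due.
function slaDeadline(sla: Sla, dueDate: string): number {
  const midnightUtc = Date.parse(dueDate + "T00:00:00Z");
  return midnightUtc + sla.deadlineMinutesUtc * 60 * 1000;
}

// A dataset is "late" if it has not landed by the deadline.
// landedAt is null while the data is still missing.
function isLate(sla: Sla, dueDate: string, landedAt: number | null, now: number): boolean {
  const deadline = slaDeadline(sla, dueDate);
  if (landedAt === null) {
    return now > deadline; // still missing: late once the deadline has passed
  }
  return landedAt > deadline;
}

const bookingsSla: Sla = { dataset: "bookings", deadlineMinutesUtc: 5 * 60 };
const landedAt = Date.parse("2021-06-01T05:20:00Z");
console.log(isLate(bookingsSla, "2021-06-01", landedAt, landedAt)); // true
```

Passing `now` explicitly keeps the check deterministic, which makes it easy to evaluate both live and historical data with the same function.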

How Often Are My Datasets Late?

As a first step, we set out to enable data producers to understand when data is landing and how frequently they meet SLAs in the Report view (Figure 1). In this view, producers can track real-time and historical trends across multiple datasets they own or care about. We also ensured that producers get value even when there is no formal SLA set by surfacing typical landing times. No SLAs were set when we first launched the tool, and SLAs may be unneeded for datasets that are not widely consumed.

Figure 1 SLA Report view provides a high-level overview of SLA performance across lists of datasets. Each row includes a status indicator of the latest data partition and bar charts that display historical landing times (red bars show the days when datasets missed SLAs).

The Report view makes use of traditional lists of data entities, with embedded small visuals that concisely summarize typical and historical landing time data. Data producers can organize their datasets across lists and collaborate on lists with others (e.g., their team).

With this data-rich summary, understanding landing times and SLA performance became as simple as curating a list of datasets.
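
The per-row summary described above can be sketched roughly as follows. This is an illustrative approximation, not the tool's actual logic: it assumes landing times are stored as minutes after midnight UTC, uses the median as the "typical" landing time, and counts a day with no landing as a miss. All names are hypothetical.

```typescript
// One entry per day: landing time in minutes after midnight UTC, or null if the data never landed.
type LandingHistory = Array<number | null>;

interface ReportRow {
  typicalLandingMinutes: number | null; // median landing time (assumed definition of "typical")
  slaMetRate: number | null;            // null when no SLA is set
  missedDays: number[];                 // indices of days that missed the SLA
}

function median(values: number[]): number | null {
  if (values.length === 0) return null;
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function summarizeLandings(history: LandingHistory, slaMinutes: number | null): ReportRow {
  const landed = history.filter((m): m is number => m !== null);
  const missedDays: number[] = [];
  if (slaMinutes !== null) {
    for (let day = 0; day < history.length; day++) {
      const m = history[day];
      // A day counts as a miss if the data never landed or landed after the SLA.
      if (m === null || m > slaMinutes) missedDays.push(day);
    }
  }
  const slaMetRate =
    slaMinutes === null || history.length === 0
      ? null
      : (history.length - missedDays.length) / history.length;
  return { typicalLandingMinutes: median(landed), slaMetRate, missedDays };
}

// Five days of history against a 5 AM UTC (300 minute) SLA.
console.log(summarizeLandings([290, 310, 295, null, 280], 300));
```

Because the typical landing time is computed independently of the SLA, a row still carries useful information when `slaMinutes` is `null`, which mirrors the point that producers get value before any formal SLA exists.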

Reporting Is the Tip of the Iceberg

While the Report view dramatically simplified understanding whether a dataset was late, it did not address two major challenges of SLAs:

  • What is a reasonable SLA for a dataset?
  • When a dataset is late, how do you understand why?

These questions are challenging because datasets are not independent from each other. Instead, datasets are derived stepwise in a specific sequence, where one or more transformations must happen before another (Figure 2).

Figure 2 An example of data lineage for creating dataset “A.” “A” depends on “B,” which depends on “C” and “D” and so on.

Thus, the availability of one dataset is intrinsically linked to a complex hierarchical “lineage” of data ancestors. To set a realistic SLA for a dataset, one has to take into account its entire dependency tree, which sometimes comprises hundreds of entities, along with their SLAs.
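
To make the dependency reasoning concrete, here is a small sketch that walks a lineage like the one in Figure 2. It collects every ancestor of a dataset and estimates the earliest time the dataset could land if each step starts as soon as its slowest parent finishes. The data shape, the runtimes, and the "earliest landing" heuristic are assumptions for illustration, not how SLA Tracker computes or recommends SLAs.

```typescript
interface DatasetNode {
  parents: string[];             // direct upstream dependencies
  typicalRuntimeMinutes: number; // hypothetical typical job duration
}
type Lineage = Record<string, DatasetNode>;

// All datasets that must land before `dataset` can be produced.
function ancestorsOf(lineage: Lineage, dataset: string): string[] {
  const seen: Record<string, boolean> = {};
  const stack = lineage[dataset].parents.slice();
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (seen[current]) continue;
    seen[current] = true;
    const node = lineage[current];
    if (node) {
      for (const parent of node.parents) stack.push(parent);
    }
  }
  return Object.keys(seen).sort();
}

// Earliest landing time in minutes after the pipeline starts, assuming each job
// starts the moment its last parent lands (a lower bound for a realistic SLA).
function earliestLanding(lineage: Lineage, dataset: string, memo: Record<string, number> = {}): number {
  if (memo[dataset] !== undefined) return memo[dataset];
  const node = lineage[dataset];
  let readyAt = 0; // datasets without parents can start immediately
  for (const parent of node.parents) {
    readyAt = Math.max(readyAt, earliestLanding(lineage, parent, memo));
  }
  const landing = readyAt + node.typicalRuntimeMinutes;
  memo[dataset] = landing;
  return landing;
}

// "A" depends on "B", which depends on "C" and "D" (runtimes are made up).
const exampleLineage: Lineage = {
  A: { parents: ["B"], typicalRuntimeMinutes: 30 },
  B: { parents: ["C", "D"], typicalRuntimeMinutes: 20 },
  C: { parents: [], typicalRuntimeMinutes: 60 },
  D: { parents: [], typicalRuntimeMinutes: 45 },
};
console.log(ancestorsOf(exampleLineage, "A"));     // [ 'B', 'C', 'D' ]
console.log(earliestLanding(exampleLineage, "A")); // 110
```

Even in this four-node example, the SLA for "A" cannot sensibly be earlier than 110 minutes after the pipeline starts, because "C" alone takes 60 minutes before "B" can begin.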

To add to the complexity, when things go wrong, trying to match up the hierarchical dependencies with the temporal sequence makes SLA misses hard to reason about without visual aid. Existing tooling at Airbnb enabled data engineers to identify problems within their own data pipeline, but it was far more difficult to do this across pipelines, which are often owned by different teams.

Why Is My Dataset Late?

To enable data producers to identify the root cause(s) of an SLA hit or miss across data pipelines, and to set realistic SLAs by taking into account full data lineage, we designed the Lineage view.

Early Design Attempt

To be successful, the Lineage view needed to enable data producers to reason about both dataset dependencies and the timelines of those dependencies. Since data lineages can include tens to hundreds of tables, each with 30 days of historical data, SLAs, and relationships between them all, we needed to concisely represent on the order of thousands to tens of thousands of individual data points.

In our initial explorations, we heavily emphasized lineage over landing time sequence (Figure 3). Although this design made dependencies easy to understand for small lineages, it failed to highlight which dependencies caused delays in the overall pipeline on a given run, and it made it difficult to understand where time was spent in producing the dataset overall.

Figure 3 Early exploration with emphasis on dataset lineage. Each box shows historical landing times for each dataset in the larger data pipeline.

Focus On Time With the Timeline View

We then pivoted to emphasizing temporal sequence over lineage. To do this, we designed a dependency-inclusive Gantt chart (Figure 4) with the following features:

  • Each row represents a dataset in the lineage, with the “final” dataset of interest at the top.
  • Each dataset has a horizontal bar representing the start, duration, and end time for its data processing job on the date or time selected.
  • If a dataset has an SLA, it is indicated with a vertical line.
  • Distributions of typical start and end times are marked to help data producers evaluate if the data processing job is ahead of schedule, or behind, putting its downstream datasets at risk.
  • Arcs are drawn between parent and child datasets so data producers can trace the lineage and see if delays are caused by upstream dependencies.
  • Emphasized arcs represent the most important “bottleneck” path (described below).

Figure 4 The Timeline view gives a clear sense of sequence and the duration of data transformations, while preserving important hierarchical dependencies which provide sequence context. Historical landing times are displayed for each dataset row, to the left of the current run.

With this design, it became easy to find the problematic step — often the long, red bar — or to identify system-wide delays, where all steps just took longer than usual (lots of yellow bars, each past their typical landing time). This visualization is used by many teams today at Airbnb to debug data delays.
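
The color coding described above can be approximated with a simple classification of each bar. This is a sketch under assumptions: the status names, the single `typicalEndMinutes` threshold (the real view shows distributions), and the 50% cutoff for a "system-wide" delay are all hypothetical.

```typescript
type BarStatus = "missed-sla" | "later-than-typical" | "on-track";

interface RunBar {
  dataset: string;
  startMinutes: number;      // job start, minutes after midnight UTC
  endMinutes: number;        // job end (landing time)
  typicalEndMinutes: number; // assumed single "typical" landing time
  slaMinutes: number | null; // null when the dataset has no SLA
}

// Red in the chart: landed after the SLA. Yellow: landed after its typical time.
function barStatus(run: RunBar): BarStatus {
  if (run.slaMinutes !== null && run.endMinutes > run.slaMinutes) return "missed-sla";
  if (run.endMinutes > run.typicalEndMinutes) return "later-than-typical";
  return "on-track";
}

// Heuristic: call the delay system-wide when most steps landed later than usual.
function isSystemWideDelay(runs: RunBar[], minShare: number = 0.5): boolean {
  if (runs.length === 0) return false;
  const delayed = runs.filter((r) => barStatus(r) !== "on-track").length;
  return delayed / runs.length > minShare;
}

const runs: RunBar[] = [
  { dataset: "A", startMinutes: 250, endMinutes: 320, typicalEndMinutes: 280, slaMinutes: 300 },
  { dataset: "B", startMinutes: 200, endMinutes: 250, typicalEndMinutes: 240, slaMinutes: null },
  { dataset: "C", startMinutes: 100, endMinutes: 150, typicalEndMinutes: 160, slaMinutes: 200 },
];
console.log(runs.map(barStatus)); // [ 'missed-sla', 'later-than-typical', 'on-track' ]
```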

Finding the Needle in the Haystack — “Bottlenecks”

For datasets with very large dependency trees, we found it difficult to find the relevant, slow “bottleneck” steps that delay the entire data pipeline. We were able to drastically reduce noise and highlight these problematic datasets by developing the concept of a “bottleneck” path — the sequence of the latest-landing data ancestors which prevented a child data transformation from starting, thus delaying the entire pipeline (Figure 5).

Figure 5 Comparison of a full data lineage (left, n=82) versus just the filtered “bottleneck” path (right, n=8). Bottleneck paths significantly improve signal-to-noise and make it easy to find problematic steps in large data pipelines.
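
One way to read the bottleneck definition is as a walk up the lineage that, at each step, follows the parent that landed last, since that parent is the one the child had to wait for. The sketch below implements that reading; the data shape and names are hypothetical, and the production algorithm may differ (for example in how ties or missing runs are handled).

```typescript
interface DatasetRun {
  parents: string[];       // direct upstream dependencies
  landedAtMinutes: number; // landing time for the selected run
}
type RunLineage = Record<string, DatasetRun>;

// Follows the latest-landing parent at each level, starting from the final dataset.
function bottleneckPath(lineage: RunLineage, finalDataset: string): string[] {
  const path: string[] = [finalDataset];
  const visited: Record<string, boolean> = {};
  visited[finalDataset] = true;
  let current = finalDataset;
  let hasNext = true;
  while (hasNext) {
    // Ignore parents without run data and guard against revisiting a node.
    const parents = lineage[current].parents.filter((p) => p in lineage && !visited[p]);
    if (parents.length === 0) {
      hasNext = false;
    } else {
      let latest = parents[0];
      for (const p of parents) {
        if (lineage[p].landedAtMinutes > lineage[latest].landedAtMinutes) latest = p;
      }
      path.push(latest);
      visited[latest] = true;
      current = latest;
    }
  }
  return path;
}

const runLineage: RunLineage = {
  A: { parents: ["B", "E"], landedAtMinutes: 240 },
  B: { parents: ["C", "D"], landedAtMinutes: 200 },
  C: { parents: [], landedAtMinutes: 120 },
  D: { parents: [], landedAtMinutes: 180 },
  E: { parents: [], landedAtMinutes: 150 },
};
console.log(bottleneckPath(runLineage, "A")); // [ 'A', 'B', 'D' ]
```

Only three of the five datasets remain on the path here, which is the same kind of noise reduction as going from 82 to 8 nodes in Figure 5.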

Is it Me or You? Diving into the Historical View

Once the bottleneck step was identified, the next important question became whether the delay in that step was due to long runtimes or delays in upstream dependencies. This helps data producers understand whether they need to optimize their own pipeline or instead negotiate with owners of upstream datasets for earlier SLAs. To enable this, we built a view of the detailed historical runtimes of a single dataset, showing both when each run started and how long it took (Figure 6).

Figure 6 Historical runtimes and duration distributions quickly distinguish whether SLA misses (red) are due to starting late (top) versus long runtime durations (bottom).
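
The "is it me or you?" question can be sketched as a comparison of a run's start time and duration against typical values. The names, the single-number "typical" values, and the 10-minute tolerance below are assumptions for illustration; the Historical view itself presents distributions for a person to interpret rather than an automated verdict.

```typescript
type MissCause = "late-start" | "long-runtime" | "both" | "unclear";

interface RunTiming {
  startMinutes: number;    // when the job started, minutes after midnight UTC
  durationMinutes: number; // how long the job ran
}

// Compares one run to typical timings; differences within the tolerance are ignored.
function diagnoseMiss(run: RunTiming, typical: RunTiming, toleranceMinutes: number = 10): MissCause {
  const startedLate = run.startMinutes - typical.startMinutes > toleranceMinutes;
  const ranLong = run.durationMinutes - typical.durationMinutes > toleranceMinutes;
  if (startedLate && ranLong) return "both";
  if (startedLate) return "late-start";     // likely waiting on upstream datasets
  if (ranLong) return "long-runtime";       // likely an issue in the dataset's own job
  return "unclear";
}

const typicalTiming: RunTiming = { startMinutes: 180, durationMinutes: 60 };
console.log(diagnoseMiss({ startMinutes: 260, durationMinutes: 55 }, typicalTiming));  // late-start
console.log(diagnoseMiss({ startMinutes: 182, durationMinutes: 140 }, typicalTiming)); // long-runtime
```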

By combining these complementary views in SLA Tracker, we were able to provide a full perspective of data timeliness (Figure 7).

Figure 7 SLA Tracker comprises multiple views. The Report view provides an overview of dataset status, the Lineage view enables root cause analysis of data landing times, and the Historical view captures historical trends in detail.

Process and Tooling

We spent roughly 12 months conceptualizing, designing, prototyping, and productionizing SLA Tracker. Much of this time was spent developing the data APIs that power the UI, and iterating on the Lineage view.

For the simpler Report view, we leveraged static designs and click-through prototypes with generic mock data. Throughout alpha and beta product releases, we iterated on visual language and made data more visual to improve comprehension (Figure 8).

Figure 8 Evolution of making the “current vs typical” landing time visual.

We used an entirely different approach in designing the Lineage view. Its information hierarchy is dictated by the shape of the data, which makes prototyping with real data samples critical. We built these prototypes in TypeScript using our low-level visx visualization component suite for React, which allows for partial code reuse during productionization (Figure 9).

Figure 9 Evolution of the Lineage Page Gantt chart (left to right): early multi-run summary boxplots; single run tracks with dependency arcs; “bottleneck” simplification.

After we were confident in our visualization, we refined the visual elements in static mocks in Figma before productionizing (Figure 10).

Figure 10 Developing a simple but coherent design language (left) across SLA Tracker views (right) helped balance information density by making elements more quickly understandable.

Conclusion

In this project, we have applied data visualization and UI/UX design (the interdisciplinary craft we refer to as “Data Experience”) to important data timeliness problems that require deep understanding of complex temporal and hierarchical information. This has enabled us to make data timeliness insights accessible, even in the complex data ecosystem of a large-scale company. Developing sophisticated visual analytics tools requires time and iteration, but the resulting product can provide great value to your organization.

Acknowledgements ❤️

SLA Tracker was the culmination of the efforts of many people and teams. While we focus on the data visualization aspect in this article, there were other important challenges we had to overcome in order to make the analytical tool possible. Thanks to the entire team who worked on the frontend, backend, and data engineering to make this product possible: Conglei Shi, Erik Ritter, Jiaxin Ye, John Bodley, Michelle Thomas, Serena Jiang, Shao Xie, Xiaobin Zheng, and Zuzana Vejrazkova.

All trademarks are the property of their respective owners. Any use of these is for identification purposes only and does not imply sponsorship or endorsement.


Visualizing Data Timeliness at Airbnb was originally published in Airbnb Engineering & Data Science on Medium.

Source: Airbnb
