How we ensure credible analytics on Dropbox mobile apps

Adding user analytics to a feature may seem like a straightforward task: add events to all the appropriate call sites, pass the event names to product analysts who will use the events for their investigations, and voilà—you have analytics for your feature! 

In reality, the user analytics story is a lot more complicated. When creating new features, engineers rarely think about analytics or how they will measure the success of the new feature. Feature analytics are often added at the last minute at the request of business stakeholders, with little specification of what data is needed or what insights it should support. As a last-minute addition, the analytics code tends to be poorly designed and poorly tested. After the initial launch of the feature, analytics are often forgotten and unobserved, and as new features are developed without sufficient test coverage, the old analytics become untrustworthy. It would almost be better to have no analytics at all than to make decisions based on faulty data!

My team at Dropbox recently invested in ensuring the integrity of our analytics. We wanted to share what we learned and what we would have done differently.

Our approach

Together with product analysts and product managers, we came up with three pillars of credible and useful analytics:

  1. Intentional: analytics should answer real business questions about feature success that matter to stakeholders
  2. Credible: analytics should be well-tested to prevent degradation
  3. Discoverable: analytics should be easy to find, understand, and use by business stakeholders

Intentional data: Data as a feature success indicator

Oftentimes as engineers we do not meaningfully engage with product managers and product analysts to understand the questions they are trying to answer. We simply turn their request to “add analytics here” into a JIRA ticket, then into a pull request, then finally wash our hands of the entire process. This leads to messy analytics code being added at the last minute and without much thought. Worse, without a clear structure and guidance on how to intentionally add analytics, we may end up accidentally measuring the wrong things.

To ensure the events we log are intentional, engineers must play a more active role in how analytics get added to our codebases. We should ask ourselves: why are we adding feature analytics in the first place? To measure the success of the feature, of course! So how do we measure success? We ask our product analysts and product managers to come up with the business questions they want answered.

As an example, these are some of the questions we use to measure photo upload feature success on mobile apps:

  • How long does it take to upload a photo?
  • What is the upload completion rate?
  • How many of our mobile users are uploading photos? 
  • What is the week-over-week retention rate for photo uploads?

We can then design and instrument analytics that directly answer these questions, which also significantly speeds up the implementation and testing of the analytics.
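To make this concrete, here is a hypothetical sketch (the event and attribute names echo the snapshot example later in this post, but the helper functions are illustrative, not Dropbox's actual API) of events designed around specific business questions rather than added ad hoc:

```kotlin
// A minimal event model: each event exists to answer a specific question.
data class AnalyticsEvent(
    val name: String,
    val attributes: Map<String, String> = emptyMap()
)

// "What is the upload completion rate?" -> start/success pairs let analysts
// compute completions divided by starts.
fun uploadStartEvent(extension: String) = AnalyticsEvent(
    "upload.start",
    mapOf("extension" to extension)
)

// "How long does it take to upload a photo?" -> record duration on success.
fun uploadSuccessEvent(extension: String, totalTimeMs: Long) = AnalyticsEvent(
    "upload.success",
    mapOf(
        "extension" to extension,
        "total_time_ms" to totalTimeMs.toString()
    )
)
```

Designing events this way means every field has a known consumer, so nothing is logged "just in case" and nothing needed for a business question is missing.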

Credible data: data as code

Product analysts are interested in answering questions involving complex user scenarios spanning multiple screens and user interactions. As developers, we write tests to validate that our code behaves as expected. However, since unit tests are designed to test the smallest unit of code, they cannot be used to ensure analytics are being captured properly as a user moves throughout the app. To test user flows throughout the app, we employ our instrumented UI test infrastructure. In a UI test, we simulate users going through complex workflows in our mobile app and capture the analytics events for those workflows. These analytics events are then serialized into JSON format and compared against an expected output.

Comparing complex outputs saved to a text file is called snapshot testing, or textual snapshot testing. The main idea is that the expected output is stored in a text file; when the test runs, a new output is generated and compared against the expected one. If the outputs match, the test passes; otherwise the test fails with an error showing the difference between the two outputs. Snapshot testing is not a new concept and has been widely applied in web development. Jest, the testing framework popular in the React ecosystem, leverages snapshot testing to test UI component structure without actually rendering any UI.
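The core comparison at the heart of snapshot testing is simple. The sketch below is an assumed illustration of the idea, not the internals of the SnapshotRule used later in this post:

```kotlin
// Compare freshly generated output against the stored expected text.
// Trimming avoids spurious failures from trailing newlines in the file.
fun snapshotMatches(actual: String, expected: String): Boolean =
    actual.trim() == expected.trim()

// Fail the test with both outputs so the developer can see the difference.
fun verifyTextualSnapshot(actual: String, expected: String) {
    if (!snapshotMatches(actual, expected)) {
        throw AssertionError(
            "Snapshot mismatch.\nExpected:\n$expected\nActual:\n$actual"
        )
    }
}
```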

Let’s look at a snapshot test example. Here is an instrumented test that tests the photo upload scenario in our Android app:

@get:Rule
val snapshotTestRule = SnapshotRule.Creator().create()

@Test
@ExpectedSnapshot(R.raw.test_upload_image)
fun testUploadImage() {
    uploadPhotoButton.click()
    uploadPhoto("test_photo.jpg")
    waitUntilSuccessfulUpload()

    snapshotTestRule.verifySnapshot()
}

In this test, we instrument a user scenario where a user clicks on an upload button, chooses a file to upload, and waits until the upload finishes successfully. While the test is running, it collects analytics events of interest as defined in our test rule. 

@get:Rule
val snapshotTestRule = SnapshotRule.Creator()
    .metricsToVerify(
        "upload_button.clicked",
        "upload.success",
        "upload.start"
    )
    .attributesToVerify(
        "extension"
    )
    .create()

After the last user scenario instrumentation call, we run verifySnapshot. This verifies that the recorded events match the expected textual snapshot. The expected snapshot is a raw resource referenced in the @ExpectedSnapshot annotation on the test. Below is what the contents of the test_upload_image snapshot file look like:

    [
      {
        "event": "upload_button.clicked"
      },
      {
        "event": "upload.start",
        "extension": "jpg"
      },
      {
        "event": "upload.success",
        "extension": "jpg",
        "total_time_ms": "EXCLUDED_FOR_SNAPSHOT_TEST"
      }
    ]

In the example above, note the presence of the EXCLUDED_FOR_SNAPSHOT_TEST value. Some of the fields we record for analytics events are not deterministic and will differ each time the code runs, since we do not mock the majority of the systems in a UI test. For these non-deterministic fields, we chose not to test the recorded value and instead only ensure the presence of the field. This way, if the field tagging for the event is changed or removed in the future, the snapshot test will catch it.
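One way to implement this masking (an assumed sketch, not necessarily how the SnapshotRule does it) is to replace the values of known non-deterministic fields with the fixed placeholder before serializing, while keeping the key itself:

```kotlin
// Fields whose values change on every run (assumed examples).
val nonDeterministicFields = setOf("total_time_ms")

// Replace non-deterministic values with a fixed placeholder. The key is
// preserved, so if a code change stops tagging the field entirely, the
// snapshot diff still flags the regression.
fun maskForSnapshot(event: Map<String, String>): Map<String, String> =
    event.mapValues { (key, value) ->
        if (key in nonDeterministicFields) "EXCLUDED_FOR_SNAPSHOT_TEST"
        else value
    }
```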

Discoverable data: anyone can gain insight from data

With user analytics added, verified, and put under test, the next step is to make the insights gathered from analytics accessible to the team and stakeholders. On the Mobile team, we have dashboards dedicated to each core Dropbox feature, with graphs answering questions about feature use (e.g., how long does it take for the Home screen to load?). Data is useless if it is not used and monitored consistently, so at Dropbox Mobile we combine automated monitoring for data anomalies with manual dashboard reviews by analysts and on-call engineers. The specifics of how to share data, and with what tools, will differ for each company. The important thing to remember is that after defining the questions we want our data to answer, we can finally get those answers and monitor insights over time!

Results

The process of making analytics intentional, credible, and discoverable can be applied to analytics for completely new features as well as existing ones. Here are a couple of examples of how this work helped ensure credible analytics for our mobile apps:

  • We discovered that some of our existing analytics were double-logged. For example, when the user clicked on certain navigation buttons, the logging event fired twice. We fixed the issues and, by creating snapshot tests for navigating through the app, ensured that the core navigation analytics stay credible.
  • We learned that some of the analytics we thought we were recording correctly turned out to be unusable by our stakeholders. For example, when app sign-in failed, we logged the localized error message shown to the user in a dialog. Since Dropbox is translated into 22 languages and there are many possible sign-in failures, visualizing the breakdown of sign-in error reasons required a lot of work mapping translated error messages to unified error types. To fix the logging, we switched to an enum representing the various error states (for example, GOOGLE_AUTH_FAIL). With consistent enums, we can now easily visualize the sign-in error breakdown and focus on the most impactful sign-in errors.
  • As we invest in rearchitecting some of our code (for example, replacing legacy C++ with native code), snapshot tests will ensure that feature analytics do not degrade during the migration.
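The enum-based error logging described above can be sketched as follows. GOOGLE_AUTH_FAIL comes from the example in the text; the other enum values and the event/field names are assumed for illustration:

```kotlin
// Stable, locale-independent error states. GOOGLE_AUTH_FAIL is from the
// article; the remaining values are hypothetical examples.
enum class SignInError {
    GOOGLE_AUTH_FAIL,
    WRONG_PASSWORD,
    NETWORK_UNAVAILABLE
}

// Log the enum name instead of the localized message shown to the user,
// so the data aggregates cleanly across all 22 supported languages.
fun signInFailureEvent(error: SignInError): Map<String, String> =
    mapOf(
        "event" to "sign_in.failure",
        "error_type" to error.name
    )
```

Because `error.name` is the same string in every locale, an analyst can group by `error_type` directly, with no translation-mapping step.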

Overall, we are incredibly satisfied with the impact of our investment in ensuring the integrity of Dropbox mobile analytics.

We perform biweekly releases for our Mobile apps at Dropbox, so getting the data right on the first go is crucial: if we make a mistake in logging the data, we have to wait two weeks to fix it. By engaging early and often with our product and analyst counterparts, we can more fully understand the value of our analytics. By knowing the questions we are trying to answer, we can more meaningfully and thoughtfully add analytics code to our codebases. Lastly, by snapshot testing the events and fields for our analytics, we can ensure the accuracy of our data.

Thanks to Alison Wyllie, Amanda Adams, Angella Derington, Anthony Kosner, Mike Nakhimovich, and Sarah Tappon for comments and review.  And thanks to Zhaoze Zhou and Angella Derington for helping make mobile analytics healthy and happy.

Source: Dropbox