In the scenario seen in Figure 7, a certain dimension in dimension source 1 is updated. Given that this dimension is being used by two dimension sets (i.e., A123 and B123), data versions associated with these two dimension sets will get updated accordingly. Since the data versions are now updated, backfills for these two dimension sets will kick in automatically. In Minerva, the cycle of new changes resulting in new data versions, which in turn trigger new backfills, is what allows us to maintain data consistency across datasets. This mechanism ensures that upstream changes are propagated to all downstream datasets in a controlled manner and that no Minerva dataset will ever diverge from the single source of truth.
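The propagation described above can be sketched in a few lines of Python. This is a minimal illustration, not Minerva's actual implementation: it assumes a data version is a hash of a definition plus the versions of everything it depends on, and the names `data_version`, `dimension_set_version`, `sets_needing_backfill`, `dim_a`, `dim_b`, and `C456` are all hypothetical.

```python
import hashlib
import json


def data_version(definition: dict) -> str:
    """Hash a definition deterministically (hypothetical versioning scheme)."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def dimension_set_version(dimension_set: dict, dimensions: dict) -> str:
    """A dimension set's version depends on every dimension it uses."""
    parts = {name: data_version(dimensions[name])
             for name in dimension_set["dimensions"]}
    return data_version({"set": dimension_set, "parts": parts})


def sets_needing_backfill(dimension_sets: dict, dimensions: dict,
                          materialized_versions: dict) -> list:
    """Names of dimension sets whose current version differs from the stored one."""
    return sorted(
        name for name, ds in dimension_sets.items()
        if dimension_set_version(ds, dimensions) != materialized_versions.get(name)
    )


# Illustrative data only: dim_a lives in dimension source 1.
dimensions = {
    "dim_a": {"source": "dimension_source_1", "expr": "a"},
    "dim_b": {"source": "dimension_source_2", "expr": "b"},
}
dimension_sets = {
    "A123": {"dimensions": ["dim_a", "dim_b"]},
    "B123": {"dimensions": ["dim_a"]},
    "C456": {"dimensions": ["dim_b"]},
}

# Record the versions of the data that is currently materialized.
materialized = {name: dimension_set_version(ds, dimensions)
                for name, ds in dimension_sets.items()}

# Updating dim_a changes the versions of the two sets that use it.
dimensions["dim_a"]["expr"] = "a_v2"
print(sets_needing_backfill(dimension_sets, dimensions, materialized))
```

Under these assumptions, only the dimension sets that reference the changed dimension get a new version, so only those are backfilled.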
Now that we have explained how Minerva uses data versioning to maintain data consistency, a keen user might already observe a dilemma: the rate of backfills competes with the rate of user changes. In practice, backfills often could not catch up with user changes, especially when updates affect many datasets. Given that Minerva only surfaces data that is consistent and up-to-date, a rapidly changing dataset could end up in backfill mode forever and cause significant data downtime.
To address this challenge, we created a parallel computation environment called the Staging environment. The Staging environment is a replica of the Production environment built from pending user configuration modifications. By performing the backfills automatically within a shared environment before they replace their Production counterparts, Minerva applies multiple unreleased changes to a single set of backfills. This has at least two advantages: 1) users no longer need to coordinate changes and backfills across teams, and 2) data consumers no longer experience data downtime.
The data flow for the Staging environment is as follows:
- Users create and test new changes in their local environment.
- Users merge changes to the Staging environment.
- The Staging environment loads the Staging configurations, supplements them with any necessary Production configurations, and begins to backfill any modified datasets.
- After the backfills are complete, the Staging configurations are merged into Production.
- The Production environment immediately picks up the new definitions and utilizes them for serving data to consumers.
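The flow above can be sketched as follows. This is a simplified model under stated assumptions, not Minerva's code: configurations are plain dictionaries keyed by dataset name, and `effective_configs`, `modified_datasets`, `run_staging_cycle`, and the `backfill` callback are hypothetical names.

```python
def effective_configs(production: dict, staging: dict) -> dict:
    """Staging configs take precedence; Production fills in everything else."""
    return {**production, **staging}


def modified_datasets(production: dict, staging: dict) -> list:
    """Datasets whose Staging config differs from what Production serves."""
    return sorted(name for name, cfg in staging.items()
                  if production.get(name) != cfg)


def run_staging_cycle(production: dict, staging: dict, backfill) -> dict:
    """Backfill modified datasets in Staging, then return the promoted configs.

    `production` is not mutated, so it can keep serving the old data
    until the caller swaps in the returned configs.
    """
    configs = effective_configs(production, staging)
    for name in modified_datasets(production, staging):
        backfill(name, configs[name])
    return configs


# Illustrative data: two pending changes share one Staging cycle.
production = {"dataset_a": {"rev": 1}, "dataset_b": {"rev": 1}, "dataset_c": {"rev": 1}}
staging = {"dataset_a": {"rev": 2}, "dataset_b": {"rev": 3}}

backfilled = []
production = run_staging_cycle(
    production, staging, lambda name, cfg: backfilled.append(name)
)
print(backfilled)
```

Because Staging holds the latest pending configuration per dataset, each modified dataset is backfilled once per cycle, however many changes were merged into it.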
The Staging environment allows us to have both consistency and availability for critical business metrics, even when users update definitions frequently. This has been critical for the success of many mass data migration projects within the company, and it has aided efforts to revamp our data warehouse as we focused on data quality.
Defining metrics and dimensions is a highly iterative process. Users often uncover raw data irregularities or need to dig deeper to understand how their source data was generated. As the source of truth for metrics and dimensions built on top of automatically generated datasets, Minerva must help users validate data correctness, clearly explain what is happening, and speed up the iteration cycle.
To do this, we created a guided prototyping tool that reads from Production but writes to an isolated sandbox. Similar to the Staging environment, this tool uses the Minerva pipeline execution logic to generate sample data quickly on top of the user’s local modifications. This allows users to apply new and existing data quality checks, and it provides sample data for validating the outputs against their assumptions and/or existing data.
The tool clearly shows the step-by-step computation the Minerva pipeline will follow to generate the output. This peek behind the curtain provides visibility into Minerva computation logic, helps users debug issues independently, and also serves as an excellent testing environment for the Minerva platform development team.
Finally, the tool uses user-configured date ranges and sampling to limit the size of the data being tested. This dramatically speeds up execution time, reducing iteration from days to minutes, while allowing the datasets to retain many of the statistical properties needed for validation.
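One way to limit test data like this is to combine a date filter with deterministic, hash-based sampling. The sketch below is only an illustration of that idea; the post does not describe the tool's actual sampling method, and `in_sample`, `limit_rows`, and the `listing_id` and `ds` field names are assumptions.

```python
import hashlib
from datetime import date


def in_sample(key: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all keys by hashing them."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < rate


def limit_rows(rows, start: date, end: date, rate: float,
               key_field: str = "listing_id", date_field: str = "ds"):
    """Keep rows inside [start, end] whose key falls in the sample."""
    return [
        row for row in rows
        if start <= row[date_field] <= end
        and in_sample(str(row[key_field]), rate)
    ]


# Illustrative data: 1,000 listings, one row per listing per day for two days.
rows = [
    {"listing_id": i, "ds": date(2020, 3, day), "nights_booked": 1}
    for i in range(1000)
    for day in (1, 2)
]
sample = limit_rows(rows, date(2020, 3, 1), date(2020, 3, 1), rate=0.1)
print(len(rows), "rows reduced to", len(sample))
```

Sampling on a hash of the key rather than on a random draw means the same listings are kept in every table, so joins between sampled tables still line up.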
To illustrate how everything fits together, let’s walk through an example of how Alice, an analyst, was able to turn data into shared company insights with Minerva. As described in our first post, COVID-19 has completely changed the way people travel on Airbnb. Historically, Airbnb has been pretty evenly split between demand for urban and non-urban destinations. At the onset of the pandemic, Alice hypothesized that travelers would avoid large cities in favor of destinations where they could keep social distance from other travelers.
To confirm this hypothesis, Alice decided to analyze the nights_booked metric, cut by the dim_listing_urban_category dimension. She knew that the nights_booked metric is already defined in Minerva because it is a top-line metric at the company. The listing dimension she cared about, however, was not readily available in Minerva. Alice worked with her team to leverage the Global Rural-Urban Mapping Project and GPW v4 World Population Density Data¹ created by NASA to tag all listings with this new metadata. She then began to prototype a new Minerva dimension using this new dataset.
Alice also included the new dimension definition in several dimension sets used across the company for tracking the impact of COVID-19 on business operations.
To validate this new dimension in Minerva, Alice used the prototyping tool described above to compute a sample of data with this new dimension. Within minutes, she was able to confirm that her configuration was valid and that the data was being combined accurately.
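As a rough illustration of the kind of check Alice might run on sample data, the sketch below cuts nights_booked by dim_listing_urban_category and confirms that the cut preserves the overall total. It is not Minerva's logic: the `cut_metric` helper, the category values, and the sample rows are all made up for the example.

```python
from collections import defaultdict


def cut_metric(fact_rows, dimension, metric="nights_booked",
               key="listing_id", default="unknown"):
    """Sum a metric per dimension value; unmatched keys go to `default`."""
    totals = defaultdict(int)
    for row in fact_rows:
        totals[dimension.get(row[key], default)] += row[metric]
    return dict(totals)


# Hypothetical sample: bookings plus a listing -> urban category mapping.
bookings = [
    {"listing_id": 1, "nights_booked": 3},
    {"listing_id": 2, "nights_booked": 5},
    {"listing_id": 3, "nights_booked": 2},
    {"listing_id": 4, "nights_booked": 4},  # listing not tagged yet
]
dim_listing_urban_category = {1: "urban", 2: "non_urban", 3: "urban"}

by_category = cut_metric(bookings, dim_listing_urban_category)

# The cut must neither drop nor duplicate nights.
total = sum(row["nights_booked"] for row in bookings)
assert sum(by_category.values()) == total
print(by_category)
```

Routing untagged listings to an explicit bucket keeps the per-category totals summing to the top-line metric, which makes gaps in the new metadata visible instead of silently dropping rows.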
After validating the data, Alice submitted a pull request for code review from the Core Host team, which owns the definition of all Listing metadata. This pull request included execution logs, computation cost estimates, and links to sample data for easy review. After receiving approvals, Alice merged the change into the shared Staging environment where, within a few hours, the entire history of the modified datasets was automatically backfilled and eventually merged into Production.
Using the newly created datasets, teams and leaders across the company began to highlight and track these shifts in user behavior in their dashboards. This change to our key performance indicators also led to new plans to revamp key product pages to suit users’ new travel patterns.