In the scenario shown in Figure 7, a dimension in dimension source 1 is updated. Because this dimension is used by two dimension sets (A123 and B123), the data versions associated with both dimension sets are updated accordingly. Since the data versions have changed, backfills for these two dimension sets kick in automatically. In Minerva, this cycle of new changes producing new data versions, which in turn trigger new backfills, is what allows us to maintain data consistency across datasets. This mechanism ensures that upstream changes are propagated to all downstream datasets in a controlled manner and that no Minerva dataset ever diverges from the single source of truth.
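
To make this concrete, here is a minimal sketch, assuming data versions are hashes of a dataset's full configuration (the class and field names below are illustrative, not Minerva's actual API), of how one upstream dimension change cascades into new versions and backfills:

```python
import hashlib
import json

def data_version(config: dict) -> str:
    """Derive a deterministic version hash from a dataset's full configuration."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

class DimensionSet:
    def __init__(self, name: str, dimension_configs: dict):
        self.name = name
        self.dimension_configs = dimension_configs
        self.version = data_version(dimension_configs)

    def on_upstream_change(self, dimension_name: str, new_config: dict) -> bool:
        """Recompute the data version; a changed version signals a backfill is needed."""
        if dimension_name not in self.dimension_configs:
            return False
        self.dimension_configs[dimension_name] = new_config
        new_version = data_version(self.dimension_configs)
        needs_backfill = new_version != self.version
        self.version = new_version
        return needs_backfill

# A dimension shared by two dimension sets, as in Figure 7.
shared = {"dim_market": {"source": "dimension_source_1", "column": "market"}}
a123 = DimensionSet("A123", dict(shared))
b123 = DimensionSet("B123", dict(shared))

updated = {"source": "dimension_source_1", "column": "market_v2"}
for ds in (a123, b123):
    if ds.on_upstream_change("dim_market", updated):
        print(f"backfill triggered for {ds.name} at version {ds.version}")
```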

To address this challenge, we created a parallel computation environment called the Staging environment. The Staging environment is a replica of the Production environment built from pending user configuration changes. By running backfills automatically in this shared environment before datasets replace their Production counterparts, Minerva batches multiple unreleased changes into a single set of backfills. This has at least two advantages: 1) users no longer need to coordinate changes and backfills across teams, and 2) data consumers no longer experience data downtime.

The data flow for the Staging environment is as follows (a simplified sketch in code follows Figure 8):

  1. Users create and test new changes in their local environment.
  2. Users merge changes to the Staging environment.
  3. The Staging environment loads the Staging configurations, supplements them with any necessary Production configurations, and begins to backfill any modified datasets.
  4. After the backfills are complete, the Staging configurations are merged into Production.
  5. The Production environment immediately picks up the new definitions and utilizes them for serving data to consumers.
Figure 8: A configuration change is first loaded into Staging and then merged to Production when release-ready.
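
A highly simplified sketch of steps 3 through 5, assuming configurations can be modeled as plain dictionaries keyed by dataset name (the function names and the `backfill` callable are illustrative, not Minerva's interface):

```python
def load_staging(production: dict, pending_changes: dict) -> dict:
    """Step 3: overlay pending user changes on top of Production configurations."""
    staging = dict(production)
    staging.update(pending_changes)
    return staging

def release(production: dict, staging: dict, backfill) -> dict:
    """Steps 3-5: backfill every modified dataset in Staging, then merge to Production."""
    modified = {name: cfg for name, cfg in staging.items()
                if production.get(name) != cfg}
    for name, cfg in modified.items():
        backfill(name, cfg)           # runs against the Staging copy
    production.update(modified)       # step 4: merge once backfills succeed
    return production                 # step 5: new definitions served immediately
```

Because the backfills run against the Staging copy, Production keeps serving the old definitions uninterrupted until the merge in step 4.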

The Staging environment allows us to provide both consistency and availability for critical business metrics, even when users update definitions frequently. This has been critical to the success of many mass data migration projects within the company, and it has aided efforts to revamp our data warehouse with a focus on data quality.

Figure 9: A user’s development flow using the Minerva prototyping tool.

To do this, we created a guided prototyping tool that reads from Production but writes to an isolated sandbox. Like the Staging environment, this tool leverages the Minerva pipeline execution logic to generate sample data quickly on top of the user's local modifications. This lets users run both new and existing data quality checks, and it produces sample data they can validate against their assumptions and/or existing data.
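
The tool's interface is not public, so the following is only a sketch of the read/write isolation it describes; `PrototypeRunner`, `transform`, and the sandbox naming scheme are all assumptions:

```python
def transform(row: dict, config: dict) -> dict:
    """Stand-in for Minerva's per-dataset computation logic."""
    return {**row, "dimension_value": row.get(config["dimension_column"])}

class PrototypeRunner:
    """Reads source data from Production, but writes results to a per-user sandbox."""

    def __init__(self, production_reader, sandbox_writer, user: str):
        self.read = production_reader   # e.g. fetches a Production table by name
        self.write = sandbox_writer     # e.g. persists rows to a named sandbox table
        self.user = user

    def run(self, dataset: str, local_config: dict, checks: list) -> str:
        rows = self.read(local_config["source_table"])
        output = [transform(row, local_config) for row in rows]
        for check in checks:            # run new and existing data quality checks
            assert check(output), f"{check.__name__} failed for {dataset}"
        target = f"sandbox_{self.user}.{dataset}"   # isolated from Production
        self.write(target, output)
        return target                   # sample data location to share for review
```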

The tool clearly shows the step-by-step computation the Minerva pipeline will follow to generate the output. This peek behind the curtain provides visibility into Minerva's computation logic, helps users debug issues independently, and also serves as an excellent testing environment for the Minerva platform development team.

Finally, the tool uses user-configured date ranges and sampling to limit the size of the data being tested. This dramatically reduces execution time, cutting iteration from days to minutes, while the datasets retain many of the statistical properties needed for validation.
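
A sketch of that input-limiting step, assuming rows carry a `ds` date partition column and that a seeded pseudo-random sample is acceptable (both are assumptions, not details from the text):

```python
import random
from datetime import date

def limit_input(rows, start: date, end: date, sample_rate: float, seed: int = 0):
    """Restrict prototype input to a user-chosen date range, then downsample.

    A fixed seed keeps reruns comparable, and uniform sampling preserves
    many of the statistical properties needed for validation.
    """
    rng = random.Random(seed)
    in_range = (r for r in rows if start <= r["ds"] <= end)
    return [r for r in in_range if rng.random() < sample_rate]
```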

To confirm this hypothesis, Alice decided to analyze the nights_booked metric, cut by the dim_listing_urban_category dimension. She knew that the nights_booked metric was already defined in Minerva because it is a top-line metric at the company. The listing dimension she cared about, however, was not readily available in Minerva. Alice worked with her team to leverage the Global Rural-Urban Mapping Project and GPW v4 World Population Density Data¹ created by NASA to tag all listings with this new metadata. She then began to prototype a new Minerva dimension using this new dataset.

Figure 10: Alice configures the new dimension in a dimension source.
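
Figure 10 is a screenshot, so the exact configuration is not reproduced here; a dimension source entry of roughly this shape is one plausible reading (every field name except dim_listing_urban_category is a guess for illustration):

```python
# Hypothetical dimension source configuration; field names are illustrative.
listing_urban_category_source = {
    "name": "listings_urban_tagged",
    "source_table": "core_data.dim_listings_urban_tagged",  # hypothetical table
    "dimensions": [
        {
            "name": "dim_listing_urban_category",
            # Column produced by tagging listings with the GRUMP / GPW v4 data.
            "definition": "urban_category",
            "type": "categorical",
        },
    ],
}
```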

Alice also included the new dimension definition in several dimension sets used across the company for tracking the impact of COVID-19 on business operations.

Figure 11: Alice adds the new dimension to the COVID SLA dimension set owned by the Central Insights team.
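
Illustratively again (the real contents of the COVID SLA dimension set are not shown in the text), Alice's change might amount to a single new entry:

```python
# Hypothetical dimension set owned by the Central Insights team.
covid_sla_dimension_set = {
    "name": "covid_sla",
    "owner": "central_insights",
    "metrics": ["nights_booked"],      # top-line metric named in the text
    "dimensions": [
        # ...existing dimensions...
        "dim_listing_urban_category",  # Alice's new dimension
    ],
}
```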

To validate this new dimension in Minerva, Alice used the prototyping tool described above to compute a sample of data with this new dimension. Within minutes, she was able to confirm that her configuration was valid and that the data was being combined accurately.

Figure 12: Alice was able to share sample data with her teammate within a few minutes.

After validating the data, Alice submitted a pull request for code review from the Core Host team, which owns the definition of all Listing metadata. This pull request included execution logs, computation cost estimates, as well as links to sample data for easy review. After receiving approvals, Alice merged the change into the shared Staging environment where, within a few hours, the entire history of the modified datasets was automatically backfilled and eventually merged into Production.

Figure 13: With Alice’s change, anyone in the company could clearly see the shift in guest demands as travel rebounds.

Using the newly created datasets, teams and leaders across the company began to highlight and track these shifts in user behavior in their dashboards. This change to our key performance indicators also led to new plans to revamp key product pages to suit users’ new travel patterns.


