After this realization, our technology teams aligned on several areas of work.

  1. Measurement & Accountability: creating a clear accounting structure for all AWS usage, so that we understand holistically how we consume and utilize AWS services.
  2. Cost Efficiency: identifying and eliminating waste.
  3. Process & Governance: defining operational processes and guardrails to inform decisions and improve our ability to forecast future demand.

In partnership with finance, our technology teams have made tremendous progress toward our operational efficiency goals, and continue to build amazing products in service of the Airbnb business.

Knowing your culture is an important consideration before starting any major change. Airbnb’s engineering culture is one of “you build it, you operate it”, and we pride ourselves on making data-informed decisions. This made two things clear. First, adding significant friction for our engineers would be met with heavy resistance. Second, we needed to invest more in our AWS cost and attribution data to develop actionable insights.

When starting out, we relied on existing systems to enable our program. We have an internal employee directory as the source of truth for teams. System ownership is defined in an internal tool, Scry. We use Apache Superset, a data exploration and visualization platform designed to be intuitive and interactive. We leverage Terraform as our configuration-as-code solution, which supports most of our tagging by ensuring an AWS Resource is attributed to a Project. Some of our AWS Resources are not created via Terraform, and for these, we created an alternative mechanism directly in our codebase.

We began to ingest the Cost and Usage Report (CUR), the most comprehensive source of AWS billing data available. Building on top of Airbnb’s robust data warehouse infrastructure, the team combined the cost and usage data with team data and system ownership data to develop an evolving picture of Airbnb’s cost footprint, which we call the “Airbnb CUR”. The Airbnb CUR powers a suite of Superset dashboards and metrics that support every pillar of the cost efficiency program, as well as downstream consumption attribution. We will share our technical approach for building this data in an upcoming post.
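As a rough illustration of the kind of join described above, the sketch below attributes cost line items to teams by way of a project tag. All names in it (the `project` tag key, the `cost` and `tags` fields, the sample projects and teams) are hypothetical and are not taken from Airbnb's pipeline or from the actual CUR schema. A real implementation would run in the data warehouse rather than in memory.

```python
from collections import defaultdict

UNATTRIBUTED = "unattributed"

def attribute_costs(line_items, project_to_team):
    """Sum cost per team by joining line items to owners via a project tag.

    line_items: iterable of dicts with hypothetical keys "cost" (float)
        and "tags" (dict of tag key -> tag value).
    project_to_team: dict mapping a project name to its owning team.
    """
    totals = defaultdict(float)
    for item in line_items:
        project = item.get("tags", {}).get("project")
        # Untagged or unknown projects stay visible in their own bucket,
        # so gaps in tagging coverage can be measured and fixed.
        team = project_to_team.get(project, UNATTRIBUTED)
        totals[team] += item["cost"]
    return dict(totals)

# Hypothetical sample data.
line_items = [
    {"cost": 120.0, "tags": {"project": "search-api"}},
    {"cost": 80.0, "tags": {"project": "payments"}},
    {"cost": 40.0, "tags": {"project": "search-api"}},
    {"cost": 15.5, "tags": {}},
]
project_to_team = {"search-api": "search", "payments": "payments-platform"}

costs_by_team = attribute_costs(line_items, project_to_team)
```

Keeping an explicit unattributed bucket is a design choice in this sketch: it turns missing tags into a number that can be tracked down over time.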

AWS announced Savings Plans in late 2019. We have realized the benefits and now have most of our compute resources covered under this arrangement. We monitor our Savings Plan utilization regularly to minimize On Demand charges and maximize use of the Savings Plan we have purchased. It can be challenging to predict our compute needs, so flexibility is essential. Today we have a set of prepared responses that move certain workloads on and off Savings Plan to keep utilization healthy. For example, we enhanced our Continuous Integration environment so that it can leverage spot instances. With a small configuration change, we can easily dial up our use of spot instances if we observe On Demand charges. When we are under-utilizing our Savings Plan, we move over EBS in our data warehouse to EC2 to ensure we stay close to maximum utilization of our Savings Plan.
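The monitoring loop described above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not Airbnb's tooling: the `HourlyUsage` fields, the thresholds, and the action names are all hypothetical, and utilization is taken here to mean consumed commitment divided by purchased commitment.

```python
from dataclasses import dataclass

@dataclass
class HourlyUsage:
    commitment: float       # Savings Plan commitment for the hour, in dollars
    used_commitment: float  # portion of that commitment actually consumed
    on_demand_spend: float  # eligible compute billed at On Demand rates

def utilization(hours):
    """Fraction of the purchased commitment that was actually used."""
    total = sum(h.commitment for h in hours)
    return sum(h.used_commitment for h in hours) / total if total else 0.0

def recommend_action(hours, min_utilization=0.95, max_on_demand=100.0):
    """Pick one of the prepared responses (thresholds are hypothetical)."""
    on_demand = sum(h.on_demand_spend for h in hours)
    if on_demand > max_on_demand:
        # Usage exceeds the commitment: push flexible work (such as CI) to spot.
        return "shift_flexible_workloads_to_spot"
    if utilization(hours) < min_utilization:
        # Commitment is going unused: move eligible workloads onto it.
        return "move_workloads_onto_savings_plan"
    return "no_action"

over_committed_day = [HourlyUsage(100.0, 80.0, 0.0)] * 24
overflowing_day = [HourlyUsage(100.0, 100.0, 30.0)] * 24
healthy_day = [HourlyUsage(100.0, 98.0, 2.0)] * 24
```

In this sketch On Demand overflow is checked first, because under the assumed definitions both conditions rarely hold at once and overflow is the more expensive of the two.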

The nimble engineering culture at Airbnb enables engineers to build and improve services autonomously, especially as AWS introduces more advanced offerings. We purchase a 3-year convertible Savings Plan to give us the flexibility to migrate to new instance types. For us, this flexibility outweighs the additional savings we could get from instance-specific Savings Plan purchases. In addition to Savings Plan for our compute capacity, we leverage Reserved Instances for RDS & ElastiCache.

Purchasing the right amount of Savings Plan requires ongoing communication and evaluation. Before we had a cost efficiency team, there was minimal evaluation of whether a spike was due to a short-term usage increase or a permanent increase we should factor into our purchasing strategy. As a result, it was easy to make uninformed purchases. We now project overall usage before making Savings Plan purchases by keeping in touch with dozens of engineering teams. Knowing ahead of time whether major services are going to be turned down or scaled up helps ensure we don’t over- or under-purchase. The efficiency team also works with engineering teams to stagger operations that require temporary compute so that total usage doesn’t create high On Demand costs. Constant vigilance is critical for capacity planning success.

When choosing the most cost-effective storage tier, you need to consider the access pattern for the data along with the file size and number of objects in the S3 bucket, as there can be unexpected costs. Take Glacier as an example. For each object archived to Glacier, S3 adds per-object overhead: about 8 KB of metadata billed at the “Standard” storage class rate, plus about 32 KB of index data billed at the Glacier rate. So if you store a 1 KB object in Glacier, you pay for roughly 33 KB in Glacier and 8 KB in Standard, each charged at its corresponding price. So while Glacier is only about 10% of the cost of the Standard storage class per GB, the total cost for many small objects can be higher than simply storing the data in Standard.
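The small-object effect is easy to check with a little arithmetic. The sketch below compares the per-object storage cost in Standard against an archive tier, expressed in units of the Standard price per KB. The overhead sizes are parameters because they depend on the storage class. The example values (8 KB billed at the Standard rate plus 32 KB billed at the archive rate) are an assumption based on the S3 documentation and should be verified, and the 10% price ratio is the rough figure quoted above. Request, transition, and minimum-duration charges are ignored.

```python
def storage_cost_units(object_kb, archive_price_ratio,
                       standard_overhead_kb, archive_overhead_kb):
    """Return (standard_cost, archive_cost) for one object.

    Costs are in units of "Standard price per KB", so storing the object
    in Standard simply costs object_kb units.
    """
    standard_cost = float(object_kb)
    archive_cost = (standard_overhead_kb
                    + (object_kb + archive_overhead_kb) * archive_price_ratio)
    return standard_cost, archive_cost

def break_even_kb(archive_price_ratio, standard_overhead_kb, archive_overhead_kb):
    """Object size at which both tiers cost the same (archive wins above it)."""
    return ((standard_overhead_kb + archive_overhead_kb * archive_price_ratio)
            / (1.0 - archive_price_ratio))

# Assumed example values; check them against current S3 pricing and docs.
RATIO = 0.10
STANDARD_OVERHEAD_KB = 8
ARCHIVE_OVERHEAD_KB = 32

small = storage_cost_units(1, RATIO, STANDARD_OVERHEAD_KB, ARCHIVE_OVERHEAD_KB)
large = storage_cost_units(1024, RATIO, STANDARD_OVERHEAD_KB, ARCHIVE_OVERHEAD_KB)
threshold = break_even_kb(RATIO, STANDARD_OVERHEAD_KB, ARCHIVE_OVERHEAD_KB)
```

Under these assumed values a 1 KB object costs roughly eleven times more to archive than to leave in Standard, while a 1 MB object is far cheaper archived. The break-even size comes out at around 12 KB.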

Compute costs are the single largest line item on our monthly bill, and cost efficiencies in this area have a big impact on our bottom line. While working to control our AWS costs, we are concurrently building new capabilities and improving our technology stack for the future. As part of this modernization, we are moving to Kubernetes. During our effort to eliminate waste, we found a number of large services that were not using the Horizontal Pod Autoscaler (HPA), and others that were using HPA in a largely sub-optimal way, such that it never effectively scaled the service (high minReplicas or low maxReplicas). A focused effort around service tuning improved our utilization and also maximized the impact of the cluster auto-scaling work, which will be discussed next.
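To make the minReplicas and maxReplicas problem concrete, the sketch below applies the scaling rule from the Kubernetes documentation, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, and then checks whether the configured bounds override the result. The `diagnose` helper and its labels are hypothetical, and the sketch leaves out the tolerance band and stabilization windows that the real autoscaler applies.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """Unclamped replica count the HPA rule asks for."""
    return math.ceil(current_replicas * current_metric / target_metric)

def diagnose(current_replicas, current_metric, target_metric,
             min_replicas, max_replicas):
    """Report whether the replica bounds stop the HPA from scaling."""
    raw = desired_replicas(current_replicas, current_metric, target_metric)
    if raw < min_replicas:
        # The HPA wants fewer pods, but minReplicas holds the service up.
        return "pinned_at_min"
    if raw > max_replicas:
        # The HPA wants more pods, but maxReplicas caps the service.
        return "pinned_at_max"
    return "scaling_freely"

# A service at 20% CPU against a 60% target, with minReplicas=50.
over_provisioned = diagnose(50, 20, 60, min_replicas=50, max_replicas=100)
# A service at 90% CPU against a 60% target, with maxReplicas=10.
capped = diagnose(10, 90, 60, min_replicas=2, max_replicas=10)
# A service whose bounds leave room in both directions.
healthy = diagnose(10, 45, 60, min_replicas=2, max_replicas=40)
```

A service that reports `pinned_at_min` for most of the day is paying for idle capacity, which is the pattern the tuning effort above targets.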

We started with a dashboard providing a view into how Airbnb’s overall AWS spend is distributed across different services. This enabled our monitoring track of work, discussed next. Our initial quick-and-dirty attribution was aimed at identifying high-cost areas where efficiency opportunities could have the most impact. This approach was effective for the first 9–12 months of our cost saving work. It became clear, however, that for the long term we needed a consistent pipeline architecture and a scalable attribution approach so that all services could plug into a generalized attribution framework. Additionally, the first version focused on identifying the direct cost of operating our systems, while the second version focused on how resources were consumed across systems to operate our site. This helped unlock key insights. For example, the best way for a team to reduce its costs may not be to micro-optimize its resource usage, but to work with an upstream caller to call its service less frequently.
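One simple way to picture consumption attribution is to spread a service's direct cost across its upstream callers in proportion to how much they call it. The sketch below does this by request count. That choice of allocation key, along with the function name and the sample figures, is an assumption made for illustration and is not a description of Airbnb's attribution framework.

```python
def attribute_to_callers(service_cost, calls_by_caller):
    """Split a service's direct cost across callers by share of calls.

    service_cost: direct cost of operating the service for some period.
    calls_by_caller: dict mapping caller name -> number of calls made.
    """
    total_calls = sum(calls_by_caller.values())
    if total_calls == 0:
        # No recorded consumers: nothing to attribute downstream.
        return {}
    return {
        caller: service_cost * calls / total_calls
        for caller, calls in calls_by_caller.items()
    }

# Hypothetical figures: a $1,000 service called by three upstream systems.
consumed = attribute_to_callers(
    1000.0, {"search": 700, "checkout": 200, "messaging": 100}
)
```

Viewed this way, the largest caller carries most of the cost, which makes the point above visible: reducing call volume upstream can matter more than tuning the service itself.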

In the earlier days of the cost efficiency team, consumption monitoring meetings involved considerable firefighting. The data would surface an anomalous spike in cost for a particular usage type and the monitoring group would begin a quick investigation to understand the root cause, reaching out to other teams to learn more. Over time, the group developed relationships with other teams at Airbnb and built a knowledge base of common pitfalls in cost management. Though spikes still happen, they are smaller in magnitude and less frequent than before.

As our program matures, we are also designating AWS cost champions across all product development organizations to replicate the operational review forums and efficiency efforts at the local level, with the central cost team supporting their efforts.

At our scale, cloud efficiency is a massive cross-functional and cross-organizational effort. It requires technologists, data scientists and finance experts to collaborate, develop shared goals and track progress continuously. We maximized our effectiveness by building a core team dedicated to developing a centralized view of cloud efficiency. However, this program would not be successful with only a core team. Our continued success depends on distributing responsibility for cost efficiency to the individual teams who are closest to the cost/benefit tradeoffs.

Work like this, along with many other exciting projects, is always happening at Airbnb. If you want to join us, check out our Airbnb Careers page.

________________________________________________________________

Amazon Web Services, EC2, Amazon RDS, ElastiCache, and Amazon S3 are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries.

“Ruby on Rails” is the registered trademark of David Heinemeier Hansson.

Apache Superset, Apache, and Superset are either registered trademarks or trademarks of The Apache Software Foundation in the United States and/or other countries.

Terraform is the trademark of HashiCorp.

Kubernetes and K8s are the registered trademarks of The Linux Foundation in the United States and/or other countries.

All trademarks are the properties of their respective owners. Any use of these is for identification purposes only and does not imply sponsorship or endorsement.


