Lyft and the ☁️

Over the past few months, I have been working at Lyft, helping spin up the Capacity team over here.

Recently, the work of our team caught some attention when Lyft’s IPO filing and the disclosure of our $300 million three-year AWS contract became the talk of the town. The top comment on the Hacker News post about Lyft’s S-1 discussed the contract, and luminaries like Sarah Guo and Benedict Evans tweeted their two cents on cloud infrastructure spending.

As a generation of cloud-first companies has scaled into huge businesses, managing the interface between those businesses and cloud service providers has become an important, and increasingly complex, part of operating a software company.

In this post, I want to talk about why cloud is the right choice for Lyft, the levers that we are building in Capacity to understand and control the company’s cloud resource usage, and some of the lessons we have learned along the way. Notably, the opinions expressed here are my own and do not reflect those of Lyft as an organization or any non-publicly stated opinions of individuals involved.


Using cloud services is absolutely the right choice for Lyft at this time.

If you were to have a look at a chart of our compute usage, the first thing you’d notice is that our usage is not consistent. At peak, we use almost twice as many machines as during our troughs. It’s also growing quickly. Even since we started building out the Capacity team earlier in the year, our footprint in terms of number of machines has increased by over 50%.

Not only do Lyft’s compute needs vary significantly with time of day and over time (read: business growth), but they are changing qualitatively as well, and just as quickly. Two examples of this qualitative change are our march into developing autonomous technology and our migration to Kubernetes. The former creates demand for specialized hardware, such as machines better suited to image processing and machine learning workloads (read: GPU workers). With the latter, we are looking to run the largest machines available in our clusters in order to bin-pack as many of our core services in one place as possible.

In the future, it may make sense for Lyft to move some workloads off of cloud services and onto our own infrastructure, but that day is not today. Not with this growth and not with this much change in what the company is doing.


With cloud spending as a fact of life, it is important that we build out the right organizational muscle in order to optimize our virtual data center usage.

That’s where Capacity comes in. This is the team that would be managing the layout of our data centers, if we had data centers.

Our team brings together:

However, instead of thinking about how to lay out and maintain machines in physical space, we find ourselves thinking about the digital layout of our machines, asking ourselves questions like: under what accounts and in what regions is our infrastructure running? Who is using it? How much is that usage costing? And, are we utilizing the resources we take on effectively?

Ultimately, we build tooling to answer those questions — and resolve irregularities as quickly as possible.


The highest priority work on Capacity is to ensure that the company does not have stability problems related to scale — that is, to ensure that we don’t reach any hard limitations on compute resources.

If the cloud resources we need to run Lyft are not available, that is an acute failure on our team’s part. Everything else should be dropped until the issue is resolved. It is on us to work with cloud providers to ensure that the company is steering clear of both provider imposed and physical limits on resources.

An important part of our team’s work is simply to collect our provider-imposed limits, our current usage against those limits, and how likely our usage is to jump. Limits on each instance type are an obvious candidate to track; additionally, there are numerous non-instance limits, such as Dynamo reads and writes, number of Dynamo tables, number of launch configurations, and number of auto-scaling groups, that are just as important to track and tweak to ensure smooth sailing.
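To make that concrete, here is a minimal sketch, in Python, of the kind of headroom check this tracking enables. The limit names and numbers are placeholders, and in practice the values would be pulled from the provider’s APIs (for AWS, something like Service Quotas) rather than hard-coded.

```python
# Minimal sketch of a limit-headroom check. In practice the usage and limit
# values would come from the provider's APIs; the entries below are placeholders.

ALERT_THRESHOLD = 0.80  # flag anything above 80% of its limit

# (limit name, current usage, provider-imposed limit) -- illustrative values only
tracked_limits = [
    ("ec2: m5 instances", 180, 200),
    ("dynamodb: tables", 240, 256),
    ("autoscaling: launch configurations", 150, 200),
]

def check_headroom(limits, threshold=ALERT_THRESHOLD):
    """Return the limits whose usage has crossed the alerting threshold."""
    return [
        (name, usage, limit, usage / limit)
        for name, usage, limit in limits
        if usage / limit >= threshold
    ]

for name, usage, limit, utilization in check_headroom(tracked_limits):
    print(f"{name}: {usage}/{limit} ({utilization:.0%}) -- consider requesting an increase")
```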

However, it is important to know that limits are not guarantees of capacity, and we sometimes run into capacity constraints long before we hit any explicit limits. Thus, for peak events such as Halloween and New Year’s Eve, we build models of our projected usage and work with our providers to ensure they set aside that capacity explicitly.
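As a toy illustration of the projection idea, the sketch below scales a baseline fleet size by a historical peak-to-baseline ratio plus a safety margin. All of the figures are made up, and a real model would account for far more (growth since the last event, regional mix, per-instance-type demand).

```python
# Toy peak-event projection; all figures are hypothetical.
current_baseline_hosts = 4_000   # typical fleet size going into the event
peak_to_baseline_ratio = 1.9     # ratio observed during last year's event
safety_margin = 1.15             # headroom for forecast error

projected_peak_hosts = current_baseline_hosts * peak_to_baseline_ratio * safety_margin
print(f"Ask the provider to set aside capacity for ~{round(projected_peak_hosts)} hosts")
```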

Perhaps an unintuitive lesson we have learned with limits is that it is better to have sensible limits rather than arbitrarily high ones, since sensible limits bring to light when certain instance types are, for example, becoming much more popular. Often, that kind of alert calls attention to over-provisioning problems.


Beyond ensuring Lyft can scale, we care about ensuring that Lyft’s cloud infrastructure position and meta-position make sense.

We try not to be too prescriptive about what machines different teams should be using. Engineers are able to spin up whatever resources they need, within reason. Part of this is that we understand cloud infrastructure is not what will make or break Lyft. What matters at this stage of the company is executing as quickly and effectively as possible on building our technology and product.

Still, we have been known to make sweeping changes, such as modernizing many of our instance families to the latest generation. We also try to involve ourselves in discussions around which types of machines to run different parts of our infrastructure on, depending on compute needs and the cost to operate each type.

In addition to Lyft’s cloud position, we also think about our meta-position: we do work behind the scenes to ensure that our usage correctly matches our reservations. Without going too in-depth, AWS offers a discounted rate for up-front commitments, called reserved instances. At Lyft, we purchase three-year no upfront convertible reserved instances; the operative word is convertible, which means that we can shift our reservations between types to match our usage. Because we have historically allowed engineers to spin up instances of all sorts, our team’s role has been to go behind the scenes and shuffle around our reservations on a weekly cadence by finding the optimal configuration to save the company the most money.
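To give a flavor of what that weekly shuffle involves, here is a minimal sketch that compares reserved capacity with observed usage per instance family and surfaces where conversions are needed. Quantities are expressed in normalized units (AWS weights instance sizes within a family, so larger sizes count proportionally more), and both the numbers and the simple surplus-versus-deficit rule are illustrative rather than our actual optimizer.

```python
# Minimal sketch of the weekly reservation check. All quantities are in
# normalized units and all numbers are hypothetical.

observed_usage = {"m5": 9_000, "c5": 6_500, "r5": 2_000, "p3": 1_200}   # units in use
reserved_units = {"m5": 11_000, "c5": 5_000, "r5": 2_500, "p3": 400}    # units reserved

surplus = {fam: reserved_units.get(fam, 0) - used
           for fam, used in observed_usage.items()
           if reserved_units.get(fam, 0) > used}
deficit = {fam: used - reserved_units.get(fam, 0)
           for fam, used in observed_usage.items()
           if used > reserved_units.get(fam, 0)}

print("Over-reserved families (convert away from):", surplus)
print("Under-reserved families (convert toward):  ", deficit)
```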


Measurement is one of the most important things that we do. The majority of our recent effort has been around building out visibility into who (particularly which service, team, and organization) is using what resources, what is their pattern of usage, and what is the trend.

Measurement is the most important thing we do because it blocks virtually everything else we can do. As was recently echoed by my friend Jared: you should not (and for all intents and purposes cannot) optimize the performance of any system whose performance you have not yet measured. For the purpose of our team, we equate performance to cost — as opposed to latency — but the idea is the same.

Another teammate, Carlos Bueno, expressed a similar sentiment in his 2013 talk on Mature Optimization; he describes the general approach to optimization as measuring really carefully and then thinking really hard. The idea is that no two optimization efforts are the same, but with proper data, smart people can make decisions that move the needle in the right direction.

For us, optimization starts with putting our measurement data into the hands of those who have the agency to move the needle — the people working on the systems we want to optimize. This is because people working on systems have a lot more context on how their systems work than we do — and are ultimately the ones that should be spearheading changes.

In the future, a team like ours could focus on more tactical projects, such as ones in which engineers dive into the higher-leverage spend problems that lie beyond the walls of our tools.


Multi-tenant computing environments pose a challenge to accurate measurement as they obscure to whom usage should be attributed. Higher-order attribution becomes necessary to get an accurate understanding of usage.

At Lyft, many of the biggest spenders are multi-tenant platforms, such as data and machine learning, which are used by almost every engineer in the company. Knowing only that we spend some arbitrarily large amount of money on our machine learning infrastructure is nowhere near enough granularity, either for the people with the agency to optimize the system (engineers, research scientists) or for the people who need to monitor and steer our resource usage (management, finance).

As most of our services move to Kubernetes, our infrastructure will become more multi-tenant than not. This is because, in Kubernetes, multiple services run on a single logical cluster. Historically, we could tell which service was running simply by reading the name of the cluster from our billing data, because there was only one service per cluster. However, this type of first-order attribution will no longer tell much of the story. All we will know is that usage happened on Kubernetes.

Second-order attribution. To get ahead of the issue, we have worked on introducing higher-order attribution for our multi-tenant platforms. When I run a service on a Kubernetes cluster, the usage is first attributed to the Kubernetes cluster, but it should also be linked back to the service that I was actually running on the cluster. That is second-order attribution.

Third-order attribution. It does not stop there. Consider a team training a model on the machine learning platform, which is itself running on Kubernetes. To capture the source of that usage, one needs both second- and third-order attribution.

Fourth-order attribution. Finally, imagine a data scientist running a query on the data platform, on Kubernetes, on behalf of a team to which that data scientist does not belong. Now we cannot trace the usage back to its source without fourth-order attribution.
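A sketch of how the bookkeeping behind these chains might look: each usage record carries its full attribution chain, and costs are rolled up at every level so that platform owners and end users each see the spend attributed to them. The record format, names, and dollar amounts are all hypothetical.

```python
from collections import defaultdict

# Each record carries its cost and an attribution chain, ordered from
# first-order (where the usage physically happened) down to whoever
# actually triggered it. All names and dollar amounts are hypothetical.
usage_records = [
    (120.0, ["kubernetes", "service:ride-pricing"]),                              # second-order
    (340.0, ["kubernetes", "ml-platform", "team:mapping"]),                       # third-order
    (75.0,  ["kubernetes", "data-platform", "team:data-science", "team:growth"]), # fourth-order
]

# Roll costs up at every level of the chain.
rollup = defaultdict(float)
for cost, chain in usage_records:
    for entity in chain:
        rollup[entity] += cost

for entity, cost in sorted(rollup.items(), key=lambda kv: -kv[1]):
    print(f"{entity:>18}: ${cost:,.2f}")
```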

It is possible to dream up scenarios that would require fifth-order attribution. Fortunately, the benefits eventually diminish, so at some point we can stop and go grab a coffee.


Once usage is well understood, measuring utilization is important to highlight resources that might be over-provisioned.

The next plane of measurement down from usage is utilization.

Measuring usage, as discussed in the prior section, is about what has been used by whom; measuring utilization is about the extent to which used resources have been utilized. It’s like if you order a cake and eat half of it — you’ve used an entire cake but only utilized half.

To make the discussion more concrete, a team might use three machines to run a service. What we don’t know is how effectively those machines are being utilized. If it turns out that the service does not experience degraded performance until it reaches 50% CPU utilization, and that it has never used more than 10% of its available CPU, then that team has over-provisioned the service by 200%, since running it on a single machine would cause no degradation in performance.
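Here is the arithmetic behind that 200% figure as a small sketch; the thresholds and observations are the illustrative ones from the example above.

```python
import math

# Back-of-the-envelope right-sizing for the example above; thresholds and
# observations are illustrative.
current_hosts = 3
peak_cpu_observed = 0.10   # highest per-host CPU utilization ever observed
safe_cpu_ceiling = 0.50    # utilization at which performance starts to degrade

# Express the total work in host-equivalents of CPU, then pack it so that
# no host would exceed the safe ceiling.
total_cpu_demand = current_hosts * peak_cpu_observed
needed_hosts = max(1, math.ceil(total_cpu_demand / safe_cpu_ceiling))

over_provisioning = (current_hosts - needed_hosts) / needed_hosts
print(f"Need {needed_hosts} host(s); currently over-provisioned by {over_provisioning:.0%}")
```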

At Lyft, we are in the process of spinning up our equivalent of an RSWE or Production Engineering organization, whose goal will be to build tooling that, among other things (read: making sure Lyft stays up, which is very important), does exactly this: understand how hot we can run our infrastructure without taking a hit to performance or stability.


Enforcing usage and utilization metrics through budgets is essential for a maturing company.

A key project in the first half of this year for Capacity has been building tooling around setting and reporting budgets for various groups within the company.

Budgets are important because we have found that our resource usage trends up monotonically if left unchecked. Engineers spin up new services and build increasingly complex and resource-intensive features, not to mention the latent storage growth that comes with archiving an ever-growing history of data.

Similar to budgeting and allocating headcount, we need to budget and allocate compute. Every single group in an organization like ours will always need more compute and more engineers. Sometimes, we need to have the muscle to push back.

In my experience, the only way to curb the creep is to strategically allocate resources based on budgets. That way, investment can be prioritized based on necessity and projected return on investment.
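As a sketch of what the reporting side of that tooling might produce, the snippet below compares each group’s attributed spend against its allocation and flags overruns. The groups and figures are hypothetical.

```python
# Sketch of a budget report: compare each group's attributed spend against
# its allocation and flag overruns. Groups and figures are hypothetical.

budgets = {"mapping": 250_000, "ml-platform": 400_000, "growth": 120_000}   # monthly, USD
actuals = {"mapping": 230_000, "ml-platform": 460_000, "growth": 119_000}

for group, budget in budgets.items():
    spend = actuals.get(group, 0)
    status = "OVER BUDGET" if spend > budget else "ok"
    print(f"{group:>12}: ${spend:,} of ${budget:,} ({spend / budget:.0%}) {status}")
```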


An important part of the work has been to build our tools strategically.

As the number of use cases our team addresses has grown (overviews of team spending, deeper dives on cluster utilization, views into multi-tenant environments such as Kubernetes and the machine learning platform, budgeting, chargebacks, and financial reporting), having a consistent data strategy has become increasingly important.

Today, we spend a lot of time thinking about which data taps we need to build to power our tools, what data to present, and how to present it.


Overall, working on building Capacity really does feel like operating a data center — through the pared-down abstractions that cloud service providers offer. We don’t worry about power consumption or the lifecycle of parts, but it is quite a supply chain and ops-heavy problem nonetheless, and there are plenty of nuances to learn and complexity to manage.

 