On Iterating at Facebook
During my time at Facebook, I learned a whole lot about A/B testing. This post is a brain dump in which I go over the importance of A/B testing, the gist of the components and infrastructure necessary to support an A/B testing framework, and how A/B testing infrastructure is part of the broader data infrastructure that powers machine learning on the web.
A big boon when it comes to building web applications is the ability to iterate quickly. You build a version of your product that will do the trick. Afterwards, it is super easy to ship out incremental improvements, so you can make your product progressively better.
These changes can either be visible UI changes, such as changing the color of a button, or invisible backend changes relating to the algorithms that power whatever it is you are building.
The way we measure whether a product change is better is through something called an A/B test. If you ask people who work at Facebook, you will almost certainly hear that Facebook is a very data-driven company. What that means is that at Facebook, every change starts and ends with A/B testing in mind.
An A/B test generally has two or more groups, each containing a random sample of people. At least one group in an A/B test should receive the current default product; that group is dubbed the control. The groups should be large enough to guarantee that, with everyone using the product at the same time, we can attribute any differences in behavior to the different parameters supplied.
This type of iteration is impossible without a framework under which to test whether changes are making a positive impact.
How do we know whether our changes are making a positive impact, anyway? Metrics. Metrics are just aggregated numbers (or distributions of numbers) that quantify different aspects of how people are using your product. Examples of metrics in the social (and e-commerce) space are time spent, purchases, clicks, revenue, likes, and follows, among others.
All of those metrics can be computed from behavior. Some metrics might not be computed explicitly from actions taken, but rather be provided directly by people, such as when they rate your application or report a piece of content for being offensive. A third, invisible kind of metric relates to the performance of your application: things like page load time and the CPU consumed to handle a request.
Together, these three types of metrics can give you a good sense of the state of your product in both control and experimental groups.
You can think of your task during the iteration phase as optimizing for your top-line metrics with certain constraints on performance metrics.
Top-line metrics are the ones you really care about. The philosophy with top-line metrics is that they should be easy to record, hard to game, and track your business goals closely. I won't drone on here, but for a company like Yelp, tracking how many times people rate a restaurant is probably better than tracking how many times people visit a restaurant, because it is infeasible to know how many people actually visit a restaurant, let alone whether they visited because of your product.
In general, anything relating to positive actions taken within your product is an example of a metric that is easy to track and hard to game. Think time spent, likes, follows, content creation, etc.
Hopefully I have convinced you that A/B testing is important. In the next section, we’ll take a moderately technical dive into the functional aspects of building an A/B testing framework.
From a high level, your system needs to be able to perform the following tasks.
- Assign users to an experiment.
- Within the experiment, assign a user to a group, e.g. control or experimental.
- Given the group and experiment that the user is assigned to, pass the relevant set of parameters back to the user – fast. We don’t want to incur a lot of latency as a result of our A/B testing framework.
- Log data about user activity. Each data point of user activity should reference some experiment. For example, each logged action can carry a session ID that joins against a table mapping session IDs to experiments.
- Finally, data pipelines should be built to aggregate the relevant metrics per group so that performance can be tracked.
Let’s take a deeper look at how we can accomplish each of our requirements.
One to Three: Assigning An Experiment. Points one through three can be handled with something like a user-to-i64 hashing function that lets you bucketize your users. Practically speaking, you can assign a user to an experiment using hash(user) % #experiments and to a group using hash(user) % #groups.
Once you map your user to an experiment and group, you can use a simple key/value store like Memcache to retrieve the parameters for the user and pass them back. The transaction should cost something in the single-digit milliseconds. After all, there are rather few sets of parameters to choose from and they can easily be cached on the server.
Note: a good hash function is one that is independent of the hash functions used in other portions of your application. Such a hash function ensures that experimental data will not be polluted by influence from other processes that use hash(user).
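To make points one through three concrete, here is a minimal sketch in Python. The experiment name, parameter shapes, and the in-process dict standing in for Memcache are all made up for illustration; salting the hash with the experiment name is one simple way to keep the bucketing independent of other uses of hash(user) in your application.

```python
import hashlib

# Hypothetical experiment config; in practice the parameters would live in a
# fast key/value store like Memcache rather than an in-process dict.
EXPERIMENTS = {
    "ranking_tweak": {
        "num_groups": 2,
        "params": [
            {"use_new_ranker": False},  # group 0: control
            {"use_new_ranker": True},   # group 1: experimental
        ],
    },
}

def bucket(user_id: str, salt: str, num_buckets: int) -> int:
    """Hash the user id with a salt so this bucketing stays independent of
    any other hash(user) used elsewhere in the application."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def get_params(user_id: str, experiment: str) -> dict:
    config = EXPERIMENTS[experiment]
    group = bucket(user_id, salt=experiment, num_buckets=config["num_groups"])
    return config["params"][group]

print(get_params("user_42", "ranking_tweak"))  # one of the two parameter sets
```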
An Aside on Logging
Logging metrics is an art in itself.
Nonetheless, there is some pithy wisdom I can offer. You can log events in aggregate, i.e. when the session ends, or as soon as the relevant event occurs. In general, logging events as they occur is more reliable but requires greater overhead in your data store.
For example, you can easily log impressions as they occur. However, it might be more efficient to log the set of impressions after a client terminates a session. Waiting incurs additional overhead on the backend because data needs to be buffered for every active session, and when your application is distributed over many hosts, doing so can become quite tricky. Thus, batching events is most clearly worthwhile on the client side, where CPU and network utilization are precious.
From the client, only log what cannot be reconstructed on the backend. An example of something you should log on the client side is impressions while scrolling; something you should not log on the client side is a follow. Follows must necessarily go through your server, so there is no reason to waste client-side logging on them.
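As a rough illustration of client-side batching, here is a toy impression logger in Python. The endpoint, payload shape, and buffer size are assumptions for illustration, not a real API.

```python
import json
import time
import urllib.request

class ImpressionLogger:
    """Buffers impressions on the client and flushes them in one batch,
    either when the buffer fills up or when the session ends."""

    def __init__(self, session_id: str, endpoint: str, max_buffer: int = 100):
        self.session_id = session_id
        self.endpoint = endpoint      # hypothetical logging endpoint
        self.max_buffer = max_buffer
        self.buffer = []

    def log_impression(self, item_id: str) -> None:
        self.buffer.append({"item_id": item_id, "ts": time.time()})
        if len(self.buffer) >= self.max_buffer:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        payload = json.dumps({
            "session_id": self.session_id,
            "impressions": self.buffer,
        }).encode()
        req = urllib.request.Request(
            self.endpoint, data=payload,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # one round trip for the whole batch
        self.buffer = []
```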
Four and Five: Processing Our Data. Great. Now that we know how to get our data into some tables, we just need to figure out the infrastructure to analyze it.
Some examples of tables you might have at this point are:
- Impressions
- Likes
- Purchases
- Load Data Per Request (time, failures, etc)
We can join on our tables as we please down the road as long as we have some way of identifying the session, experiment, and group each record belongs to.
It is important to note that this data infrastructure is not meant for production systems. Even though we might have a Purchases Hive table, you should not use the data to furnish requests in real-time. It would be slow and your logging probably is not robust enough to handle the authentication of purchases. The purpose of Hive is to store large amounts of data to generate offline insights for things like A/B testing and machine learning.
Hive is a data warehouse built on top of the Hadoop Distributed File System (HDFS) that allows you to process massive amounts of data by converting SQL-like queries (HiveQL) into MapReduce jobs over HDFS.
For example, we can compute the like-through rate for a particular experiment by:
- filtering likes and impressions down to the relevant experiments,
- joining likes to impressions on session ID, then
- computing the like-through rate on a per-experiment basis.
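As a sketch of what that computation looks like, here is the same join and aggregation with pandas standing in for HiveQL, over toy tables with made-up column names:

```python
import pandas as pd

# Toy stand-ins for the Hive tables.
impressions = pd.DataFrame({
    "session_id": ["s1", "s1", "s2", "s3", "s3", "s3"],
    "experiment": ["exp1"] * 6,
    "group":      ["control", "control", "test", "test", "test", "test"],
    "item_id":    ["a", "b", "a", "a", "b", "c"],
})
likes = pd.DataFrame({
    "session_id": ["s1", "s3"],
    "item_id":    ["a", "b"],
})

# Join likes onto impressions by session (and item), keeping impressions with
# no like as non-events, then compute the like-through rate per group.
merged = impressions.merge(likes.assign(liked=1),
                           on=["session_id", "item_id"], how="left")
merged["liked"] = merged["liked"].fillna(0)

like_through_rate = merged.groupby(["experiment", "group"])["liked"].mean()
print(like_through_rate)  # control: 0.50, test: 0.25 on this toy data
```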
If, for example, we know the distribution of our like-through rate to be Gaussian, we can use our actual data points (or some randomly sampled subset) to compute the mean and standard deviation of the metric for each experiment group, and from those the z-score and p-value of the difference between the groups.
Thus we can assert things like: the mean of metric foo for group A is X and the mean for group B is Y. We are 99.5% confident that foo is higher in A than in B. If we were trying to optimize for foo, we could be confident that our change was a good one and put it out into production.
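As a sketch, assuming the per-group means and variances have already been computed by the pipeline, the z-score and one-sided p-value for "foo is higher in A than in B" come from a standard two-sample z-test. The summary numbers below are made up.

```python
import math
from statistics import NormalDist

def two_sample_z(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """z-score and one-sided p-value for the hypothesis that the
    metric is higher in group A than in group B."""
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    z = (mean_a - mean_b) / standard_error
    p = 1 - NormalDist().cdf(z)  # one-sided p-value
    return z, p

# Made-up summary statistics for illustration.
z, p = two_sample_z(mean_a=0.21, var_a=0.04, n_a=50_000,
                    mean_b=0.20, var_b=0.04, n_b=50_000)
print(f"z = {z:.2f}, one-sided p = {p:.5f}")
```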
At Facebook, we could automate the computation of deltas like the one described above by defining a data pipeline: a sequence of queries to be run and intermediate tables to be created in order to get to those magic numbers (mu, sigma squared) that we care about for each metric. I am sure a commercial product that does something similar on top of Hive exists. Shoot me a message if you know what it is called.
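A minimal sketch of such a pipeline, reusing the toy impressions and likes tables from the earlier snippet: each step is a function that reads the tables produced so far and materializes a new intermediate table, and the pipeline is just the ordered list of steps. All of the step and table names here are invented for illustration.

```python
def filter_to_experiment(tables):
    # Keep only the experiment we care about (hypothetical name "exp1").
    tables["exp_impressions"] = tables["impressions"].query("experiment == 'exp1'")

def join_likes(tables):
    # Intermediate table: impressions annotated with whether they were liked.
    tables["joined"] = tables["exp_impressions"].merge(
        tables["likes"].assign(liked=1),
        on=["session_id", "item_id"], how="left").fillna({"liked": 0})

def summarize(tables):
    # The magic numbers per group: mean, variance, and sample size.
    grouped = tables["joined"].groupby("group")["liked"]
    tables["summary"] = grouped.agg(mu="mean", sigma_sq="var", n="count")

PIPELINE = [filter_to_experiment, join_likes, summarize]

def run_pipeline(tables):
    for step in PIPELINE:
        step(tables)          # each step materializes an intermediate table
    return tables["summary"]

print(run_pipeline({"impressions": impressions, "likes": likes}))
```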
In summary, tracking metrics is necessary for iterating on a web application. To do it effectively, you need quite a bit of data infrastructure, but the gist of what it needs to do is relatively intuitive. That data infrastructure is useful for things beyond computing metrics. For example, the same Hive tables that track your baseline metrics are critical for generating training data for your machine learning models.
You can also use the logging tables for ad hoc analysis of anything that you want to know about what is going on in your product – like, how many people looked at any particular pair of shoes. Hm, maybe that’s something you want to be computing all the time with a pipeline.
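Answering a one-off question like that directly off the logging tables might look like this (toy table, assumed column names):

```python
import pandas as pd

# Toy impressions table; assumes user_id and item_id columns are logged.
impressions = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "item_id": ["shoes_123", "shoes_123", "bag_7", "shoes_123"],
})

# How many distinct people looked at this particular pair of shoes?
print(impressions.query("item_id == 'shoes_123'")["user_id"].nunique())  # 3
```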
Note: the type of pipelines described here are called offline pipelines, i.e. ones that run on a schedule and process data in batches rather than in real time. Even more in vogue are online pipelines that constantly run computations and update the relevant derived tables – which in turn constantly update your metrics or add new training data to your model, allowing you to maybe-probably catch bubbling trends more quickly. The architecture of those is a little different and varies greatly depending on their purpose. A topic for another day!
This is where I should point out that working at Facebook is awesome and the systems in place are great to use and make the job a pleasure. Contact me if you are interested in joining!