Wednesday, October 21, 2009

When A/B testing isn't as simple as A, or B

I was recently involved in the launch of a new algorithm that is expected to improve the business efficiency of a product at work (this algorithm employs machine learning, but more on that in a later post). As is often the case in software systems, measuring the gains realized from this new algorithm was essential: the key questions we wanted answered post-launch were - how much better does the new algorithm perform compared to the old one? Do the benefits match up to the gains we had projected before launch? For brevity, I'll refer to the new one as Alg-New and the old one as Alg-Old.

(Now, I cannot divulge much more about the nature of the business problem we were trying to solve, but that's not what this post is about anyway.) Describe the above problem to any engineer, and you're likely to be given the quick reply: A/B tests! I had read about A/B tests at various points, but never quite conducted one in production. The basic premise is -
  • You partition your target population (population here refers to the entity you're optimizing or operating on - often a customer, but it needn't be) into two sets: control and treatment. The treatment set is handled by Alg-New, while the control set continues to go through Alg-Old (a sketch of one such partitioning scheme follows this list)
  • The partitioning scheme should result in the control and treatment sets having the same or similar performance on the final evaluation metrics before launch, i.e. without Alg-New in the picture, there should be little difference in how the two population sets perform. An evaluation metric is a business/performance metric on which you'd measure the end performance of each of these sets
  • You expose only x% of the population to Alg-New (the treatment set), so the control set is unaffected by the change. This setup lets you compare the performance of Alg-New to Alg-Old over the same period of time, on populations that were not dissimilar to begin with
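To make the partitioning step concrete, here is a minimal sketch (in Python) of one common scheme: hash each entity's id to deterministically place it in control or treatment, with x% going to treatment. The entity ids, the 10% split, and the function name are all hypothetical - this is an illustration of the idea, not the scheme we actually used.

```python
import hashlib

TREATMENT_PERCENT = 10  # hypothetical: expose 10% of the population to Alg-New


def assign_bucket(entity_id):
    """Deterministically assign an entity to 'control' or 'treatment'.

    Hashing the id (instead of flipping a coin at request time) keeps each
    entity in the same bucket for the lifetime of the experiment.
    """
    digest = hashlib.md5(entity_id.encode("utf-8")).hexdigest()
    return "treatment" if int(digest, 16) % 100 < TREATMENT_PERCENT else "control"


# Example: split a (made-up) population of customer ids into the two sets.
population = ["cust-%04d" % i for i in range(1000)]
treatment = [e for e in population if assign_bucket(e) == "treatment"]
control = [e for e in population if assign_bucket(e) == "control"]
print(len(treatment), len(control))  # roughly a 10/90 split
```

Before launch, you'd then check that the two sets score similarly on the evaluation metrics; if they don't, you can salt the hash differently and re-draw the split.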
Seems simple enough at the outset, but the questions that came up while designing the A/B test were interesting. Here are some of the challenges we faced -
  1. What if the performance of the control and treatment sets on the evaluation metrics is not quite similar? This would seem to violate one of the premises of the experiment: that the control and treatment sets be similar in all known respects; only then can we assume that gains in the treatment set would translate to equivalent gains in the control set
  2. And further, what if this performance changes over time? Most A/B tests are likely to require a few days to gather enough data, so how the two sets perform over time also matters
We thought about this for a while, and here are the ideas we came up with -
  1. Freeze the measured population sets at a point in time, i.e. partition the population into static control and treatment sets with equivalent performance, and only use these static sets to measure performance post-launch. This prevents new entities from entering the control and treatment sets, so it would work only if the main reason for variation in the performance of the two sets were new entities being introduced over time. That assumption did not hold true in our case
  2. Some time after launch, switch the control and treatment sets. Assuming the two sets performed differently before launch, if Alg-New provides similar relative gains even after the switch, that is strong evidence that the dissimilar absolute performance of the two sets is not related to Alg-New itself. This is something we will be trying out
  3. Try to determine if a few outliers in the control or treatment sets are causing their performance to deviate. An important question is: outliers with respect to which attributes? We didn't find this approach very useful
  4. The final, and perhaps most important, approach is to use some measure of statistical significance to compare the distributions of (performance of control set - performance of treatment set) before and after launch. Inevitably, the performance of each set varies day-to-day, before as well as after launch, so it makes the most sense to treat the difference between the two as a variable; you then get two distributions for it, pre- and post-launch, and you test whether the two distributions are significantly different from one another (a sketch of such a test follows this list). I'm still learning some of the basics of determining statistical significance, and have found this StatSoft resource to be a decent starting point
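As a rough illustration of idea 4, here is a minimal sketch assuming you have one evaluation-metric value per day for each set. The daily numbers are made up, and scipy's two-sample t-test (with a Mann-Whitney U test as a non-parametric alternative) stands in for whichever significance test you eventually settle on.

```python
from scipy import stats

# Hypothetical daily evaluation-metric values for each set (made-up numbers).
pre_control = [102.0, 98.5, 101.2, 99.8, 100.4, 103.1, 97.9]
pre_treatment = [100.9, 97.2, 100.5, 98.6, 99.7, 101.8, 96.8]
post_control = [101.5, 99.0, 100.8, 98.9, 102.2, 100.1, 99.4]
post_treatment = [104.2, 101.9, 103.6, 101.0, 105.3, 102.7, 101.8]

# Treat the daily (control - treatment) difference as the variable of interest,
# which gives one distribution before launch and one after.
pre_diff = [c - t for c, t in zip(pre_control, pre_treatment)]
post_diff = [c - t for c, t in zip(post_control, post_treatment)]

# Two-sample t-test: are the pre- and post-launch difference distributions
# significantly different?
t_stat, p_value = stats.ttest_ind(pre_diff, post_diff)
print("t-test: t=%.3f, p=%.4f" % (t_stat, p_value))

# Non-parametric alternative, if you'd rather not assume normally
# distributed differences.
u_stat, u_p = stats.mannwhitneyu(pre_diff, post_diff, alternative="two-sided")
print("Mann-Whitney U: U=%.1f, p=%.4f" % (u_stat, u_p))
```

A significant shift in the difference distribution after launch is what you'd read as evidence that Alg-New moved the metric beyond the day-to-day noise.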
Recommended reading: the Exp Platform at Microsoft has published some good papers that describe the current state of the art; in particular, I would recommend "Seven Pitfalls to Avoid when Running Controlled Experiments on the Web" and "Practical Guide to Controlled Experiments on the Web"