Monday, January 18, 2010

The Harry Potter problem

(Updated on 1/25 per comments below)

No, the problem I'm bringing up is not from the land of Harry Potter, but rather about the effect this series of books has on similarity recommendations. Greg Linden wrote a post on this subject a while ago. The problem surfaces in item-to-item collaborative filtering, the kind reputed to have been used by Amazon. The intuition behind item-to-item collaborative filtering is that if a user buys product X and later product Y, then X and Y must be related or similar in some way. Now, the claim is that the Harry Potter series is so popular that most customers would have bought it, and hence it may not have much in common with the other books you've bought. Nevertheless, if one were to apply plain collaborative filtering, it may pop up as one of the top similar products for most books, thereby diluting the relevance of the similarity feature. I had read Greg's post a while ago, but hadn't given it much thought until recently: just how severe is the Harry Potter problem?

One of the standard ways of applying collaborative filtering here is via similarity measures, a popular one being cosine similarity. You represent all customers who bought product A as a vector, and do the same for product B. Then sim(A, B) = A·B / (|A| |B|). If A is a product a user has already bought, you want to find the B that maximizes sim(A, B). One can see that if a Harry Potter book H has a large enough set of customers in common with A, it may drown out other products with less purchase data. Also, cosine similarity is symmetric, which does not appear to be a desirable property here: if, say, "Seven Habits of Highly Effective People" is 60% similar to a Harry Potter book, is the reverse true as well? It wouldn't seem so, especially when the question is posed as: which product will a customer buy next?
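
To make this concrete, here's a minimal sketch (in Python, with made-up product names and purchase data) of item-to-item cosine similarity computed from binary customer-purchase vectors. Note how the wildly popular item scores well against everything, even after the normalization by vector lengths:

```python
# A minimal sketch of item-to-item cosine similarity. All data below is made up.
import math
from collections import defaultdict

# purchases: customer -> set of products bought (toy data)
purchases = {
    "alice": {"harry_potter_1", "seven_habits"},
    "bob":   {"harry_potter_1", "lean_startup"},
    "carol": {"harry_potter_1", "seven_habits", "lean_startup"},
    "dave":  {"harry_potter_1"},
}

# For each product, collect the set of customers who bought it.
buyers = defaultdict(set)
for customer, items in purchases.items():
    for item in items:
        buyers[item].add(customer)

def cosine_sim(a, b):
    """With binary customer vectors, A·B is the number of common buyers and
    |A| is sqrt of the number of buyers of A, so:
    sim(A, B) = |buyers(A) ∩ buyers(B)| / (sqrt(|buyers(A)|) * sqrt(|buyers(B)|))."""
    common = len(buyers[a] & buyers[b])
    return common / (math.sqrt(len(buyers[a])) * math.sqrt(len(buyers[b])))

# The hugely popular item shares customers with nearly everything,
# so it scores well against most products despite the normalization.
print(cosine_sim("seven_habits", "harry_potter_1"))  # ~0.71
print(cosine_sim("seven_habits", "lean_startup"))    # 0.5
```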

That formulation of the problem (what is the customer likely to buy next?) leads to a probabilistic definition for recommendations: P(B|A) = P(AB)/P(A), i.e. what is the probability of a customer buying B, given that s/he purchased A? Since A is given and P(A) is fixed, we want to find the B that maximizes P(AB), the numerator on the RHS. As Richard Cole points out in the comments below, data sparsity often poses problems here, i.e. you might find P(AB) to be zero for some Bs. So as not to completely discount events that simply haven't occurred in the training data yet, the numerator and denominator are often boosted with priors. He also points out other approaches to dealing with data sparsity below. I suppose quantifying the effect of the Harry Potter problem is hard because it depends entirely on the dataset, and getting access to real-world datasets is hard. The Netflix challenge (see below) is the best-known publicly available commercial dataset. I guess I should try out these approaches on it!
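
Here's a similarly minimal sketch of the probabilistic version: estimating P(B|A) from co-purchase counts, with additive (Laplace-style) smoothing as one simple way of adding priors. The data and the smoothing constant are made up, and this isn't necessarily the exact scheme suggested in the comments:

```python
# A toy sketch of estimating P(B | A) = count(A and B bought together) / count(A),
# smoothed so unseen pairs aren't scored as exactly zero. All data is made up.
from collections import defaultdict
from itertools import permutations

purchases = [  # each entry is one customer's set of purchases (toy data)
    {"harry_potter_1", "seven_habits"},
    {"harry_potter_1", "lean_startup"},
    {"harry_potter_1", "seven_habits", "lean_startup"},
    {"harry_potter_1"},
]

item_count = defaultdict(int)   # count(A)
pair_count = defaultdict(int)   # count(A, B) over ordered pairs

for basket in purchases:
    for item in basket:
        item_count[item] += 1
    for a, b in permutations(basket, 2):
        pair_count[(a, b)] += 1

ALPHA = 1.0                     # smoothing strength, i.e. a prior pseudo-count
NUM_ITEMS = len(item_count)

def p_b_given_a(a, b):
    # Smoothed estimate: (count(A,B) + alpha) / (count(A) + alpha * #items)
    return (pair_count[(a, b)] + ALPHA) / (item_count[a] + ALPHA * NUM_ITEMS)

# Rank candidate next purchases for a customer who bought "seven_habits".
candidates = [i for i in item_count if i != "seven_habits"]
for b in sorted(candidates, key=lambda b: p_b_given_a("seven_habits", b), reverse=True):
    print(b, round(p_b_given_a("seven_habits", b), 3))
```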

Recommended reading:


  • The Netflix challenge was an interesting event on this front; here's the winning team's strategy. Note though that the focus there was more on extracting temporal relations from a movie ratings database than on dealing with the Harry Potter problem directly.

Sunday, November 1, 2009

Win 7 or Ubuntu?

I recently assembled a new PC for home use. My old PC was a Celeron with an on-board graphics card. I had it running dual boot between Windows XP and Ubuntu (KDE), and was using Kubuntu as my primary OS. But it was woefully inadequate at keeping up with the demands of today's websites: CPU usage was almost always high whenever Firefox had been running for a while, and memory was the next major resource constraint.

Here are the parts I used for my new PC -

  • Quad core Intel i5 CPU
Newegg.com was my primary resource for researching the configuration and it worked quite well. My motherboard choice was restricted by two other constraints: it had to have EIDE support and its form factor had to be micro ATX. Next, on to assembling the PC: after stumbling through the initial steps, I found the process really wasn't that hard. Whenever I got stuck, I picked up some very useful tips by calling the respective parts' technical support lines.

The next decision to be made was regarding the OS: I had already decided to install both Windows 7 and Kubuntu, so the question really was which one would be my primary OS. After playing around with each for a few hours, I actually found Windows 7 to be the more lightweight of the two. While most of the changes in Windows 7 may be cosmetic, I must admit that at first glance it feels slicker and snappier than Kubuntu 9. Plus there are applications that still don't "just work" on Linux, e.g. getting Flash player to work on a 64-bit machine. So Windows 7 it is.

Update: Though the experience of putting together a PC was enlightening in some ways, I don't feel like I learnt a whole lot that I might find useful as a software/applications engineer. For instance, it was good to learn about EIDE vs SATA drives, or ATX and micro ATX motherboards, but the next time, I think I might just buy a PC off the shelf.

Wednesday, October 21, 2009

When A/B testing isn't as simple as A, or B

I was recently involved in the launch of a new algorithm, which is expected to improve the business efficiency of a product at work (this algorithm employs machine learning, but more on that in a later post). As is often the case in software systems, measuring the gains realized from this new algorithm was essential: the key questions we wanted answered post launch were - how much better does the new algorithm do as compared to the old one? Do the benefits match up to the gains we had projected before launch? For brevity, I'll refer to the new one as Alg-New and the old one as Alg-Old.

(Now, I cannot divulge much more about the nature of the business problem we were trying to solve, but that's not what this post is about anyway.) Describe the above problem to any engineer, and you're likely to be given the quick reply: A/B tests! I had read about A/B tests at various points, but never quite conducted one in production. The basic premise is -
  • You partition your target population (population here means the entities you're optimizing or operating on; these are often customers, but needn't necessarily be) into two sets: control and treatment. The treatment set is handled by Alg-New, while the control set continues to go through Alg-Old
  • The partitioning scheme should result in the control and treatment sets having the same or similar performance on the final evaluation metrics before launch, i.e. had Alg-New not been introduced, there should be little difference in how the two population sets perform. An evaluation metric is a business/performance metric on which you'd measure the end performance of each of these sets
  • You expose only x% of the population to the treatment set, so no changes affect the control set. This setup lets you compare the performance of Alg-New to that of Alg-Old over the same period of time, on populations that were not dissimilar to begin with (a sketch of one possible partitioning scheme follows this list)
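
For illustration, here's a minimal sketch (hypothetical names and constants) of one way to do the partitioning: hash a stable entity id into buckets and send x% of the buckets to treatment, so the same entity always sees the same algorithm:

```python
# A sketch of a deterministic partitioning scheme: hash a stable entity id into
# buckets; the first TREATMENT_PERCENT buckets go to Alg-New, the rest to Alg-Old.
import hashlib

TREATMENT_PERCENT = 10          # the "x%" exposed to Alg-New (made-up value)
NUM_BUCKETS = 100
SALT = "alg-new-launch-1"       # changing the salt re-randomizes the split for a new experiment

def bucket(entity_id: str) -> int:
    digest = hashlib.md5((SALT + entity_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def assignment(entity_id: str) -> str:
    return "treatment" if bucket(entity_id) < TREATMENT_PERCENT else "control"

# The same entity always lands in the same set across requests.
for entity in ["customer-42", "customer-43", "customer-44"]:
    print(entity, assignment(entity))
```
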
It seems simple enough at the outset, but I thought the questions that came up during the design of the A/B test were interesting. Here are some of the challenges we faced while designing the experiment -
  1. What if the performance of control and treatment sets on the evaluation metrics is not quite similar? This would seem to violate one of the premises of the experiment: that control and treatment sets be similar in all known respects, only then can we assume that gains in the treatment set would translate to equivalent gains in the control set
  2. And further, what if this performance changes over time? Most A/B tests are likely to require a few days to gather enough data, hence performance over time is also relevant
We thought about this for a while, and here are the ideas we came up with -
  1. Freeze the measured population sets at a point in time, i.e. partition the population into static control and treatment sets with equivalent performance, and only use these static sets to measure performance post launch. This excludes new entities from being introduced into the control and treatment sets, so it would work iff the main reason for variation in performance of the control and treatment sets were new entities being introduced over time. This assumption did not hold true in our case
  2. After a certain period after launch, switch the control and treatment sets. Assuming the control and treatment sets were dissimilar in performance before launch, if Alg-New can provide similar relative gains even after switching the sets, it shows that the cause of their dissimilar absolute performance is not related to Alg-New itself. This is something we will be trying out
  3. Try to determine if there are a few outliers in the control or treatment sets causing their performance to deviate. An important question is: outliers based on what attributes? We didn't find this approach to be that useful
  4. The final, and perhaps most important, approach is to use some measure of statistical significance to compare the distributions of (performance of control set - performance of treatment set) before and after launch. Inevitably, the performance of each set will vary day to day, before as well as after launch, so it makes most sense to treat the difference between the two as a variable; you then get two distributions for it, pre and post launch, and analyze whether the two are significantly different from one another (a small example follows this list). I'm still learning some of the basics of determining statistical significance, and have found this StatSoft resource to be a decent starting point
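
As a small, purely illustrative example of point 4, here's how one might compare the pre- and post-launch distributions of (control - treatment) using a Welch's two-sample t-test in SciPy. The daily numbers are made up, and this particular test is just one reasonable choice of significance test:

```python
# Treat the daily difference (control metric - treatment metric) as the variable
# of interest, collect one sample of it before launch and one after, and test
# whether the two samples differ significantly. All numbers below are made up.
from scipy import stats

pre_launch_diff  = [0.4, -0.1, 0.3, 0.2, 0.0, 0.5, 0.1]        # before Alg-New
post_launch_diff = [-1.2, -0.8, -1.5, -0.9, -1.1, -1.3, -0.7]  # after Alg-New

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(pre_launch_diff, post_launch_diff, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The pre- and post-launch difference distributions differ significantly.")
else:
    print("No significant difference detected; the apparent gain may be noise.")
```
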
Recommended reading: The Exp Platform at Microsoft has published some good papers that describe the current state of the art; in particular, I would recommend Seven Pitfalls to Avoid when Running Controlled Experiments on the Web and Practical Guide to Controlled Experiments on the Web.