Monday, January 18, 2010

The Harry Potter problem

(Updated on 1/25 per comments below)

No, the problem I bring is not from the land of Harry Potter but rather about the effect this series of books has on similarity recommendations. Greg Linden wrote a post on this subject a while ago. The problem surfaces in item-to-item collaborative filtering, the kind reputed to have been used by Amazon. The intuition behind item-to-item collaborative filtering is that if a user buys product X and later product Y, then X and Y must be related or similar in some way. Now, the problem statement claims that the Harry Potter series is so highly popular that most customers would have bought it, hence it may not have much in common with the other books you may have bought. Nevertheless, if one were to apply plain collaborative filtering, it may pop up as one of the top similar products for most books, thereby diluting the relevance of the similarity feature. I had read Greg's post a while ago, but hadn't given it much thought until recently: so how severe is the Harry Potter problem?

One of the standard ways of applying collaborative filtering here is similarity measures, a popular one being cosine similarity. You represent all customers who bought product A as a vector, and do the same for product B. Then sim(A, B) = A.B / |A||B|. If A is the product a user has already bought, then you wish to find B such that it maximizes sim(A, B). One can see that if a Harry Potter book, H, had a large enough set of customers common with A, then it may drown out other products with fewer purchase data. Also, cosine similarity is symmetric, which does not appear to be a desirable property here: if say, "Seven habits of highly effective people" is 60% similar to a Harry Potter book, is the vice versa true as well? It wouldn't seem so, especially when this is posed as the question: which product will a customer buy next?

That formulation of the problem (what is the customer likely to buy next) leads to a probabilistic definition for recommendations: P(B|A) = P(AB)/P(A): what is the probability of a customer buying B, given that s/he purchased A? Since A is a given, we want to find B such it maximizes P(AB) in the RHS. As Richard Cole points out in comments below, data sparsity often poses problems here, i.e. you might find P(AB) to be zero for some Bs. To not discount such events that have not occurred in the training data yet, the numerator and denominator are often boosted with priors. He also points out other approaches below of dealing with data sparsity. I suppose quantifying the effect of the Harry Potter problem is hard because it depends entirely on the dataset, and getting access to real-world datasets is hard. The Netflix challenge (see below) is the best known publicly available commercial dataset. I guess I should try out these approaches on it!

Recommended reading:


  • The Netflix challenge was an interesting event on this front, here's the winning team's strategy. Note thought that the focus there was more on extracting temporal relations from a movie reviews database, than with dealing with the Harry Potter problem directly