Monday, January 18, 2010

The Harry Potter problem

(Updated on 1/25 per comments below)

No, the problem I bring is not from the land of Harry Potter but rather about the effect this series of books has on similarity recommendations. Greg Linden wrote a post on this subject a while ago. The problem surfaces in item-to-item collaborative filtering, the kind reputed to have been used by Amazon. The intuition behind item-to-item collaborative filtering is that if a user buys product X and later product Y, then X and Y must be related or similar in some way. Now, the problem statement claims that the Harry Potter series is so highly popular that most customers would have bought it, hence it may not have much in common with the other books you may have bought. Nevertheless, if one were to apply plain collaborative filtering, it may pop up as one of the top similar products for most books, thereby diluting the relevance of the similarity feature. I had read Greg's post a while ago, but hadn't given it much thought until recently: so how severe is the Harry Potter problem?

One of the standard ways of applying collaborative filtering here is similarity measures, a popular one being cosine similarity. You represent all customers who bought product A as a vector, and do the same for product B. Then sim(A, B) = A.B / |A||B|. If A is the product a user has already bought, then you wish to find B such that it maximizes sim(A, B). One can see that if a Harry Potter book, H, had a large enough set of customers common with A, then it may drown out other products with fewer purchase data. Also, cosine similarity is symmetric, which does not appear to be a desirable property here: if say, "Seven habits of highly effective people" is 60% similar to a Harry Potter book, is the vice versa true as well? It wouldn't seem so, especially when this is posed as the question: which product will a customer buy next?

That formulation of the problem (what is the customer likely to buy next) leads to a probabilistic definition for recommendations: P(B|A) = P(AB)/P(A): what is the probability of a customer buying B, given that s/he purchased A? Since A is a given, we want to find B such it maximizes P(AB) in the RHS. As Richard Cole points out in comments below, data sparsity often poses problems here, i.e. you might find P(AB) to be zero for some Bs. To not discount such events that have not occurred in the training data yet, the numerator and denominator are often boosted with priors. He also points out other approaches below of dealing with data sparsity. I suppose quantifying the effect of the Harry Potter problem is hard because it depends entirely on the dataset, and getting access to real-world datasets is hard. The Netflix challenge (see below) is the best known publicly available commercial dataset. I guess I should try out these approaches on it!

Recommended reading:


  • The Netflix challenge was an interesting event on this front, here's the winning team's strategy. Note thought that the focus there was more on extracting temporal relations from a movie reviews database, than with dealing with the Harry Potter problem directly

5 comments:

  1. I read part of you blog post Nagender as the ending was kind of hard for me to interpret. That's some pretty deep thinking. I was just thinking that the distortion that the Harry Potter books is causing will happen with any must buy item wont it? If something is very popular then it can be bought along with another random unrelated item as you described for the Harry Potter books. A popular item is similar to so many others as you described because everyone is buying it. Therefore, I would assume there is no need to recommend such popular items as it would make no difference to the popularity of that item if it was included on a list of recommendations (since everyone knows what it is and is buying it already). Then you can filter it out of the list instead of including it right? Or are the harry potter books so popular that their popularity is one of a kind?

    ReplyDelete
  2. Hi Aditya,

    You analysis is correct: if Harry Potter books are indeed so popular then does it ever make sense to recommend them? We call them outliers (their sales number are far removed from the average sales distribution of products). And the approach you suggest is indeed a valid one: if it rarely makes sense to recommend such outliers, then why not just remove them from consideration entirely?

    I don't know if that's how Amazon, Netflix and others handle this problem. But the open literature on the subject leads me to believe that they perhaps don't exclude outliers. Why not? Maybe there are cases when it would be good to recommend a Harry Potter book. A simple example is if I were browsing Harry Potter DVDs. But I'm not so sure if leaving outliers in is the best approach either. The Netflix challenge presented some interesting ideas on recommendation systems.

    ReplyDelete
  3. Maybe it is just because I'm tired, but I too didn't follow the last bit. It is not necessarily the case that P(A|B) = P(B|A), so conditional probability is a candidate for a non-symetric similarity measure*. So one would expect that the probability of buying the "Blue Book" given a purchase of the "Lisp in Small Pieces" would be much higher than the probability of purchasing Harry Potter.

    That said of course your training set may be too small to get accurate measurements of P(AB) for many AB combinations and so you might choose the drop the top and bottom 15% of events by frequency from your similarity metric as is commonly practiced in the information retrieval community. Or you might go for top sellers, e.g. argmax P(A) :)

    One might also look to maximise P(A|B) / P(A) = P(AB) / P(A)P(B), that is deviations from a conditional probability assumption. See Richard J. Cole, Peter Bruza: A Bare Bones Approach to Literature-Based Discovery: An Analysis of the Raynaud's/Fish-Oil and Migraine-Magnesium Discoveries in Semantic Space. Discovery Science 2005: 84-98 for more details.

    * here measure is used counter its mathematical sense.

    ReplyDelete
  4. You are correct, Rich, thanks for catching that. I have corrected it above.

    I tried looking online for the paper you referenced above but couldn't find it. Do you have a copy of it?

    ReplyDelete
  5. I'll have to look around my old hard drives I couldn't find a PDF online.

    ReplyDelete