Hi there
Not sure if this is the right place to ask but I couldn't find any
other general mysql newsgroups. If anyone could have a look through
and help me out that'd be really great...
Basically what i'm looking to do is a feature like the thing amazon
does where it encourages users to rate different products (in my case
though it's bands on a music site they're rating). Then it generates
recommendations for the users based on what people who like the same
stuff as you also like. It has to be done from an existing mysql
database which has tables like so
users:
userid (primary key)
other info
bands:
bandid (primary key)
name
other info
ratings:
userid
bandid
rating
So a ratings object is created when a user from the users table rates
a band from the bands table. Say we've built up a decent amount of
ratings and they're stored as integers, -10 for really hates
something, 0 for indifferent, 10 for really loves it. say.
I need a query which will print out for a particular user a list of
bands which they /haven't/ already rated themselves, but where other
people they have a good level of agreement with rate them well. ie
recommendations for what the user might like. Maybe i'm doing the
wrong thing in trying to put this all into one query. I suspect it'd
run terribly slowly as it stands, but here's my first attempt. I've
tried to give the tables/columns meaningful names.
select
other_bands.name,
other_bands.bandid,
sum(your_ratings_of_bands.rating
other_peoples_ratings_of_same_bands
these_peoples_ratings_of_other_bands) as total_recommendation_score
from
ratings as your_ratings_of_bands,
ratings as other_peoples_ratings_of_same_bands,
ratings as these_peoples_ratings_of_other_bands
bands as other_bands
where
your_ratings_of_bands.userid = 1234 and
other_peoples_ratings_of_same_bands.userid != 1234 and
other_peoples_ratings_of_same_bands.bandid =
your_ratings_of_bands.bandid and
these_peoples_ratings_of_other_bands.userid =
other_peoples_ratings_of_same_bands.userid and
these_peoples_ratings_of_other_bands.bandid !=
your_ratings_of_bands.bandid and
other_bands.bandid = these_peoples_ratings_of_other_bands.bandid
group by
other_bands.bandid
order by
total_recommendation_score desc
limit 20;
Say the user we're finding recommendations for has userid 1234.
The idea behind it was, first it finds all the ratings that user has
made, in a table your_ratings_of_bands. Then, by a join it finds for
each band you've rated, all the ratings by /different/ people of the
same band, in the table other_peoples_ratings_of_same_bands. Then
finally it finds other ratings which the same group of people have
made, of /different/ bands (ie ones you haven't rated yourself).
Grouping by this last group of other bands, it calculates a kind of
total agreement score which is the sum over all the bands you like and
all the other people that have rated those bands of:
the score you gave a band the score someone else gave that band
(a measure of how well you agree with that person in this case)
the score this person gave the other band you haven't heard of
which should when summed up for the other band give an indication of
how well suited that other band are for you. It works both ways
because of the positive and negative ratings, so if someone hates
eveything you like and also hates a another band, that's jst as much
an endorsement as if they like everything you like. in practise I
might put some weighting in so positive agreement is worth more than
negative agreement so to speak, but for now i'm just interested in
making the query go fast enough. As it uses kinda 2 consecutive joins
on the same table i suspect it'll be slow. I'd like it to be scalable
for numbers of users and bands in the thousands.
Another idea I had which might be better, is building up some kind of
index table as a regular overnight cron job, and then using this to
generate the ratings more speedily. But it's running on a shared
server and I can't run stuff that ties up the mysql server for ages
and ages as a cron job. The only helpful things I could think to index
like that would be a measure of agreement in ratings between each pair
of authors, and/or a measure of agreement in who likes them between
each pair of bands... although that would mean regenerating a pretty
massive table every night, and i'm not sure how to do that efficiently
using just selects and inserts. maybe each time someone makes a rating
it could add data to various index tables as it goes kinda thing? any
ideas?
any help gratefully recieved
--
Matthew Willson
http://www.drownedinsound.com/