Wednesday, January 17, 2007

BoardGameGeek Ratings

I've posted some
analysis of BoardGameGeek ratings
on BGG. This post repeats the
contents of that post.

Since I do a lot of statistical analysis on the geek, one category of
questions I get a lot is about the "validity" of BGG Ratings.
I finally got around to writing up a bunch of the notes on this. Enjoy.

How much of a difference in rating/ranking is significant?

Well, it depends what you mean, but I can answer the question I think
most people are asking better than this one. Tests for statistical
"significance" are common, but most are based on assumptions that are
simply not valid for BGG ratings. This isn't to say such measures are
completely useless, but they shouldn't be treated as the final word.
The "ratings error" calculated in this manner is somewhere in the
ballpark of 0.2 points. For games with thousands and thousands of
ratings, it is much lower, below 0.1. For games with fewer ratings,
it's more like 0.5 or more. But, the assumptions that go into these
calculations don't hold for BGG, so the numbers are even more
approximate.
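
For the curious, the ballpark figures above come from the usual standard error of the mean. A naive sketch in Python looks something like this; it leans on the same independence assumptions that don't quite hold for BGG, so take the output as a rough guide only:

    import math

    def rating_standard_error(ratings):
        # Naive standard error of the mean rating, treating the ratings
        # as independent samples; that assumption doesn't really hold
        # for BGG, so the result is only a ballpark figure.
        n = len(ratings)
        mean = sum(ratings) / n
        variance = sum((r - mean) ** 2 for r in ratings) / (n - 1)
        return math.sqrt(variance / n)

    # With an illustrative spread of 1.5 points, 100 ratings gives an
    # error near 0.15, and 2000 ratings gives roughly 0.03.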

Easier to evaluate is the chance that you (a random BGG user) will
like a particular game better than another game, given their relative
ranks. For this, we don't need to make as many assumptions, as we can
look at the raw ratings distributions for those games. This still has
some issues with sample bias, but it's better. The answer is, knowing
nothing else: if the games are 50 ranks apart, there's a 60% chance
you'll like the higher-ranked one better. If the games are 250 ranks
apart, 70%. 700 ranks, 80%. 2000 ranks, 90%. Now, games at the very
top of the chart (roughly the top hundred) actually give higher
confidence. If the games being compared are near the top of the
chart, multiply the difference by a factor of about 2 to 5. So,
roughly speaking, there's about a 70% chance you'll like a game that's
50 to 125 ranks higher, if it's near the top.
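
If you want to reproduce that kind of comparison yourself, the idea is just an empirical estimate of the chance that a randomly drawn rating of one game beats a randomly drawn rating of the other. A minimal sketch in Python (not my exact code, just the idea):

    def chance_prefer_a(ratings_a, ratings_b):
        # Empirical estimate of the chance a random rating of game A is
        # higher than a random rating of game B; ties count as half.
        wins = ties = 0
        for a in ratings_a:
            for b in ratings_b:
                if a > b:
                    wins += 1
                elif a == b:
                    ties += 1
        total = len(ratings_a) * len(ratings_b)
        return (wins + 0.5 * ties) / total

The brute-force pairwise loop is fine here; even a few thousand ratings per game is only a few million comparisons.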

In other words, rankings/ratings are a rough estimate, but they're
far from meaningless. Between two games, with no other information,
you're more likely to like the one with a better rank. But, if one is
in a genre you like better, by a designer you like better, from a
publisher you like better or uses mechanics you like better, you'll
probably like it more, unless the other game outranks it by a few
hundred ranks.

Personally, I tend to look at game ranks in roughly 5 "star"
categories: 1-100 (5 stars), 101-500 (4 stars), 501-1000 (3 stars),
1001-2000 (2 stars) and 2001+ (1 star). If a game has a feature
(designer, publisher, mechanic, theme, etc.) I'm especially fond of, I
give it another star or two. Games with features I tend to dislike
get docked a star or two. Ratings/reviews from trusted users might
bump it up or down one star, but I don't find many
reviewers/raters who I can consistently trust. Then, if a game has 6
or more stars, I probably buy it before playing. 5 stars, I actively
seek it out to try it. 4 stars, I'm happy to give it a try. 3 stars,
I'm willing to give it a try. 2 stars, I have to be convinced. 1
star, I avoid it. For me, it works.
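
If it helps, the same heuristic written out as code (the adjustments are whatever you'd plug in for your own tastes, so the parameters here are just placeholders):

    def star_rating(rank, liked_features=0, disliked_features=0, trusted_bump=0):
        # Map BGG rank to a base star count, then adjust for features
        # you especially like or dislike and for trusted reviews.
        if rank <= 100:
            stars = 5
        elif rank <= 500:
            stars = 4
        elif rank <= 1000:
            stars = 3
        elif rank <= 2000:
            stars = 2
        else:
            stars = 1
        return stars + liked_features - disliked_features + trusted_bump

    # e.g. a top-100 game with one favorite mechanic:
    # star_rating(80, liked_features=1) -> 6, which I'd probably buy unplayed.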

Wouldn't the ratings be better/more accurate if we ignored ratings from inactive users?

They wouldn't be much different. In fact, they'd be only about as
different as you'd expect from any arbitrary reduction of sample size.
I have not yet identified anything to suggest that older/inactive
users' ratings differ in any substantial characteristic from
those of active/recent users.

What if we got rid of ratings that haven't been updated in a certain period?

No substantial change, until you make it a really recent cutoff, at
which point the "top" lists are all exclusively new games.

What if we just use the plain average instead of the Bayesian average, with a cutoff for minimum number of ratings?

No matter what value of cutoff you use, it introduces a large bias
toward games that have just barely enough ratings to make the cutoff. In
fact, for any particular value of the cutoff, roughly 20% of the top
games (whether top 10, top 100, whatever) are very close to the
cutoff. What this means is if you were to lower the cutoff a little,
you add in a bunch of games that were arbitrarily removed by having
the cutoff higher. If you raise the cutoff a little, you cut out a
bunch of games, equally arbitrarily. The Bayesian average provides a
"soft" cutoff.

Actually, if you're willing to raise the cutoff up to about 500
ratings, minimum, the effect goes away. That would leave only 422
games rated on the geek.
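
For anyone unfamiliar with how the "soft" cutoff works: a Bayesian average blends each game's real ratings with a fixed number of dummy ratings set at a prior value, so games with few ratings get pulled toward the prior instead of being dropped outright. A sketch in Python, with a made-up prior and dummy count rather than BGG's actual parameters:

    def bayesian_average(ratings, prior_mean=5.5, dummy_votes=100):
        # Blend the real ratings with dummy_votes phantom ratings at
        # prior_mean.  Games with few ratings sit near the prior and
        # drift toward their true average as ratings pile up, giving a
        # soft cutoff rather than a hard minimum-ratings threshold.
        # (Illustrative parameters only, not BGG's actual values.)
        n = len(ratings)
        return (prior_mean * dummy_votes + sum(ratings)) / (dummy_votes + n)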

What if we restricted it to people who have played at least 3 times?

Well, the average rating of games would go up a ton because people
don't tend to play bad games that many times. Specifically, the
average rating would go up by nearly a point.

It would also introduce a big bias against longer games, introduce a
bias toward 2-player games and reduce the sample size dramatically, as
many fewer people log plays than submit ratings. Other than those
shifts, many other results would remain very similar.

How about a "waiting period" before a game is rated/ranked?

Well, the Bayesian Average already has some of this effect. That
said, there is a distinct, early ratings bump many games get. That
is, when a game only has a few hundred ratings, it is often rated much
more highly than when it has many hundreds or over 1000 ratings. In
particular, it seems the average dropoff is about 0.3 points from 350
ratings to "steady state", which sometimes takes till 1000 ratings or
more. Before 350 ratings, there's a lot of variability in the
average.

What if we only count ratings from people who have rated, say, 300 games?

The top 11 games remain exactly the same, in slightly different order,
despite what would amount to a sample-destroying reduction in the number
of raters. Neat.

Wouldn't clusters somehow make this all so much better?

Oooh, probably.
