analysis of BoardGameGeek ratings on BGG. This post repeats the

contents of that post.

Since I do a lot of statistical analysis on the geek, one category of

questions I get a lot are those about the "validity" of BGG Ratings.

I finally got around to writing up a bunch of the notes on this. Enjoy.

**How much of a difference in rating/ranking is significant?**

Well, depends what you mean, but I can answer the question I think

most people are asking better than this one. Tests for statistical

"significance" are common, but most are based on assumptions that are

simply not valid for BGG ratings. This isn't to say such measures are

completely useless, but they shouldn't be treated as the final word.

The "ratings error" calculated in this manner is somewhere in the

ballpark of 0.2 points. For games with thousands and thousands of

ratings, it is much lower, below 0.1. For games with fewer ratings,

it's more like 0.5 or more. But, the assumptions that go into these

calculations don't hold for BGG, so the numbers are even more

approximate.
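
The post doesn't spell out how the "ratings error" was computed. One common textbook approach is a margin of error built from the standard error of the mean, sketched below. The function name `rating_margin`, the 95% z-value and the standard deviation of 1.5 are all assumptions for illustration, not measured BGG values.

```python
import math

def rating_margin(std_dev, n_ratings, z=1.96):
    """Approximate margin of error for a game's mean rating.

    Assumes ratings are independent draws from a single distribution,
    which is exactly the assumption that doesn't really hold for BGG.
    """
    if n_ratings <= 0:
        raise ValueError("n_ratings must be positive")
    return z * std_dev / math.sqrt(n_ratings)

# std_dev=1.5 is an illustrative guess, not a measured BGG value.
for n in (35, 225, 5000):
    print(n, round(rating_margin(1.5, n), 2))
```

Under these assumptions the margin shrinks with the square root of the number of ratings, which is why heavily rated games come out so much tighter than lightly rated ones.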

Easier to evaluate is the chance that you (a random BGG user) will

like a particular game better than another game, given their relative

ranks. For this, we don't need to make as many assumptions, as we can

look at the raw ratings distributions for those games. This still has

some issues with sample bias, but it's better. The answer is, knowing

nothing else, if the games are 50 ranks apart, there's a 60% chance

you like the higher ranked one better. If the games are 250 ranks

apart, 70%. 700 ranks, 80%. 2000 ranks, 90%. Now, games at the very

top of the chart (roughly top hundred) actually give higher

confidence. If the games being compared are near the top of the

chart, multiply the difference by a factor of about 2 to 5. So,

roughly speaking, there's a 70% chance you'll prefer a game that's 50 to

125 ranks higher, if it's near the top.
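
As a rough sketch of the kind of comparison described above, here is one way to estimate the chance that a random rater prefers game A to game B from their raw ratings histograms. It treats the two ratings as independent draws and splits ties evenly; both of those choices, and the function name `preference_probability`, are assumptions of this sketch rather than the method actually used for the numbers quoted here.

```python
from itertools import product

def preference_probability(counts_a, counts_b):
    """Chance a random rating of A beats a random rating of B.

    counts_a / counts_b map a rating value to how many users gave it.
    Ties count as half a win (an assumption of this sketch).
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both games need at least one rating")

    wins = 0.0
    for (ra, ca), (rb, cb) in product(counts_a.items(), counts_b.items()):
        if ra > rb:
            wins += ca * cb        # every such pair prefers A
        elif ra == rb:
            wins += 0.5 * ca * cb  # split ties evenly
    return wins / (total_a * total_b)

# Made-up histograms, purely for illustration.
game_a = {8: 1, 6: 1}
game_b = {7: 1, 5: 1}
print(preference_probability(game_a, game_b))  # 0.75
```

Two games with identical histograms come out at 0.5, which is the "no information" baseline the percentages above should be read against.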

In other words, rankings/ratings are a

**rough** estimate. They're

far from meaningless. Between two games, with no other information,

you're more likely to like the one with a better rank. But, if one is

in a genre you like better, by a designer you like better, from a

publisher you like better or uses mechanics you like better, you'll

probably like it more, unless the other game outranks it by a few

hundred ranks.

**Personally,** I tend to look at game ranks in roughly 5 "star"

categories: 1-100 (5 stars), 101-500 (4 stars), 501-1000 (3 stars),

1001-2000 (2 stars) and 2001+ (1 star). If a game has a feature

(designer, publisher, mechanic, theme, etc.) I'm especially fond of, I

give it another star or two. Games with features I tend to dislike,

get docked a star or two. Ratings/reviews from trusted users might

bump it up or down one star, but for me, I don't find many

reviewers/raters who I can consistently trust. Then, if a game has 6

or more stars, I probably buy it before playing. 5 stars, I actively

seek it out to try it. 4 stars, I'm happy to give it a try. 3 stars,

I'm willing to give it a try. 2 stars, I have to be convinced. 1

star, I avoid it. For me, it works.
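
For anyone who wants to try the same scheme, the star heuristic above can be written down in a few lines. This is only a sketch: the function names and the single `adjustment` number (standing in for all the designer/publisher/mechanic/reviewer bumps) are assumptions made here for illustration.

```python
def base_stars(rank):
    """Map a BGG rank to the five star bands described above."""
    if rank < 1:
        raise ValueError("rank must be 1 or higher")
    if rank <= 100:
        return 5
    if rank <= 500:
        return 4
    if rank <= 1000:
        return 3
    if rank <= 2000:
        return 2
    return 1

def decide(rank, adjustment=0):
    """Return (stars, action) after applying personal adjustments.

    `adjustment` is the net number of stars added or docked for
    features you like or dislike (hypothetical parameter).
    """
    stars = base_stars(rank) + adjustment
    if stars >= 6:
        action = "buy before playing"
    elif stars == 5:
        action = "actively seek it out"
    elif stars == 4:
        action = "happy to try"
    elif stars == 3:
        action = "willing to try"
    elif stars == 2:
        action = "needs convincing"
    else:
        action = "avoid"
    return stars, action

print(decide(350, adjustment=2))   # a rank-350 game by a favorite designer
```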

**Wouldn't the ratings be better/more accurate if we ignored ratings from inactive users?**

They wouldn't be much different. In fact, they'd be only about as

different as you'd expect from any arbitrary reduction of sample size.

I have not yet identified anything to suggest that older/inactive

users' ratings differ in any substantial characteristic from

those of active/recent users.

**What if we got rid of ratings that haven't been updated in a certain period?**

No substantial change, until you make it a really recent cutoff, at

which point the "top" lists are all exclusively new games.

**What if we just use the plain average instead of the Bayesian average, with a cutoff for minimum number of ratings?**

No matter what value of cutoff you use, it introduces a large bias

toward games that have just barely enough to make the cutoff. In

fact, for any particular value of the cutoff, roughly 20% of the top

games (whether top 10, top 100, whatever) are very close to the

cutoff. What this means is if you were to lower the cutoff a little,

you add in a bunch of games that were arbitrarily removed by having

the cutoff higher. If you raise the cutoff a little, you cut out a

bunch of games, equally arbitrarily. The Bayesian average provides a

"soft" cutoff.

Actually, if you're willing to raise the cutoff up to about 500

ratings, minimum, the effect goes away. That would leave only 422

games rated on the geek.
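
To make the "soft" cutoff concrete, here is a minimal sketch of a generic Bayesian average next to a plain average. The prior mean of 5.5 and prior weight of 100 dummy ratings are placeholder assumptions; the post doesn't state the values BGG actually uses, and the two example games are made up.

```python
def plain_average(ratings_sum, n_ratings):
    """Ordinary mean of a game's ratings."""
    return ratings_sum / n_ratings

def bayesian_average(ratings_sum, n_ratings, prior_mean=5.5, prior_weight=100):
    """Mean after mixing in `prior_weight` dummy ratings of `prior_mean`.

    prior_mean and prior_weight are illustrative guesses, not BGG's
    real parameters.
    """
    return (prior_weight * prior_mean + ratings_sum) / (prior_weight + n_ratings)

# Two hypothetical games: (name, sum of ratings, number of ratings).
games = [
    ("niche hit", 30 * 9.0, 30),         # few ratings, very high average
    ("crowd favorite", 5000 * 8.2, 5000),
]
for name, total, n in games:
    print(name,
          round(plain_average(total, n), 2),
          round(bayesian_average(total, n), 2))
```

With a plain average the lightly rated game wins outright; with the dummy ratings mixed in, it gets pulled toward the prior while the heavily rated game barely moves. No game is ever dropped by a hard threshold, so nothing hinges on being just above or just below a cutoff.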

**What if we restricted it to people who have played at least 3 times?**

Well, the average rating of games would go up a ton because people

don't tend to play bad games that many times. Specifically, the

average rating would go up by nearly a point.

It would also introduce a big bias against longer games, introduce a

bias toward 2-player games and reduce the sample size dramatically, as

many fewer people log plays than submit ratings. Other than those

shifts, many other results would remain very similar.

**How about a "waiting period" before a game is rated/ranked?**

Well, the Bayesian Average already has some of this effect. That

said, there is a distinct, early ratings bump many games get. That

is, when a game only has a few hundred ratings, it is often rated much

more highly than when it has many hundreds or over 1000 ratings. In

particular, it seems the average dropoff is about 0.3 points from 350

ratings to "steady state", which sometimes takes till 1000 ratings or

more. Before 350 ratings, there's a

**lot** of variability in the

average.

**What if we only count ratings from people who have rated, say, 300 games?**

The top 11 games remain exactly the same, in slightly different order,

despite what would amount to a sample-destroying reduction in the number

of raters. Neat.

**Wouldn't clusters somehow make this all so much better?**

Oooh, probably.
