Tuesday, January 4, 2005

Del.icio.us Categories

Updated: The categories it was picking weren't very good. Part of the reason was that the hierarchy wasn't visible, and it wasn't very good at picking "cut-off" points in the hierarchy. I made a new version that shows the hierarchy in the same style as the earlier clustering example. It's much better. The thicker, darker lines correspond to more strongly associated URLs. del.icio.us hierarchy. Plus, both versions now use recent popular links, which improves its knowledge a bit (more popular means more tags to learn from).

Continuing in the same vein as my earlier experiment with delicious clusters, I decided to turn it around and try clustering the URLs, which essentially means categorizing them. It's been pointed out that a problem with these tagging interfaces (aka folksonomies) is the lack of synonym handling and the like. My clustering system helps address this, and it also handles related but non-synonymous terms. In any case, using the clustered taxonomy generated from the tags, it is possible to cluster the URLs into categories. I created a first pass of a del.icio.us categories page. It works OK, and I believe I know how to make it work better.

What it does is take about 150 recent postings from del.icio.us, strip out the ones with no tags and the ones with only unique tags, and apply the historical tag clustering database to identify which "tag clusters" each URL is part of. From there, it identifies which URLs' tag clusters are similar to one another and creates URL clusters based on that. These are the categories. It's got some basic heuristics for determining when to stop glomming categories together. Once it has all the categories, it tries to pick a good name for each one by grabbing the most common tags for the URLs in that category.
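The steps above can be sketched roughly as follows. This is only an illustration, not the code behind the categories page: the `TAG_CLUSTER` lookup is a made-up stand-in for the historical tag clustering database, Jaccard overlap with a fixed `threshold` stands in for the real similarity measure and stopping heuristics, and "unique tags" is read here as tags that appear on no other recent post.

```python
from collections import Counter

# Hypothetical tag -> tag-cluster lookup, standing in for the historical
# tag clustering database (the real one is not shown in this post).
TAG_CLUSTER = {
    "recipes": "food", "cooking": "food", "food": "food",
    "python": "programming", "programming": "programming", "code": "programming",
}

def filter_posts(posts):
    """Drop posts with no tags and posts whose tags appear on no other post."""
    tag_counts = Counter(t for tags in posts.values() for t in set(tags))
    kept = {}
    for url, tags in posts.items():
        shared = [t for t in tags if tag_counts[t] > 1]
        if shared:
            kept[url] = shared
    return kept

def cluster_profile(tags):
    """The set of tag clusters that a URL's tags fall into."""
    return {TAG_CLUSTER[t] for t in tags if t in TAG_CLUSTER}

def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def categorize(posts, threshold=0.5):
    """Greedily merge the two most similar groups until none reach threshold."""
    kept = filter_posts(posts)
    groups = [({url}, cluster_profile(tags)) for url, tags in kept.items()]
    while len(groups) > 1:
        best, pair = 0.0, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                score = jaccard(groups[i][1], groups[j][1])
                if score > best:
                    best, pair = score, (i, j)
        if pair is None or best < threshold:
            break  # assumed stopping rule: nothing left that is similar enough
        i, j = pair
        merged = (groups[i][0] | groups[j][0], groups[i][1] | groups[j][1])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]

    categories, uncategorized = {}, []
    for urls, _ in groups:
        if len(urls) < 2:
            uncategorized.extend(urls)  # a lone URL is not a category
            continue
        # Name the category after the most common tag among its URLs.
        counts = Counter(t for u in sorted(urls) for t in kept[u])
        name = counts.most_common(1)[0][0]
        categories.setdefault(name, set()).update(urls)
    named = {name: sorted(urls) for name, urls in categories.items()}
    return named, sorted(uncategorized)

sample_posts = {
    "http://example.com/a": ["recipes", "food"],
    "http://example.com/b": ["food", "cooking"],
    "http://example.com/c": ["python", "programming"],
    "http://example.com/d": ["python", "code"],
    "http://example.com/e": [],
    "http://example.com/f": ["zzz-unique"],
}
categories, uncategorized = categorize(sample_posts)
```

With this sample data, the two food links end up in one category and the two Python links in another, while the untagged post and the post with a one-off tag are dropped before clustering. Naming a category after its most common tag is also what makes a mislabeled category possible: the label is only as good as the tags people happened to use.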

It works well, but far from perfectly. There are a few big caveats. First, the categories it produces are based on the recent URLs. In the grand scheme of things "computers" may be a big category, but if only one or two recent entries are about computers, it may not even get a category. Second, the names for the categories are often wrong or misleading. Often a category will be meaningful and coherent, but slightly (or oddly) mislabeled. For example, earlier today there was a category "recipes". It would have been better labeled "food", as it included several sites that were about food but not recipes, but the system didn't know any better. Third, it will sometimes leave links uncategorized that should obviously go in one of the generated categories. Sometimes this is due to odd tagging and sometimes it is due to insufficient tags. Finally, it sometimes puts a URL in an entirely inappropriate category.

I hope to address some of these soon by using more extensive tag histories for each URL, by considering only URLs with a "critical mass" of tags, and by improving the "tag clusters". Check it out.
