You have reached the blog of
Matthew Gray
matthew@gray.org
I am a father, board gamer and software engineer.

Internet

In addition to my blog (this page), you can find me on BoardGameGeek, Twitter, Flickr, LinkedIn, FriendFeed, and various other places. I also have a slightly stale homepage.

Personal

I am an avid board gamer. I am one of the (volunteer) admins of BoardGameGeek, maintainer of the GameStoreDB, board game blogger, and gaming software geek.

Professional

I am a staff software engineer at Google. Previously, I was the CTO at an 802.11 location and security company, Newbury Networks in Boston. In June 1999 I received my Master's degree from the MIT Media Lab. I graduated from MIT (undergraduate) in June 1997, in physics. Prior to that I was CTO of net.Genesis from 1994 to 1996.

While at MIT, I was one of the three members of the Student Information Processing Board (SIPB) who set up www.mit.edu in the spring of 1993. I am also a former/inactive member of the Apache group, a volunteer group of developers of Apache, the world's most popular web server.

Blog

Tuesday, January 4, 2005

Del.icio.us Categories

Updated: The categories it was picking weren't very good. Part of the reason was that the hierarchy wasn't visible, and the system wasn't very good at picking "cut-off" points in the hierarchy. I made a new version that shows the hierarchy in the same style as the earlier clustering example: del.icio.us hierarchy. It's much better. The thicker, darker lines correspond to more strongly associated URLs. Plus, both of them now use recent popular links, which improves its knowledge a bit (more popular means more tags to learn from).



Continuing in the same vein as my earlier experiment with delicious clusters, I decided to turn it around and try clustering the URLs, which essentially means categorizing them. It's been pointed out that a problem with these tagging interfaces (aka folksonomies) is their lack of support for synonyms and the like. My clustering system helps address this, and it also handles related but non-synonymous terms. In any case, using the clustered taxonomy generated from the tags, it is possible to cluster the URLs into categories. I created a first pass of a del.icio.us categories page. It works OK, and I believe I know how to make it work better.

What it does is take about 150 recent postings from del.icio.us, strip out the ones with no tags and the ones with unique tags, and apply the historical tag clustering database to identify which "tag clusters" each URL is part of. From there, it identifies which URLs' tag clusters are similar to one another and creates URL clusters based on that. These are the categories. It has some basic heuristics for determining when to stop glomming categories together. Once it has all the categories, it tries to pick a good name for each one by grabbing the most common tags for the URLs in that category.
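Roughly, the steps above could look something like the following Python. This is only an illustrative sketch, not the code behind the categories page. The `TAG_CLUSTER` lookup is a made-up stand-in for the historical tag clustering database, Jaccard overlap is an assumed similarity measure, and the fixed `threshold` is an assumed stand-in for the "when to stop glomming" heuristics. All function names are hypothetical.

```python
from collections import Counter

# Hypothetical tag -> tag-cluster lookup, standing in for the
# historical tag clustering database (contents are invented).
TAG_CLUSTER = {
    "recipes": "food", "cooking": "food", "food": "food",
    "python": "programming", "code": "programming",
}


def cluster_profile(tags):
    """Return the set of tag clusters that a URL's tags fall into."""
    return {TAG_CLUSTER[t] for t in tags if t in TAG_CLUSTER}


def jaccard(a, b):
    """Overlap of two sets: 0.0 if disjoint, 1.0 if identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def categorize(posts, threshold=0.5):
    """Greedily group URLs whose tag-cluster profiles are similar.

    posts maps each URL to its list of tags. URLs with no tags, or
    with only tags unknown to the lookup, are skipped.
    """
    categories = []  # each entry: {"profile": set, "urls": list}
    skipped = []
    for url, tags in posts.items():
        profile = cluster_profile(tags)
        if not profile:
            skipped.append(url)
            continue
        # Find the existing category most similar to this URL.
        best = max(categories,
                   key=lambda c: jaccard(c["profile"], profile),
                   default=None)
        if best is not None and jaccard(best["profile"], profile) >= threshold:
            best["urls"].append(url)
            best["profile"] |= profile
        else:
            # Not similar enough to anything: start a new category.
            categories.append({"profile": set(profile), "urls": [url]})
    return categories, skipped


def name_category(category, posts):
    """Name a category after the most common tag among its URLs."""
    counts = Counter(t for url in category["urls"] for t in posts[url])
    return counts.most_common(1)[0][0]
```

With two food-related URLs tagged mostly "recipes", `name_category` would pick "recipes" as the name, even when "food" would describe the category better.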

It works well, but far from perfectly. There are a few big caveats. First, the categories it produces are based on the recent URLs. In the grand scheme of things "computers" may be a big category, but if only one or two recent entries are about computers, it may not even get a category. Second, the names for the categories are often wrong or misleading. Often a category will be meaningful and coherent, but slightly (or oddly) mislabeled. For example, earlier today there was a category "recipes". It would have been better labeled "food", as it included several food-related sites that weren't about recipes, but the system didn't know any better. Third, it will sometimes leave links uncategorized that should obviously go in one of the generated categories. Sometimes this is due to odd tagging and sometimes it is due to insufficient tags. Finally, it sometimes puts a URL in an entirely inappropriate category.

I hope to address some of these soon by using more extensive tag histories for each URL, by only including URLs with a "critical mass" of tags, and by improving the "tag clusters". Check it out.
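As a rough idea of what a "critical mass" filter might look like, here is a small Python sketch. It assumes a URL's tag history is available as one list of tags per person who bookmarked it. The function names and both cut-off values are made up for illustration and would need tuning.

```python
from collections import Counter


def tag_history_counts(history):
    """Count how often each tag appears across a URL's tag history.

    history is a list of tag lists, one per person who saved the URL.
    """
    return Counter(tag for tags in history for tag in tags)


def has_critical_mass(history, min_taggings=5, min_distinct_tags=2):
    """Return True if the URL has enough tags to be worth categorizing.

    Both thresholds are arbitrary illustrative values.
    """
    counts = tag_history_counts(history)
    total_taggings = sum(counts.values())
    return total_taggings >= min_taggings and len(counts) >= min_distinct_tags
```

URLs that fail the check would simply be left out of the clustering step until they pick up more tags.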

Posted by Matthew Gray at 7:45 PM
Labels: delicious, Technology



Disclaimer

I work for Google as a Software Engineer. This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.