A mysterious yet intriguing project from Russia has come across our inbox. It is a search-engine optimization analysis tool for websites called TheRarestWords. For any given URL, like Microsoft’s or TechCrunch’s, it shows you the rarest keywords on the homepage (i.e., the ones most likely to give your site some search-engine juice), other sites with related keywords, and a list of categories the site would fit under based on those keywords. For Microsoft, some of the rare keywords it identifies are “silverlight,” “biztalk,” “onecare,” “skydrive,” “popfly,” “ballmer,” and “ozzie.” You can try your site by going to http://therarestwords.com/YOURSITE.com.
TheRarestWords then tries to tap into crowd intelligence by letting anyone add a 100-character definition for each keyword, which could give it a semantic edge in trying to categorize each site. The definitions could be gamed pretty easily, but this looks to be just a Web project at this point. They could also be used to create a wiki dictionary like Lingoz or Wiktionary, but that does not seem to be the focus of the project.
The author and sole founder, who is from Russia and wants to keep a low profile for now, says it is just a hobby that he started in December 2007. He calls it a “linguistic experiment.”
His spider (called TheRarestParser/0.2a) started scouting the internet in May and extracting words from many websites. It looks at which words are used most often across those websites and which ones are rarely used, or not used at all. For now it extracts words only from the first page of a domain and doesn’t go any deeper than that. Even so, the spider has managed to index 20 million words from many domains.
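To make the idea concrete, here is a minimal sketch of how "rarest words" could be computed from a set of crawled homepages. This is not the author's actual algorithm, which has not been published. It assumes rarity means "appears on the fewest homepages," and the function names (`tokenize`, `rarest_words`) and the sample domains and text are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return the set of distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def rarest_words(homepages, domain, top_n=5):
    """Rank one homepage's words by how few crawled homepages use them."""
    # Document frequency: the number of homepages each word appears on.
    doc_freq = Counter()
    for text in homepages.values():
        doc_freq.update(tokenize(text))
    words = tokenize(homepages[domain])
    # Rarest first; ties are broken alphabetically so the order is stable.
    return sorted(words, key=lambda w: (doc_freq[w], w))[:top_n]

# Hypothetical crawl results: domain -> homepage text (sample data only).
homepages = {
    "a.example": "download silverlight and windows updates",
    "b.example": "windows news and reviews",
    "c.example": "news and gadget reviews",
}

print(rarest_words(homepages, "a.example", top_n=3))
# → ['download', 'silverlight', 'updates']
```

Common words like "and" sink to the bottom because every homepage uses them, while words that show up on only one site float to the top.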
The author wants to implement new options like:
* Trend spotting (which words are gaining popularity, like “django”; which are holding strong, like “python”; and which are losing ground, like “perl”)
* Help with SEO for mom-and-pop business sites (it could be useful from this standpoint, the author says)
* Auto-categorization of your sites against a big list of categories (this has actually been implemented already, but the algorithm still needs to be perfected)
The interface is confusing the first time you go there, but there is some interesting data you can pull from it. For instance, you can have an SEO fight between any two sites by typing in the address: http://therarestwords.com/vs/your-site.com/competitors-site.com. This feature shows which rare words your site has that your competitor doesn’t and vice versa.
For example, here’s TechCrunch vs. GigaOm. This is only a snapshot of what is on each front page, but we are more likely to get search traffic right now for terms like “friendfeed,” “gamestop,” and “blogosphere,” while they are kicking our butts on “qualcomm,” “powerset,” and “sarcasm.” (At least that was the case before I put up this post. I really can’t let Om beat us on sarcasm.)
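The site-vs-site comparison boils down to two set differences. The sketch below is only an illustration of that idea, not the site's real code: the function name `seo_fight` is hypothetical, and the shared keyword "startup" is invented so the example has something in common to filter out.

```python
def seo_fight(rare_a, rare_b):
    """Return the rare words unique to each site, sorted alphabetically."""
    set_a, set_b = set(rare_a), set(rare_b)
    only_a = sorted(set_a - set_b)  # words site A has that site B lacks
    only_b = sorted(set_b - set_a)  # words site B has that site A lacks
    return only_a, only_b

# Sample keyword lists; "startup" is a made-up shared word.
site_a = ["friendfeed", "gamestop", "blogosphere", "startup"]
site_b = ["qualcomm", "powerset", "sarcasm", "startup"]

ours, theirs = seo_fight(site_a, site_b)
print(ours)    # → ['blogosphere', 'friendfeed', 'gamestop']
print(theirs)  # → ['powerset', 'qualcomm', 'sarcasm']
```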