Search


Exit Powerset, Enter Deep Web

Now that we are done with the Powerset hype (and the Cuil hype), The New York Times has taken it upon itself to find another avenue for generating hype in the search domain. In all honesty, the avenue is not all that new. We had been hearing about it even before the Powerset mania took over the world, but I thought we were over and done with it. At least that was the case till the NYT decided to bring out the old skeleton.

First Impressions: Persai

"Blogging Persai" is the title of the blog run by the Persai guys. If you needed an indication of how this post is going to proceed, a major hint would be that I was sorely tempted to give it the title "Flogging Persai." For a bunch of guys who were extremely trigger-happy during their Uncov.com days about stamping almost everything with the dreaded "FAIL," it is rather interesting that their own product is nothing short of a half-baked proof of concept that has been cobbled together for reasons that don't go beyond, well, the fact that it can be done.

Persai, according to the founders, is an ad-supported content recommendation system. Over time, the guys have crawled a truckload of RSS feeds (there used to be a blog entry which said as much; it is not on the blog anymore, but Sam Ruby has the list here), then indexed and classified them, and this in turn powers the recommendation system. You can subscribe to "interests" (known as keywords to the rest of humanity) and get sources thrown at you which the system thinks are relevant to you. While you can't do much else with the sources, since Persai does not have a built-in feed reader, you can reject sources. And that is all there is to see about Persai. Well, at least for now.

The problems

Use Case: Recommendation systems have not traditionally fared too well on the internet. Previous players like Greg Linden's Findory did a lot more than Persai does even today, and they did not do well at all. In fact, Findory, rather sadly, shut shop recently. The only recommendation system that works is Google News, and it works in a stealthy manner: it does not blatantly involve you in the recommendation process.

Once you find content on Persai, there is not much to do with it. Fulfillment is a term that is at best very vague on Persai. You can, as they claim, track the topics, but those links lead out of the website anyway. Individual interests have RSS feeds that you can subscribe to, but you can already do that with Google News Alerts and other products. I doubt anyone is going to use Persai just to have search-term-driven RSS feeds.

Accuracy: The approach that Persai has taken to classification relies on training data. This approach works well on data sets that resemble the training data, but the moment you deviate from that similarity, the entropy will be of a magnitude that sends the classifier on a wild goose chase. And as expected, this has an adverse impact on the accuracy of the results. For instance, one of my interests -- "mameo" -- throws back results at me where none of the first five has anything to do with Mameo. I could, of course, reject these sources and help improve Persai, but why would anyone do that when there are other avenues that provide much more accurate results?

Speed: To do classification, Persai is already using Hadoop's MapReduce. MapReduce does an amazing job of processing huge chunks of data in a distributed manner (freshly crawled data to be indexed and classified, in this case), but it may only help Persai to a certain extent. The reasons for this are simple: if they process interests as unique to each user, it just won't scale up. There will be numerous threads doing classification for the same interest, since each user's copy of it is treated as unique.

And if the interests are not tracked as a unique item per user, it can play havoc with the results with different users rejecting different sources for different reasons. Of course, there are workarounds for it by using a mix of both approaches (classify as non-unique, filter on display by excluding user-specific rejection criteria), but in the end it ends up being a hack.
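To make the mixed approach concrete, here is a minimal sketch in Python. It is purely an illustration and not how Persai is built: the names `shared_results`, `rejections` and `results_for`, and the sample data, are all assumptions made up for the example.

```python
# Hypothetical sketch: classify once per interest, filter per user on display.
# None of these names or values come from Persai; they are assumptions.

# Shared classification output: interest -> ranked list of (source, title).
shared_results = {
    "search": [
        ("blog-a.example", "Deep web crawling"),
        ("blog-b.example", "Cheap pills"),
        ("blog-c.example", "Semantic search"),
    ],
}

# Per-user rejections: (user, interest) -> set of rejected sources.
rejections = {
    ("alice", "search"): {"blog-b.example"},
}


def results_for(user, interest):
    """Return the shared results minus the sources this user rejected."""
    rejected = rejections.get((user, interest), set())
    return [
        (source, title)
        for source, title in shared_results.get(interest, [])
        if source not in rejected
    ]
```

The classification work is done once per interest, and one user's rejections never change what another user sees. The price is that the rejections also never feed back into the classifier, which is part of why it feels like a hack.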

In any case, the approach produces tremendously outdated results. Some of the interests have really old articles on top. This could also be due to the fact that the sources are manually added into the system, which means that the quality and spread of the sources will depend on the bias of the person who is selecting them. Moreover, sites without RSS feeds will not be able to get into Persai at all.

Splogs: Possibly the group that will be over the moon about Persai is the thugs who run splogs. With Persai it becomes ridiculously easy to set up automated blogs based on topics and, honestly, I see more people using Persai for this than anything else.

Considering that Persai is still in beta, I would not give it the "FAIL" rating, but I would certainly give it the "FRAIL" rating. I hope it becomes much better by the time it comes out of private beta.

Clustered river of news

RSS readers have over time become pretty fully-featured software in their own right. Most now provide the standard set of features (OPML import/export, categories, river of news and search) irrespective of their avatar -- online or offline -- and I have pretty much grown used to depending on my reader of choice, Google Reader, to satisfy the need to read my feeds.

That said, there is one feature I'd really love to have in my RSS reader - clustering of feed items as an additional way to categorise data, alongside the current methods of categories and tags. Think of it as a cross between your RSS reader and Google News/Techmeme. Would it not be nice to have your own little personal Google News or Techmeme built from the sources that you have picked, rather than be led by what Gabe or the kind folks at Google News may have seeded their websites with?

There are, though, a couple of problems that could make this impossible:

Processing: Any algorithm that finds similarities in text is computationally intensive, even when the data set is limited. Scaling is usually manageable when the size of the data set is reasonably fixed. But RSS subscription lists vary widely in size, so it would be a royal pain to find an algorithm that scales effectively and efficiently.
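To get a feel for the cost, here is a minimal sketch of the naive approach in Python. It assumes Jaccard similarity over word sets, which is just my choice for illustration; any pairwise text similarity measure has the same shape. The function names and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(items, threshold=0.5):
    """Compare every item with every other item (quadratic in len(items)).

    `items` maps a hypothetical item id to its text.
    Returns (pairs at or above the threshold, number of comparisons made).
    """
    words = {item_id: set(text.lower().split()) for item_id, text in items.items()}
    pairs = []
    comparisons = 0
    for left, right in combinations(sorted(words), 2):
        comparisons += 1
        if jaccard(words[left], words[right]) >= threshold:
            pairs.append((left, right))
    return pairs, comparisons
```

Every item is compared with every other item, so n items need n(n-1)/2 comparisons. A subscription list with 1,000 unread items already means 499,500 comparisons, and the number of unread items differs wildly from one reader to the next.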

Entropy: Traditional similarity match approaches work best when they cover a single domain, so that an apple would mean apple the fruit rather than Apple the company. The entropy found in the data set needs to be reasonable for the algorithm to function reasonably well. Learning systems also need to be taught with training data, which may not be possible in this case.

Link Match: What we are then left with is to attack the problem purely by tracking outgoing links. This would thankfully be far less computationally intensive than pure text analysis. The accuracy and utility of this approach may not be stunning, but it would certainly be good enough for the immediate purpose - a reasonable way of classifying what my subscription list is talking about.
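A rough sketch of link matching in Python, assuming the outgoing links of each item have already been extracted. The function name `cluster_by_links` and the input shape are hypothetical, and union-find is simply one convenient way to do the grouping.

```python
from collections import defaultdict


def cluster_by_links(items):
    """Group items that share at least one outgoing link.

    `items` maps a hypothetical item id to the set of URLs it links to.
    Items connected through a chain of shared links end up in one cluster.
    """
    parent = {item_id: item_id for item_id in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # Invert the mapping: URL -> items that link to it.
    linked_from = defaultdict(list)
    for item_id, links in items.items():
        for url in links:
            linked_from[url].append(item_id)

    # Every item pointing at the same URL joins the same cluster.
    for item_ids in linked_from.values():
        for other in item_ids[1:]:
            union(item_ids[0], other)

    clusters = defaultdict(list)
    for item_id in items:
        clusters[find(item_id)].append(item_id)
    return sorted(sorted(group) for group in clusters.values())
```

The work here grows with the total number of links rather than with the number of item pairs. One caveat: because clusters merge through chains of shared links, a single widely linked URL can pull unrelated items together, so a real reader would probably want to ignore very common link targets.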
