The FuzzyBlog! : Rethinking Search and Retrieval for Blogs

The FuzzyBlog!

Marketing 101. Consulting 101. PHP Consulting. Random geeky stuff. I Blog Therefore I Am.

Rethinking Search and Retrieval for Blogs

A friend asked me about helping out on search and retrieval for blogs.  He's implemented a system, BlogStreet, which displays "Blog Neighborhoods", or collections of blogs that are related by a dynamic analysis of their blogrolls.  This is a very cool concept.  Now he's implemented search and retrieval across their database of 13,000 blog URLs and is finding that it isn't working so well.  He initially asked me for a technology recommendation and I pointed him towards mnoGoSearch.  Today he came back to me for help and, as a consultant, I immediately asked about two things:
  • Could I officially bid on the work?
  • What's the budget?
When I found out the budget, I realized "NASA, we have a problem".  In short, his budget was, I felt, off by a factor of 10 or more.  And, dear reader, that's what has brought us to this essay.  So let's start from scratch:

Goal: To Build a Search and Retrieval System Across Weblogs

And what goes along with any goal is a constraint.  Here we have a very real one:

Constraint: Don't Break the Bank

If you think about it, indexing blogs is basically the same as indexing web sites with some important constraints as follows:
  • Permalink awareness is really needed.  Finding an entry on a blog home page is pretty much useless unless you search the same day it’s indexed or the blog doesn’t change very often.
  • Information that is repeated in the website template really shouldn’t be indexed if at all possible, since it will dramatically screw up search results if that word or phrase is what someone searches for.
  • Currency is extremely important.  Weblogs change much more frequently than webpages so they need to be indexed more regularly.  Much more regularly.
  • There isn’t real money in it (today).  This may well change, but the current economics of blogging being an amateur thing mean that major capital infrastructure investments such as a big data center simply aren’t going to happen.
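The template point is the easiest one to sketch.  Here is a minimal example, in Python, of one way to drop repeated template text before indexing: treat any line that shows up on most pages of the same blog as template and throw it away.  The function name, the 80% threshold and the idea of comparing pages line by line are all my assumptions for illustration, not how BlogStreet or mnoGoSearch actually work.

```python
from collections import Counter


def strip_template_lines(pages, threshold=0.8):
    """Remove lines that recur on most pages of one blog (likely template text).

    pages: list of page texts from the same blog (hypothetical input format).
    threshold: fraction of pages a line must appear on to count as template.
    """
    # With fewer than two pages there is nothing to compare against.
    if len(pages) < 2:
        return list(pages)

    # Count on how many pages each distinct non-empty line appears.
    line_counts = Counter()
    for page in pages:
        unique_lines = {line.strip() for line in page.splitlines() if line.strip()}
        line_counts.update(unique_lines)

    cutoff = threshold * len(pages)
    template = {line for line, n in line_counts.items() if n >= cutoff}

    # Keep only the lines that are not part of the shared template.
    cleaned = []
    for page in pages:
        kept = [line.strip() for line in page.splitlines()
                if line.strip() and line.strip() not in template]
        cleaned.append("\n".join(kept))
    return cleaned
```

Feed it a handful of permalink pages from one blog and index what comes back.  A real indexer would probably compare HTML structure rather than raw lines, but the brute-force version shows the idea.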

 

OPML is a wicked cool way to display lightweight hierarchies of information.  It's an easy-to-implement (I did it in less than an hour for a FAQ application), XML-based, simple specification.  It works and the author should be gosh dang proud of it.  Here's the rub: OPML is displayed as XML tags in the browser.  Here's what you see in IE:

Here's the URL

To me, the view in IE is unacceptable.  This makes outlining a geek curiosity rather than a mainstream thing.  Yes, in a true outliner, the results will be better, but we need a way for people to view this in HTML.  I'd really like people to see my outlines now, but with only Radio users able to get to them, it's a chicken and egg situation.  Here's my recommendation.  And it isn't all that hard.

This is a Distributed Rendering Problem

Here are the issues as I see it:
  1. Take an OPML url and generate HTML from it for display.  XSLT, DHTML, etc. Who cares?  Let's get it done so that "Mom" or "GrandPa" can view it.  (No disrespect to highly technical Moms and GrandPas out there, this is a metaphor).  Edit or view, who cares?  Have to start somewhere and View is easier.
  2. Give a link to the actual OPML URL so that if people have a MIME-compliant OPML editor, it can be edited.  OPTIONAL: Let people have a preferences facility to bookmark outlines and share them.
  3. Do it without breaking the bank on hardware.
That last point is the hard one.  Here's my solution:
  1. Write this in a commonly available web language, currently installed on over 3,000,000 hosts worldwide, that also happens to be network-ready, XML-capable and really, really easy to get stuff done in.  Sure, we'd all love to use Zope or Python or ExoticLangOfTheDayHere.  Guess what: PHP's what I recommend.  It meets these criteria and more.

      It’s wicked portable, fast enough and has none of the install problems with Perl scripts (flames to sjohnson@fuzzygroup.com).

  2. Write a renderer in PHP.  Make it smart enough to update its rendering params from a server periodically.  Make it accept one parameter, the OPML file to render.
  3. Write this code so it's drop dead simple to install on a server.  Make it "ioview.php", no includes.  Copy it into a website and go. 
  4. Let people who download it and install it sign up with UserLand as an "OPML Partner".  Award "Karma Points" if they do it.
  5. Let UserLand operate a redirector service which forks IO rendering requests out at random to different servers all over the globe.  This could probably be done with one or two Linux boxes.  Sure we could make it fancy but let brute force solve it for now.  Heck, all UserLand really has to do is own the DNS entries and a little tiny bit of hardware to jumpstart it.
  6. Ask the Radio community to help out.  I have right now 3 boxes I could register.  I don't mind giving up a little cpu and bandwidth.
  7. Do something with the "Karma Points".  Have a pot luck supper or something.  Who cares.  We'll do it because we're a community and we believe.  The karma is just an idea.
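To show how small the renderer in step 2 could be, here is a minimal sketch of the OPML-to-HTML step.  It is written in Python purely for illustration (the recommendation above is still PHP), and the function names and the "View the OPML source" link text are my own assumptions.  It takes the OPML text and, optionally, the OPML URL to link back to; fetching the file and the periodic update of rendering params are left out.

```python
import html
import xml.etree.ElementTree as ET


def render_outline(node):
    """Render the <outline> children of node as a nested HTML list."""
    children = node.findall("outline")
    if not children:
        return ""
    items = []
    for child in children:
        # OPML keeps each node's text in the "text" attribute.
        text = html.escape(child.get("text", ""))
        items.append("<li>" + text + render_outline(child) + "</li>")
    return "<ul>" + "".join(items) + "</ul>"


def opml_to_html(opml_source, opml_url=None):
    """Turn OPML text into a plain HTML page (view only, no editing)."""
    root = ET.fromstring(opml_source)
    title = html.escape(root.findtext("head/title", default="Outline"))
    body = root.find("body")
    outline_html = render_outline(body) if body is not None else ""
    link = ""
    if opml_url:
        # Link back to the raw OPML so an outliner can open it for editing.
        link = '<p><a href="%s">View the OPML source</a></p>' % html.escape(opml_url, quote=True)
    return ("<html><head><title>%s</title></head><body><h1>%s</h1>%s%s</body></html>"
            % (title, title, outline_html, link))
```

Nested outlines become nested bullet lists, which is about all "Mom" or "GrandPa" needs in order to read one.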
I'm willing to help.  Anyone else?  I can devote IQ, coding and CPU to it.  There have to be a lot of boxes out there with light loads.
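For the curious, the brute-force redirector from step 5 really is tiny.  Here is a rough sketch, again in Python only for illustration: pick a registered server at random and build the address to bounce the browser to.  The server list, the "url" query parameter and the function name are all hypothetical; nothing here is an actual UserLand service.

```python
import random
from urllib.parse import urlencode

# Hypothetical list of volunteer servers that have installed ioview.php.
RENDER_SERVERS = [
    "http://server-a.example.com/ioview.php",
    "http://server-b.example.com/ioview.php",
    "http://server-c.example.com/ioview.php",
]


def pick_redirect(opml_url, servers=RENDER_SERVERS, rng=random):
    """Return the URL a redirector would send the browser to."""
    # Brute force: no load balancing or health checks, just a random pick.
    server = rng.choice(servers)
    # The "url" parameter name is an assumption; the renderer only needs
    # one parameter, the OPML file to render.
    return server + "?" + urlencode({"url": opml_url})
```

The redirector box would answer each request with an HTTP redirect to whatever this returns, so the rendering load lands on the volunteer servers and not on the redirector.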
 

This page was last updated: 10/3/2002; 9:05:07 PM

Copyright 2002 The FuzzyStuff

Theme Design by Bryan Bell
