Archive for March, 2010

A Tempest in a Teacup

Wednesday, March 24th, 2010

So now that Lexascope and Tempest are out in the open, I thought a small blog post on some of the implementation details would be of interest.  Plus I haven’t written one of these for a long time and I’m getting nagged :)

Early on (within the first couple of weeks of getting the project off the ground), we decided that we were going to go with a cloud-based architecture rather than host everything ourselves, and AWS seemed the no-brainer choice.  It offered excellent reliability, control over our costs and relatively simple APIs for us to work with.  Oh, it also offered Windows-based images, which for some of us (yes, I mean me) is only a good thing.  My Linux skills are the stuff of legend.

We had a couple of abortive starts, as most projects tend to do.  For instance, we foolishly assumed that an EBS volume could be attached to multiple EC2 instances at the same time.  Oops!  Back to the drawing board on that method of communication, then.

The other big issue we had early on was our choice of MySQL as the central database storage mechanism.  In our original plan, a lot of calculations and rollups were going to occur server-side, so we had functions that enabled you to do things like request the top 10 themes for a project, etc.  However, our early tests indicated that the almost constant inserts, coupled with the heavy processing load that these complex queries produced, meant that a single database wasn’t going to be up to scratch.
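To give a flavour of the kind of rollup I mean, here’s an illustrative version of a “top 10 themes for a project” query in C against the MySQL C API.  The connection details and the schema (the assets and themes tables) are made up for the example; they aren’t the real ones.

    /* Illustrative only: the sort of server-side rollup that was hammering a
       single MySQL instance.  Schema and connection details are hypothetical. */
    #include <stdio.h>
    #include <mysql/mysql.h>

    int main(void)
    {
        MYSQL *conn = mysql_init(NULL);
        if (!mysql_real_connect(conn, "localhost", "user", "pass", "tempest", 0, NULL, 0))
            return 1;

        /* Top 10 themes for one project, aggregated over all of its assets */
        const char *query =
            "SELECT t.theme, COUNT(*) AS mentions "
            "FROM themes t JOIN assets a ON a.id = t.asset_id "
            "WHERE a.project_id = 42 "
            "GROUP BY t.theme ORDER BY mentions DESC LIMIT 10";

        if (mysql_query(conn, query) == 0) {
            MYSQL_RES *res = mysql_store_result(conn);
            MYSQL_ROW row;
            while ((row = mysql_fetch_row(res)) != NULL)
                printf("%s  %s\n", row[0], row[1]);
            mysql_free_result(res);
        }
        mysql_close(conn);
        return 0;
    }

Now picture that sort of query running for every project dashboard while new assets pour in underneath it, and you can see why one box wasn’t going to cut it.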

Interestingly, though, discussions with potential users of the system indicated that they would prefer access to the raw data, so we made some changes based on this and switched our storage mechanism from MySQL to Amazon’s SimpleDB, as we no longer needed to do any complex rollups.

So where are we now?

The system basically consists of 3 groups of machines.

Feed Gathering – These processes are responsible for getting all the RSS- and Atom-based content into the system, including Twitter content.  When a feed gatherer is told to go get a particular feed, it:

  1. Grabs the specified feed and checks for new content
  2. Follows the links to any new content and extracts the text from the HTML
  3. Enhances the text with Salience and User Defined Categories
  4. Inserts the results into the SDB store

It then goes into an idle state until it is told to go get its next bit of work.
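To make those four steps concrete, here’s a rough sketch of the per-feed loop in C.  The libcurl calls are real; the extern helpers (parse_feed_items, extract_article_text, enhance_with_salience, sdb_put_asset) are hypothetical stand-ins for internal code, so treat this as an outline rather than the actual implementation.

    /* Sketch of the per-feed processing loop.  Only the libcurl calls are real;
       the extern helpers below are hypothetical placeholders. */
    #include <stdlib.h>
    #include <string.h>
    #include <curl/curl.h>

    struct buffer { char *data; size_t len; };

    /* libcurl write callback: append the response body to a growable buffer */
    static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata)
    {
        struct buffer *buf = userdata;
        size_t bytes = size * nmemb;
        buf->data = realloc(buf->data, buf->len + bytes + 1);
        memcpy(buf->data + buf->len, ptr, bytes);
        buf->len += bytes;
        buf->data[buf->len] = '\0';
        return bytes;
    }

    /* Fetch a URL into memory with the libcurl easy interface */
    static int fetch_url(const char *url, struct buffer *out)
    {
        CURL *curl = curl_easy_init();
        CURLcode rc;
        if (!curl) return -1;
        curl_easy_setopt(curl, CURLOPT_URL, url);
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
        rc = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        return (rc == CURLE_OK) ? 0 : -1;
    }

    /* Hypothetical internal helpers (prototypes only) */
    extern int   parse_feed_items(const char *xml, char urls[][512], int max);
    extern char *extract_article_text(const char *html);
    extern char *enhance_with_salience(const char *text);
    extern int   sdb_put_asset(const char *url, const char *enriched);

    void process_feed(const char *feed_url)
    {
        struct buffer feed = { NULL, 0 };
        char item_urls[64][512];
        int n, i;

        if (fetch_url(feed_url, &feed) != 0)              /* 1. grab the feed       */
            return;
        n = parse_feed_items(feed.data, item_urls, 64);   /*    check for new items */

        for (i = 0; i < n; i++) {
            struct buffer page = { NULL, 0 };
            if (fetch_url(item_urls[i], &page) != 0)      /* 2. follow the link     */
                continue;
            char *text = extract_article_text(page.data); /*    text from the HTML  */
            char *rich = enhance_with_salience(text);     /* 3. enhance the text    */
            sdb_put_asset(item_urls[i], rich);            /* 4. insert into SDB     */
            free(page.data);
        }
        free(feed.data);
    }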

User Content Gathering – These processes deal with all the content that gets added directly by users, such as URLs and load files.  They monitor the content queue for new user-added content, enhance any assets they find, and then insert the results into the SDB store.

Project Mapping – When feed assets are inserted into the store, they aren’t mapped to a particular user or project.  This is because, regardless of how many people have added a particular feed, we only go and get it once.  Project mapping maps the inserted assets to the correct projects at the correct time.  It also handles the insert limits that the various account types have.
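In outline, the mapping pass looks something like the sketch below.  All of the helpers are hypothetical stand-ins for lookups against our own store, but it shows the basic idea: every project subscribed to a feed gets the new asset, unless that account has already hit its insert limit.

    /* Hypothetical sketch of the mapping pass; none of these helpers are real,
       they just stand in for lookups against the SDB store. */
    extern int  projects_for_feed(const char *feed_id, char projects[][64], int max);
    extern int  project_insert_count(const char *project_id);  /* inserts so far     */
    extern int  project_insert_limit(const char *project_id);  /* limit for the plan */
    extern void map_asset_to_project(const char *asset_id, const char *project_id);

    void map_new_asset(const char *asset_id, const char *feed_id)
    {
        char projects[256][64];
        int n = projects_for_feed(feed_id, projects, 256);
        int i;

        for (i = 0; i < n; i++) {
            /* Enforce the per-account insert limit before mapping */
            if (project_insert_count(projects[i]) >= project_insert_limit(projects[i]))
                continue;
            map_asset_to_project(asset_id, projects[i]);
        }
    }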

There are also various ancillary processes that handle things like deleting expired content and logging what’s actually going on in the system.

By far the biggest load is currently in Feed Processing, and I expect this to continue, especially as we go and get the content pointed to by the feed; indeed, we intend to extend this to go and get bit.ly links pointed to by tweets and bring them in as part of the same asset.  Fortunately, having all this stuff in the cloud makes it relatively easy to scale the system to handle increased demand, though it currently needs to be done manually.  The next-generation back end that I’m just starting to work on now will have the ability to auto-scale and will make use of a lot more of the features that AWS offers, especially the SQS message service.
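The auto-scaling itself is still on the drawing board, but the idea is roughly the sketch below: watch the depth of an SQS work queue and adjust the number of feed-processing instances to match.  The helpers and thresholds here are entirely hypothetical; the real calls would go through our own C wrapper around the AWS REST APIs.

    /* Hypothetical sketch of queue-driven scaling.  sqs_queue_depth(),
       ec2_running_workers() and ec2_set_instance_count() are placeholders for
       calls made through our AWS wrapper, not real library functions. */
    extern int  sqs_queue_depth(const char *queue_url);
    extern int  ec2_running_workers(void);
    extern void ec2_set_instance_count(int count);

    #define ITEMS_PER_WORKER 500   /* assumed comfortable backlog per instance */
    #define MIN_WORKERS      1
    #define MAX_WORKERS      20

    void rebalance_workers(const char *queue_url)
    {
        int backlog = sqs_queue_depth(queue_url);
        int wanted  = backlog / ITEMS_PER_WORKER + 1;

        if (wanted < MIN_WORKERS) wanted = MIN_WORKERS;
        if (wanted > MAX_WORKERS) wanted = MAX_WORKERS;

        /* Only touch EC2 if the worker count actually needs to change */
        if (wanted != ec2_running_workers())
            ec2_set_instance_count(wanted);
    }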

The first thing to benefit from that will be Twitter handling, as we intend to start consuming more of the firehose as well as allowing more complex queries involving operators such as NEAR and EXCLUDE.  No date for that, of course.

Next time out I shall go into some more details about the specifics of integrating with AWS, the fun of rolling your own C library to do that, and just how to get multicurl to work.
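As a small taster of the multicurl part, here’s a minimal sketch of libcurl’s multi interface fetching several feeds concurrently.  The URLs are placeholders and error handling has been trimmed; the interesting bit is the add-handles / perform / wait loop.

    /* Minimal sketch of the libcurl multi interface: fetch several feeds
       concurrently from a single thread.  Placeholder URLs, no error handling. */
    #include <stdio.h>
    #include <curl/curl.h>

    int main(void)
    {
        const char *urls[] = {
            "http://example.com/feed1.xml",
            "http://example.com/feed2.xml",
            "http://example.com/feed3.xml"
        };
        const int count = sizeof(urls) / sizeof(urls[0]);
        CURL  *handles[3];
        CURLM *multi;
        int still_running = 0;
        int i;

        curl_global_init(CURL_GLOBAL_ALL);
        multi = curl_multi_init();

        /* One easy handle per feed, all added to the same multi handle */
        for (i = 0; i < count; i++) {
            handles[i] = curl_easy_init();
            curl_easy_setopt(handles[i], CURLOPT_URL, urls[i]);
            curl_multi_add_handle(multi, handles[i]);
        }

        /* Drive all the transfers until they complete */
        curl_multi_perform(multi, &still_running);
        while (still_running) {
            int numfds = 0;
            curl_multi_wait(multi, NULL, 0, 1000, &numfds);
            curl_multi_perform(multi, &still_running);
        }

        /* Tidy up */
        for (i = 0; i < count; i++) {
            curl_multi_remove_handle(multi, handles[i]);
            curl_easy_cleanup(handles[i]);
        }
        curl_multi_cleanup(multi);
        curl_global_cleanup();
        return 0;
    }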

Foursquare and existentialism: why am I here?

Tuesday, March 16th, 2010

I signed up for foursquare about a month ago after reading an article in the Boston Globe about it, and noticing that a coworker was also signed up for it. At first, it seemed simple. Anytime you go to a restaurant or store or local Dunkin’ Donuts, you “check in”. As you check in to different places, you earn Boy Scout-like badges. Visit a place often, and you become the “mayor”. Some places, mainly restaurants and bars, have promotions where you get a free drink or appetizer if you show them that you are checking in there.

At first, I jumped right in, and started checking in everywhere. Run an errand to buy blinds at Lowe’s, check in. Stop by Target for some cat food, check in. Grab a cup of coffee on the Mass Pike headed out to Amherst, check in. Pretty soon, I became the Mayor of Lowe’s in Plainville, and that’s when I stopped to ask the question, “what’s the point of all this?”

(more…)