Archive for the ‘General’ Category

Support Notes: Deploying the Salience assemblies on a Windows network drive

Thursday, December 20th, 2012

This is a brief note from a professional services engagement currently in progress. Our professional services folks are developing a custom Excel plug-in for a client that uses Salience to analyze the content of their worksheets within the familiar Excel environment.

The Excel plug-in integrates with Salience through a Microsoft Office add-in written in C# and the Salience .Net wrapper. The client wanted the end-user desktops to only have the add-in installed, with Salience and its related data directory residing on a mapped network drive. Read on to see how we accomplished this.


Ten things to know about Salience (part 2)

Friday, April 22nd, 2011

This is a follow-on to our list of ten things to know about Salience Engine. Together, these two articles are intended to guide developers in some of the main aspects of working with Salience Engine when they first start out.

In the first part, most of the topics focused on deployment strategies and approaches. In this second part, we’ll look at ways of tuning the results you get from Salience Engine. So let’s roll up our shirt-sleeves and get back into it…


Ten things to know about Salience (part 1)

Friday, April 22nd, 2011

I had a meeting with a client recently, and one of the suggestions they raised was a list of the top 10 things that an engineer should know when they start working with Salience Engine. Some of these may seem basic; however, it’s not safe to assume that things which seem obvious actually are. With all due respect to David Letterman and his Late Night Top Ten lists, here we go…


¿Sabe usted español? Você sabe português?

Thursday, March 10th, 2011

The Salience Engine developed by Lexalytics is capable of industry-leading entity extraction, sentiment analysis, theme extraction, summarization, and other text analysis. Prior to our latest release, however, this functionality was limited to the analysis of English content only. Our release of Salience 4.4 introduced support for the analysis of French text. This support was built from the ground up to provide the engine with a deep and native understanding of French. Our French support is already being deployed by MediaVantage (MédiaVantage et Lexalytics offrent une analyse automatique de ton en plusieurs langues) for their clients that need to analyze both English and French content. As my grade school French teacher would say, “C’est formidable!”

For our next release later this year, we’re setting our sights on additional language support. This time, Spanish and Portuguese. In order to achieve this, we will follow the same recipe: gathering native-language resources and knowledge to put together the building blocks needed to provide true support for Spanish and Portuguese.

If you are fluent in Spanish or Portuguese, this is where you can help us. We are looking for resources to assist with the annotation of content for use in training our engine. NOTE: This is not translation!

Please review the following criteria carefully:

  1. Must be a native or highly fluent speaker of Spanish (Latin American or European) or Portuguese (Brazilian or European).
  2. We prefer individual contractors, not agencies.
  3. No placement agencies, please.
  4. We prefer US-based contractors; it makes payment, taxes, etc. much easier.
  5. We would prefer resources in the Boston area, but this is not required.
  6. This is short-term contract work, not full-time employment.
  7. You must have a Windows PC that you can use to run our utilities that aid the annotation work.

If you meet these criteria and are interested in helping us understand Spanish or Portuguese, please get in touch by emailing us, with “Spanish contract resource” or “Portuguese contract resource” as the subject. We look forward to hearing from you.

Gracias a todos. Muito obrigado.

Does @CharlieSheen really have @klout?

Friday, March 4th, 2011

It’s been impossible this week to get away from Charlie Sheen. He’s all over the airwaves on television and radio, and it seems like everyone is commenting on his interviews on Twitter, Facebook, and other sites. But one thing that caught my attention (thanks to @eric_andersen) yesterday was an article posted on Klout, “Charlie Sheen Needs a Klout Score”. The article mentions Charlie Sheen’s Klout score and justifies it, concluding with the statement:

At Klout we measure influence which we define as the ability to drive measurable actions across the social web. Charlie’s first tweet contained a link to a picture – that link has been clicked through 455,000 times at the time of this writing (6:39PM PST).

So continuing in the tradition of Friday blog articles that are less technical but still related to text analytics or software engineering in general, I wanted to think about whether this really does provide a measure of influence.


A Tempest in a Teacup

Wednesday, March 24th, 2010

So now that Lexascope and Tempest are out in the open, I thought a small blog post on some of the implementation details would be of interest.  Plus I haven’t written one of these for a long time and I’m getting nagged :)

Early on (within the first couple of weeks of getting the project off the ground), we decided that we were going to go with a cloud-based architecture rather than host everything ourselves, and AWS seemed the no-brainer choice. It offered excellent reliability, control over our costs, and relatively simple APIs for us to work with. Oh, and it also offered Windows-based images, which for some of us (yes, I mean me) is only a good thing. My Linux skills are the thing of legend.

We had a couple of abortive starts, as most projects tend to do. For instance, we foolishly assumed that an EBS volume could be attached to multiple EC2 instances at the same time. Oops! Back to the drawing board on that method of communication, then.

The other big issue we had early on was our choice of MySQL as the central database storage mechanism. In our original plan, a lot of calculations and rollups were going to occur server-side, so we had functions that enabled you to do things like request the top 10 themes for a project. However, our early tests indicated that the almost constant inserts, coupled with the heavy processing load that these complex queries produced, meant that a single database wasn’t going to be up to scratch.

Interestingly though, discussions with potential users of the system indicated that they would prefer access to the raw data, so we made some changes based on this and switched our storage mechanism from MySQL to Amazon’s SimpleDB, as we no longer needed to do any complex rollups.
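To make that trade-off concrete, here is the kind of server-side rollup query we no longer needed, sketched with sqlite3 standing in for MySQL (the schema, table name, and data are invented for illustration). With raw data in SimpleDB, this aggregation moves to the client instead:

```python
import sqlite3

# sqlite3 stands in for MySQL here; schema and rows are made up for the sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE themes (project_id INTEGER, theme TEXT)")
db.executemany(
    "INSERT INTO themes VALUES (?, ?)",
    [(1, "pricing"), (1, "support"), (1, "pricing"), (1, "uptime")],
)

# The server-side rollup we moved away from: "top 10 themes for my project".
# Run constantly against a table under heavy insert load, queries like this
# were what a single database couldn't keep up with.
top = db.execute(
    "SELECT theme, COUNT(*) AS mentions FROM themes "
    "WHERE project_id = ? GROUP BY theme "
    "ORDER BY mentions DESC, theme LIMIT 10",
    (1,),
).fetchall()
print(top)  # [('pricing', 2), ('support', 1), ('uptime', 1)]
```

Handing users the raw rows and letting them do this counting themselves is what let us drop the relational database entirely.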

So where are we now?

The system basically consists of three groups of machines.

Feed Gathering – These processes are responsible for getting all the RSS- and ATOM-based content into the system, including Twitter content.  When a feed gatherer is told to go and get a particular feed, it:

  1. Grabs the specified feed and checks for new content
  2. Follows the links to any new content and extracts the text from the HTML
  3. Enhances the text with Salience and User Defined Categories
  4. Inserts the results into the SDB store

It then goes into an idle state until it is told to go get its next bit of work.
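The steps above boil down to a simple blocking work loop; here is a minimal sketch in Python (the function names, queue, and store are all stand-ins I made up for illustration, not the actual implementation):

```python
import queue

def run_feed_gatherer(work, fetch_new_entries, download, extract_text, enhance, store):
    """Illustrative gatherer loop: idle until given a feed, process it, repeat."""
    while True:
        feed_url = work.get()                      # block (idle) until told what to fetch
        if feed_url is None:                       # sentinel so the sketch can stop
            break
        for link in fetch_new_entries(feed_url):   # 1. grab feed, check for new content
            html = download(link)                  # 2. follow the link to the content
            text = extract_text(html)              #    and pull the text out of the HTML
            store[link] = enhance(text)            # 3+4. enhance, then insert the results

# Tiny demo with stand-in functions:
work = queue.Queue()
work.put("http://example.com/feed")
work.put(None)
store = {}
run_feed_gatherer(
    work,
    fetch_new_entries=lambda url: [url + "/post-1"],
    download=lambda link: "<p>hello world</p>",
    extract_text=lambda html: "hello world",
    enhance=lambda text: {"sentiment": 0.0, "text": text},
    store=store,
)
print(store)  # {'http://example.com/feed/post-1': {'sentiment': 0.0, 'text': 'hello world'}}
```

The blocking `get()` is the "idle state": a gatherer does nothing until the next bit of work arrives.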

User Content Gathering – These deal with all the content that gets added directly by users, such as URLs and load files.  They basically monitor the content queue for new user-added content, enhance any assets found, and then insert the results into the SDB store.

Project Mapping – When feed assets are inserted into the store, they aren’t mapped to a particular user or project.  This is because, regardless of how many people have added a particular feed, we only go and get it once.  Project mapping maps the inserted assets to the correct projects at the correct time.  It also handles the insert limits that the various account types have.
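A minimal sketch of that mapping pass for one newly inserted feed asset (every name, field, and limit here is invented for illustration; the real logic is more involved):

```python
def map_asset_to_projects(feed_id, projects, inserted_counts, limits):
    """Attach one inserted asset to every project subscribed to its feed,
    respecting per-account insert limits. All names are made up for the sketch."""
    mapped = []
    for project in projects:
        if feed_id not in project["feeds"]:
            continue                                   # project never added this feed
        account = project["account"]
        if inserted_counts.get(account, 0) >= limits[account]:
            continue                                   # account has hit its insert limit
        inserted_counts[account] = inserted_counts.get(account, 0) + 1
        mapped.append(project["name"])
    return mapped

projects = [
    {"name": "alpha", "account": "free", "feeds": {"f1"}},
    {"name": "beta", "account": "pro", "feeds": {"f1", "f2"}},
]
counts = {}
limits = {"free": 1, "pro": 1000}
print(map_asset_to_projects("f1", projects, counts, limits))  # ['alpha', 'beta']
print(map_asset_to_projects("f1", projects, counts, limits))  # ['beta']  (free account at its limit)
```

The key point is that the asset itself is fetched and stored once; only this cheap mapping step is repeated per project.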

There are also various ancillary processes that handle things like deleting expired content and logging what’s actually going on in the system.

By far the biggest load is currently in feed processing, and I expect this to continue, especially as we go and get the content pointed to by the feed; indeed, we intend to extend this to go and get links that are pointed to by tweets and bring them in as part of the same asset.  Fortunately, having all this stuff in the cloud makes it relatively easy to scale the system to handle increased demand, though it currently needs to be done manually.   The next-generation back end that I’m just starting to work on now will have the ability to auto-scale and will make use of a lot more of the features that AWS offers, especially the SQS message service.

The first thing to benefit from that will be Twitter handling, as we intend to start consuming more of the fire hose, as well as allowing more complex queries involving operators such as NEAR and EXCLUDE.  No date for that, of course.

Next time out I shall go into some more details about the specifics of integration with AWS, the fun of rolling your own C library to do that and just how to get multicurl to work.

Foursquare and existentialism: why am I here?

Tuesday, March 16th, 2010

I signed up for foursquare about a month ago after reading an article in the Boston Globe about it, and noticing that a coworker was also signed up. At first, it seemed simple. Anytime you go to a restaurant, store, or local Dunkin’ Donuts, you “check in”. As you check in to different places, you earn Boy Scout-like badges. Visit a place often, and you become the “mayor”. Some places, mainly restaurants and bars, have promotions where you get a free drink or appetizer if you show them that you are checking in there.

At first, I jumped right in, and started checking in everywhere. Run an errand to buy blinds at Lowe’s, check-in. Stop by Target for some cat food, check-in. Grab a cup of coffee on the Mass Pike headed out to Amherst, check-in. Pretty soon, I became the Mayor of Lowe’s in Plainville, and that’s when I stopped to ask the question, “what’s the point of all this?”


The Internet is an Archive

Tuesday, January 26th, 2010

The Internet has changed the way we shop, communicate, and get our questions answered. Someday, it may change the way we understand the past as well. Unless you carefully preserve it, ink on paper does not have an especially long life. And even writing that does survive is difficult to explore. Texts and artifacts of a bygone era are often scattered through libraries, museums, and private collections. An exhaustive search of the material requires dedication and plane tickets.

But my great-great-grandchildren might be able to look me up on Facebook or Twitter. They could see photos of me through different phases of my life. They could find out who my friends were, what music I enjoyed, even download those songs. And further down the road some distant relative could trace the many paths that led to him: the familial successes and tragedies; the dreams, hopes and unlikely coincidences that came together to cause a single birth. Genealogy is a popular topic today; will it grow even more so when, through a lifetime of expression, you can get to know any of your ancestors intimately?

Text Analytics, I suspect, will play a major role in the future of history. In twitter streams and blog posts, historians will be able to watch the ebb and flow of ideas. Traditionally, outbreaks of war draw the world’s attention, and we watch as the conflict plays out. But with pervasive archival, we’ll be able to follow the causes backwards as well. The countless anonymous participants in great moments will have identities that can be discovered, motivations and personalities that were formerly lost.

The Internet has the potential to be an archival form unlike anything the world has ever experienced. In its youth it’s easy to lose sight of this: less than thirty years old and much younger in terms of mass popularity, there isn’t a deep history to explore. But I think in the long run the Internet’s greatest contribution might end up being what it can teach us about ourselves by protecting our individualities from the ever-growing entropy of time.

Building a Firefox extension

Thursday, July 2nd, 2009

Over the past couple of days I’ve been experimenting with building a Firefox toolbar to work with our upcoming Lexascope API.  As normal with these things, a quick bing of the web (yeah, I switched to Bing for a couple of months to see if it is any better or worse than Google) pointed me in the direction of some excellent articles that really helped me get up and running, so in the tradition of the web I thought I’d share the best ones.


There are a few step-by-step guides that get you up and running, but not one that covers everything.  However, reading these should at least get you to the position of having something that works

and my personal favourite

Reference Guides

If you don’t know XUL (and beyond existing Firefox devs, who does?)


So hopefully those resources will help you as much as they helped me. Using these, I was able to get my toolbar up and running in just a few hours, going from complete newb to slightly less than complete newb.

QA, QA my kingdom for some QA

Wednesday, July 1st, 2009

As anyone who has worked with me over my career will know, QA is something that I don’t pay a massive amount of attention to.  Sure, it’s something that has to be done, but customer pressures have generally meant that it’s been an afterthought rather than something that is front and centre.

That’s not to say things went out untested, just that the testing would generally happen in one of two ways:

  1. The developer would build a test harness to make sure that the code performed the way they wanted it to.  Of course, the major problem with this is that the testing would normally take place on the developer’s own machine, and if you wrote the code you are naturally going to write a test harness that exercises it in the way it is supposed to be used.
  2. Ask the rest of the company to test a forthcoming release in a sort of beta-testing mode.  Of course, what normally happens is that the rest of the company is busy as well, and if you are lucky one or two people will give it a quick once-over.

Recently, this has all changed, with Carl and Paul putting some serious time into building test frameworks and, more importantly, coming up with tests for new functionality as well as a set of regression tests.  And it’s worked fantastically, with the tests catching several bugs and functional problems before any code went out the door, even in a beta state.  This can only be good news.
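As a toy illustration of the kind of pinned-down regression test that pays off (the function and its expected values are invented for the example, not taken from our actual suite):

```python
import unittest

def summarize_sentiment(scores):
    """Toy function under test: average a list of sentiment scores, empty-safe."""
    if not scores:
        return 0.0
    return sum(scores) / len(scores)

class RegressionTests(unittest.TestCase):
    def test_pinned_output(self):
        # Pin a known result for a fixed input; any behaviour change fails loudly.
        self.assertAlmostEqual(summarize_sentiment([0.5, -0.5, 1.0]), 1.0 / 3)

    def test_empty_input(self):
        # The edge case a developer's own harness tends to skip.
        self.assertEqual(summarize_sentiment([]), 0.0)

# Run with: python -m unittest <this_file>
```

The value isn’t in any one test; it’s that the whole suite runs the same way on every machine, before every release, instead of only on the developer’s box.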

Having a concrete set of tests with associated numbers has also enabled us to give a lot more accountability to our lords and masters who sign the pay checks, which is frankly a win-win all round.  They get to see what we have been doing, and we get to show how clever we have been!

So consider me a convert and bring on the tests – just so long as I don’t have to write them :)