Entity extraction in Salience Five

December 21st, 2011 by Carl Lambrecht

For a client, I wanted to write up a detailed explanation of the entity extraction methods available in Salience Five: where they overlap and where they differ. And as I did, I thought, “That would make for a bloody useful blog post for the dev blog.” So here it is.

Prior to Salience 4.x, entity extraction was solely list-based. Salience 4.0 introduced model-based entity extraction, which allowed for novel entity extraction. In other words, “I didn’t think to add ‘John Smith’ to my list of people to extract, but Salience Engine found him in today’s news magically because it knows what names of people look like.” Very powerful stuff.

Salience Five continues to provide the model-based and list-based entity extraction found in Salience 4.x, with some of the same cross-over between the two and some changes to the terminology.

Named entities:

These are primarily entities identified by our entity extraction model. In Salience 4.x, these were just called “entities”. The terminology change in Salience Five is intended to be a bit more specific.

Model-based entity extraction is great: there is no need for an exhaustive list of people, places, and companies. The Salience Engine has been trained to recognize what a person, place, or company name looks like in content. This is very useful in discovery scenarios where you may not know all the company or person names that may occur in your content.

  • Advantage: out-of-the-box entity extraction with no additional effort
  • Disadvantage: Model-based entity extraction does not have 100% precision or recall
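If you want to put numbers on that disadvantage for your own content, one simple approach is to hand-label a sample of documents and compare against what the engine returned. The sketch below is plain Python and does not call Salience at all; the function name and the sample data are made up for illustration.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of entity names."""
    extracted, expected = set(extracted), set(expected)
    true_positives = len(extracted & expected)
    # Precision: what share of the extracted entities were correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what share of the expected entities were actually found.
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Made-up data for illustration only.
expected = {"John Smith", "Acme Corp", "Boston"}         # hand-labeled
extracted = {"John Smith", "Boston", "Today", "Monday"}  # what an extractor returned

precision, recall = precision_recall(extracted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low precision means the extractor is returning things that are not entities; low recall means it is missing entities you care about, which is where the list-based options below come in.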

Named entities are extracted by API calls to lxaGetNamedEntities and equivalents in the .NET, Java, PHP, and Python wrappers.

CDL files:

These are customer-defined list files that augment model-based named entity extraction. You might not know all the companies that will be mentioned, but you want to improve the pickup on the ones you do. In this case, you create a CDL file and place it in the folder:

data/user/salience/entities/companies

Any .cdl file in a subfolder under entities will get read into memory when the engine starts up.
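Since a misplaced file simply won't be picked up, it can help to check which files are in position before restarting the engine. Here is a small sketch in plain Python that lists the .cdl files sitting in subfolders of the entities directory. It only looks at the file system and does not reproduce how the engine itself loads the files; the function name is hypothetical and the path is the one above, relative to wherever your Salience data directory lives.

```python
from pathlib import Path


def find_cdl_files(entities_dir):
    """List .cdl files located in subfolders of entities_dir, sorted by path."""
    root = Path(entities_dir)
    if not root.is_dir():
        return []
    return sorted(
        path
        for path in root.rglob("*.cdl")
        # Skip files sitting directly in the entities folder itself.
        if path.is_file() and path.parent != root
    )


if __name__ == "__main__":
    # Assumed location, taken from the path above; adjust to your install.
    for cdl_file in find_cdl_files("data/user/salience/entities"):
        print(cdl_file)
```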

  • Advantage: tweaks out-of-the-box named entity extraction to your needs
  • Disadvantage: CDL files must be in the user directory to be read during session initialization; you can’t see the effect of changes unless the engine is reinitialized

The structure of CDL files is also documented on the developer wiki under the topic Customer-Defined Lists (CDL).

User-defined entities:

This is list-only entity extraction. In Salience 4.x, this was called “Simple Entities”. You tell Salience Engine which entities you are looking for, and it extracts those and no others. The format of the user entity list is the same as a CDL file, but because the user-defined list is specified as an option, it does not need to reside in a specific location in the user directory and, more importantly, can be changed without reinitializing the engine. Want a CDL file for extracting the names of Republican politicians? And a different one for Democratic politicians? No problem. Start up a Salience session, prepare the text, set the user-entity list for Republicans, get the user-defined entities, then set the user-entity list for Democrats, and get user-defined entities again.

  • Advantage: very dynamic, run multiple entity extraction customizations without reinitializing session
  • Disadvantage: differences in mention-linking, only extracts entities from list, no novel entities extracted

User-defined entities are extracted by API calls to lxaGetUserDefinedEntities and equivalents in the .NET, Java, PHP, and Python wrappers.
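To make the "one session, several lists" pattern concrete, here is a toy sketch in plain Python. Everything in it is a stand-in: ToySession and its methods are hypothetical names and are not the Salience API or its Python wrapper, the matching is naive whole-phrase lookup, and the lists are ordinary dictionaries rather than the CDL format. The names in the lists are made up. The only point is the call order: initialize once, prepare the text once, then swap lists between extractions.

```python
import re


class ToySession:
    """Hypothetical stand-in for an engine session; NOT the Salience API."""

    def __init__(self):
        self._text = ""
        self._entity_list = {}

    def prepare_text(self, text):
        self._text = text

    def set_user_entity_list(self, entity_list):
        # Maps an entity name to the surface forms to look for.
        self._entity_list = dict(entity_list)

    def get_user_defined_entities(self):
        """Return {entity: mention_count} for listed entities found in the text."""
        found = {}
        for entity, forms in self._entity_list.items():
            count = 0
            for form in forms:
                pattern = r"\b" + re.escape(form) + r"\b"
                count += len(re.findall(pattern, self._text, re.IGNORECASE))
            if count:
                found[entity] = count
        return found


# Made-up names and lists, for illustration only.
republicans = {"Jane Roe": ["Jane Roe", "Senator Roe"]}
democrats = {"John Doe": ["John Doe", "Governor Doe"]}

session = ToySession()  # initialize once
session.prepare_text("Senator Roe met John Doe. Jane Roe spoke first.")

session.set_user_entity_list(republicans)
print(session.get_user_defined_entities())

session.set_user_entity_list(democrats)  # swap lists, no reinitialization
print(session.get_user_defined_entities())
```

Note that the toy extractor only ever returns names that appear in the current list, which mirrors the "no novel entities" disadvantage above.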
