Feature description: Entity Type Folders

What are entity type folders?

Entity type folders are a new, lightweight method for extending the out-of-the-box named entity recognition (NER) capabilities of Salience. This approach, supported in Salience 6, lets users define new entity types through simple additions to the user data directory.

How are they different from other customizations to named entity recognition?

At its core, Salience uses a trained model for named entity recognition. The base model is trained to extract entities of the following types: Person, Place, Company, and Product. Users can enhance the extraction of these types using CDL (customer-defined list) files or patterns, and can customize entity results through normalization and the entity label, but the base model does not extract other types of entities.

Users can also create “user-defined entities” through the use of queries or lists. In older versions of Salience, these were known as “simple entities” or “query-defined entities”. User-defined entities offer more flexibility in customizing the entity type than extensions to named entities do, but the drawback is reduced mention linking.

In 2010, Lexalytics released a separate utility called the Entity Management Toolkit (which has since evolved into the Salience Workbench). This tool was designed for users to create their own trained named entity recognition models to use in parallel with the base model. However, it takes a large amount of content, effort, and tuning to produce a named entity recognition model that gives satisfactory results.

How do I create entity type folders?

An entity type folder simply needs to have a unique name, which will be used as the entity type, and a couple of default data files. Let’s say I’m interested in extracting the names of Television Show entities.

First, I create a folder named “television” within the user directory that will be used when analyzing content. This folder should be placed within <user directory root>/salience/entities.

At a minimum, the folder needs to include a rules.ptn file and a score.dat file. In almost all cases, you should also provide at least one CDL file. An example entity type folder for extracting television shows is provided at the end of this article under Related files.

File        Description
rules.ptn   Specifies a rule stating the entities are defined in one or more CDL files contained in this folder
score.dat   Provides rules for scoring entity candidates, including references to exclusion files (if provided) and normalization files (if provided)
shows.cdl   A standard CDL file containing the entities to extract

NOTE: There are a number of more advanced operations possible within rules.ptn and score.dat files. This article provides the base versions of these data files needed to support the entity type folders feature.
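
As a quick sanity check before running an analysis, the folder layout described above can be verified with a short script. The following is a minimal sketch in Python, not part of Salience itself: the function name check_entity_type_folder and its messages are hypothetical, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path

# Files the article says every entity type folder must contain.
REQUIRED_FILES = ("rules.ptn", "score.dat")


def check_entity_type_folder(user_dir, entity_type):
    """Return a list of problems found in the entity type folder (empty if none).

    Hypothetical helper: it inspects the file layout only and does not
    call into Salience or parse the data files.
    """
    folder = Path(user_dir) / "salience" / "entities" / entity_type
    if not folder.is_dir():
        return [f"missing folder: {folder}"]

    problems = [
        f"missing required file: {name}"
        for name in REQUIRED_FILES
        if not (folder / name).is_file()
    ]
    # A CDL file is not strictly required, but is expected in almost all cases.
    if not any(folder.glob("*.cdl")):
        problems.append("no CDL file found (recommended in almost all cases)")
    return problems
```

Calling check_entity_type_folder with the user directory root and "television" would return an empty list once rules.ptn, score.dat, and at least one CDL file such as shows.cdl are in place.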

Results from a sample piece of content

Let’s take the following piece of sample content:

                    TV Critics Vote 'West Wing' Top Show;
LOS ANGELES, Jan. 1 /PRNewswire/ -- NBC's award-winning series "The West Wing" has garnered 
another honor -- it's been named the best series on television in a nationwide survey of TV 
critics conducted by Electronic Media, the television industry newspaper.  Electronic Media 
published the full results of its Critics Poll in a copyrighted article Monday, January 1. 
"Malcolm in the Middle," a quirky new family comedy from Fox, finished second. The 
highest ranked new show was NBC's "Ed," followed by "Gilmore Girls" (WB), a freshman 
show that was launched as part of the Family Friendly Programming Forum initiative.  
Veteran series "Law & Order" (NBC) rounded out the critics' top five.     

Without the entity type folder in place, this produces the following named entity extraction results:

ENTITY                     TYPE        COUNT
NBC                        Company     3
Jan. 1                     Date        1
Los Angeles                Place       1
PRNewswire                 Company     1
January 1                  Date        1
Malcolm                    Person      1
Fox                        Company     1
Ed                         Company     1
"The West Wing"            Quote       1
"Malcolm In the Middle,"   Quote       1
"Ed,"                      Quote       1
"Gilmore Girls"            Quote       1
"Law & Order"              Quote       1

With the entity type folder in place, we see the improved extraction:

ENTITY                     TYPE        COUNT
NBC                        Company     3
Jan. 1                     Date        1
PRNewswire                 Company     1
Los Angeles                Place       1
The West Wing              Television  1
January 1                  Date        1
Malcolm In the Middle      Television  1
Fox                        Company     1
Ed                         Television  1
Gilmore Girls              Television  1
Law & Order                Television  1
"The West Wing"            Quote       1
"Malcolm In the Middle,"   Quote       1
"Ed,"                      Quote       1
"Gilmore Girls"            Quote       1
"Law & Order"              Quote       1

There are still some adjustments we may wish to make, such as addressing the television show names that are also identified as quotations. Even so, by adding just three files we have created an entirely new entity type for named entity extraction.
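
One way to handle the duplicated quotations is to filter them out after extraction. The sketch below is only an illustration in Python under stated assumptions: it treats the results as plain (entity, type, count) tuples copied from the listing above rather than as objects returned by the Salience API, and the function name drop_quotes_matching_type is hypothetical.

```python
def drop_quotes_matching_type(results, entity_type="Television"):
    """Remove Quote rows whose text matches an entity of the given type."""
    # Names already extracted as the target type, compared case-insensitively.
    known = {text.casefold() for text, etype, _ in results if etype == entity_type}

    kept = []
    for text, etype, count in results:
        # Strip surrounding quotation marks, commas, and spaces before comparing.
        bare = text.strip('", ').casefold()
        if etype == "Quote" and bare in known:
            continue
        kept.append((text, etype, count))
    return kept


# A few rows taken from the results listed in this article.
sample = [
    ("NBC", "Company", 3),
    ("The West Wing", "Television", 1),
    ("Ed", "Television", 1),
    ('"The West Wing"', "Quote", 1),
    ('"Ed,"', "Quote", 1),
]

for row in drop_quotes_matching_type(sample):
    print(row)
```

With this sample, only the Company and Television rows remain. Quotations that do not match a Television entity are left untouched, so genuine quotes in the content are not lost.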

Related files

television.zip: Zip file containing an example entity type folder for extracting television shows.
