Pattern (.ptn) files are made up of three tab-separated columns, corresponding to three phases in generating entities. First, candidate entities are found; then they are checked against an authority list; and finally the successful matches can be customized.
The first tab-delimited column is the pattern that generates candidate entities. This is written in our pattern syntax language. This could be a regex, a grammatical pattern, a string match, or some more complicated combination of these. It could also be **, which matches anything, if you're only interested in using the authority list.
See Pattern Syntax for more details on the pattern language features supported in pattern files.
The second column is the authority list file. This column can be empty if you want any candidate phrases to become entities. Enter the file name, without any path (it needs to reside in the same folder as the .ptn file). If you prefix the file name with a ~, it will be matched ignoring case differences. If you are not using an authority list, make sure you put two tabs down between the candidate pattern and the entity rules!
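The three-column layout described above can be sketched in Python. This is only an illustration of the file format, not how Salience itself reads .ptn files; the names `PatternLine` and `split_pattern_line` and the file name `companies.dat` are hypothetical.

```python
from typing import NamedTuple


class PatternLine(NamedTuple):
    pattern: str         # column 1: candidate pattern
    authority_file: str  # column 2: authority list file name, "" if unused
    ignore_case: bool    # True when the file name was prefixed with "~"
    rules: str           # column 3: comma-separated entity rules


def split_pattern_line(line: str) -> PatternLine:
    """Split one .ptn line into its three tab-separated columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 3:
        # A line with no authority list still needs two tabs,
        # which leaves an empty middle column.
        raise ValueError(f"expected 3 tab-separated columns, got {len(columns)}")
    pattern, authority, rules = columns
    ignore_case = authority.startswith("~")
    file_name = authority[1:] if ignore_case else authority
    return PatternLine(pattern, file_name, ignore_case, rules)


# Hypothetical lines: one with a case-insensitive authority list, one without.
with_list = split_pattern_line('**\t~companies.dat\tscore=80')
without_list = split_pattern_line('**\t\tlabel="misc"')
```

The second line shows why the two tabs matter: with only one tab, the rules would land in the authority list column.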
The final column sets data about any entity that has successfully matched both the candidate pattern and the authority list (if applicable). You can have multiple rules, comma separated. Strings should be surrounded by quotation marks. Here are the building blocks of rules:
- normalized: The preferred way of referring to this entity, particularly useful for entities with multiple names. For example, IBM is sometimes called Big Blue. By normalizing both to IBM, you are telling our software that these are the same company and share the same sentiment within a document. Normalization also makes it easier to analyze the data you get out of Salience.
- score: A value from 0 to 100, defining the confidence you have that this is indeed an entity. A threshold value is optional when requesting entities, defaulting to 55. You can use this value to differentiate your confidence on different entities. It can also be used to choose which set of rules to apply to an ambiguous entity that either matches multiple rules in a pattern or is a member of multiple types (like a company named for a person). When there's a conflict, the higher-scored rules are used.
- label: Label is additional free text attached to an entity. If the types we provide are insufficient, you can use this field to define your own entity type system. Alternatively, you can use label to provide additional information about an entity.
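The way score interacts with the threshold and with conflicting rules can be sketched as follows. This is a rough model under stated assumptions, not Salience's actual implementation: `RuleResult` and `resolve` are hypothetical names, the sample entities are made up, and treating a score equal to the threshold as passing is an assumption.

```python
from dataclasses import dataclass

DEFAULT_THRESHOLD = 55  # default threshold stated in the text


@dataclass
class RuleResult:
    normalized: str
    score: float
    label: str = ""


def resolve(candidates, threshold=DEFAULT_THRESHOLD):
    """Pick the highest-scored rule set, or None if it falls below the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda r: r.score)
    # Assumption: a score equal to the threshold is kept.
    return best if best.score >= threshold else None


# An ambiguous mention that matched both a company rule and a person rule.
as_company = RuleResult("Ford Motor Company", 80, "company")
as_person = RuleResult("Henry Ford", 60, "person")
winner = resolve([as_company, as_person])
```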
You can set these fields to fixed values, but you can also use various functions to look up values from files or otherwise make dynamic calculations.
- mention: refers back to the candidate phrase that generated this entity. This is usually used to look up values in another file, but can also be used directly. For example, normalized=mention+"_company" could be used to append _company to a company's name.
- length: returns the number of tokens in the entity, if you wish to conditionally score based on that.
- hashset(value, file, column, key = mention, case_sensitive = true),
- hash(file,column,key,default,case_sensitive = true): value is the information to set. file is a tab-delimited file where the first column is a key and any other column contains some other piece of data. column specifies which column to take data from, and default is returned if the key is not found. Usually you use mention for the key. An example call would be hashset(normalized,normalization.dat,1,mention,mention,false), which passes mention as both the key and the default: it checks normalization.dat for the normalization value, using the original text if no normalization is given. hashset is used when you just want to assign some value from a file to the entity; hash is used when you need to do some additional transformation on the returned value, such as comparing it to a value or using it as a key into another file.
- call(file): Runs the entity through the passed-in file as if it were a .ptn file. That is, the patterns defined in the given file will be matched against the text that makes up the mention. If a pattern matches, the rules are applied to it. This is very useful if you want to apply different rules conditionally on the content, or if you have a very complicated rule you want to apply to every entity. As with all pattern functions, the file should be referred to just by name, and reside in the same directory as the primary .ptn file.
- query(value): Runs the query 'value' against the document, and returns true or false depending on whether it matches. This feature is intended to allow you to create rules that are evaluated conditionally on the context the mention occurs in. For example, if you're interested in the Yankees but want to make sure it's the sports team, you could do score = query("baseball OR sports OR homerun") ? 100.0 : 0.
- strip(text,characters): Takes the text and removes any characters that appear in the characters field. For example, normalized = strip(mention,"1234567890") would remove digits from an entity.
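The lookup behavior of hash and the filtering done by strip can be approximated in Python as below. This is a sketch of the described semantics only: `load_table`, `hash_lookup`, and `strip_chars` are hypothetical names, the table contents are invented, and counting columns from zero (so that the key is column 0 and the example's column 1 is the first data column) is an assumption based on the example call.

```python
def load_table(text):
    """Parse tab-delimited text into rows; the first field of each row is the key."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def hash_lookup(rows, column, key, default, case_sensitive=True):
    """Return the requested column for the matching key, or default if absent."""
    fold = (lambda s: s) if case_sensitive else str.casefold
    wanted = fold(key)
    for row in rows:
        if fold(row[0]) == wanted and column < len(row):
            return row[column]
    return default


def strip_chars(text, characters):
    """Remove every character of `characters` from `text`."""
    return "".join(ch for ch in text if ch not in characters)


# Stand-in for the contents of a normalization file.
table = load_table("big blue\tIBM\nibm\tIBM\n")

# Mirrors hashset(normalized,normalization.dat,1,mention,mention,false).
mention = "Big Blue"
normalized = hash_lookup(table, 1, mention, mention, case_sensitive=False)

# Mirrors normalized = strip(mention,"1234567890").
digits_removed = strip_chars("Boeing747", "1234567890")
```

Passing the mention as the default means an entity missing from the file keeps its original text instead of ending up with an empty normalized form.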
Math and Logic:
>,>=,==,<,<=,!= : compare two numbers or strings
+: Add two numbers, or concatenate two strings
&&: AND of two boolean values
|: Bitwise or
&: Bitwise and
x?y:z : if x is true, evaluates and returns the value of y, else evaluates and returns z.
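To make the conditional operator concrete, the Yankees rule from the query description can be mirrored in Python. This is a loose sketch: the real query language is not modeled here, only an OR over plain terms, and whole-word, case-insensitive matching is an assumption. `query_any` and `yankees_score` are hypothetical names.

```python
import re


def query_any(document, terms):
    """Return True if any term occurs as a whole word in the document."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", document, re.IGNORECASE)
        for term in terms
    )


def yankees_score(document):
    # Equivalent of: score = query("baseball OR sports OR homerun") ? 100.0 : 0
    return 100.0 if query_any(document, ["baseball", "sports", "homerun"]) else 0


sports_doc = "The Yankees won the baseball game last night."
history_doc = "Yankees and Confederates met at Gettysburg."
scores = (yankees_score(sports_doc), yankees_score(history_doc))
```

As in the x?y:z form, only the chosen branch determines the result, so the sports document gets full confidence and the history document is scored at zero.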
Pattern files should be placed in /data/user/salience/entities/X/, where X is one of the folders shipped. That folder will be the type of the entity that is created. New types cannot be created by creating a new folder: the set of types is fixed. If you need a different categorization scheme for entities, use labels.