Concept Grammar

Concept Topics use the Lexalytics Concept Matrix to support "fuzzy" categorization. With a traditional boolean query, you specify exactly which terms must or must not appear in a matching document. With the Concept Matrix, we can support conceptual matches that return documents related to your query, even if different terms are used. For example, a query for 'food' in a boolean search engine would return only articles using the specific word 'food', but in a Concept Topic you'd also get results that discuss pizza, sandwiches, and other terms we associate with food.

File Syntax:

A concept topic is made up of three tab-delimited columns, the third of which is optional. The first column is the Topic Name. This is the string that will be reported in your result structures. The second column is the query. This defines exactly what you're looking for. The third (optional) column is the weight. The score returned by the engine is multiplied by this value before it is reported.
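
As an illustration of the column layout described above, here is a minimal Python sketch that splits one line of a concept topic file into its parts. The names ConceptTopic and parse_topic_line are hypothetical and not part of any Lexalytics API, and treating a missing weight as 1.0 is an assumption made for the example.

```python
from typing import NamedTuple


class ConceptTopic(NamedTuple):
    """Hypothetical container for one line of a concept topic file."""
    name: str
    query: str
    weight: float


def parse_topic_line(line: str) -> ConceptTopic:
    """Split one tab-delimited line into topic name, query, and weight."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (2, 3):
        raise ValueError(
            f"expected 2 or 3 tab-delimited columns, got {len(fields)}"
        )
    name, query = fields[0].strip(), fields[1].strip()
    # Assumption: a missing or empty weight column acts as a multiplier of 1.0.
    has_weight = len(fields) == 3 and fields[2].strip() != ""
    weight = float(fields[2]) if has_weight else 1.0
    return ConceptTopic(name, query, weight)


if __name__ == "__main__":
    print(parse_topic_line("Food\tfood NOT meat, animal"))
    print(parse_topic_line("Education\teducation/1.3, learning, pedagogy\t1.1"))
```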

Query Syntax:

text

The basic query syntax is simply words and phrases expressing the idea you'd like to match. You can use commas to break up separate ideas. 'oil paint' is looking for articles about the artistic medium, while 'oil, paint' is looking for articles about petroleum and/or any type of paint. The list can be as long as you'd like, although many short queries usually outperform very long, detailed ones. A perfect match would be related to all the terms given, but partial matches can occur where the article is only related to a subset of the terms.

_ (underscore)

When the Concept Matrix is given a phrase, it matches both the phrase form and the individual words. Thus 'power plant', while matching stories about electric generation most strongly, may also pull in articles about plant life. In most queries the individual words in a phrase are related and contribute positively. But in cases where the individual words mean something different on their own, the underscore instructs the engine to use only the phrase form. Thus 'power_plant' will not match articles about flowers at all.

NOT

NOT excludes certain ideas from consideration. This operator is primarily intended for narrowing down the meanings of words and phrases, or otherwise limiting the scope implied by a word or phrase. For example: 'food NOT meat, animal' would not match an article about hamburgers. However, it will still match an article that talks about salads and hamburgers (just not as strongly as it otherwise would have).

CONTEXT

While NOT excludes certain ideas implied by a query, CONTEXT highlights certain ideas. Consider the case where you are interested in automobile manufacturing. The query 'automobile, manufacturing' is likely to get relevant results, but may also pull in articles about manufacturing in general. The query 'automobile_manufacturing' is highly specific, but possibly overly so. The query 'automobile CONTEXT manufacturing' will result in a search for automobiles in general, with a focus on manufacturing. It will not return results that are only about manufacturing. Specifically, the text to the left of CONTEXT supplies the general idea being searched for, and the text to the right supplies the ideas you want that topic to be discussed with.

[Boolean filter queries]

In addition to a list of terms to aid in defining the concept you are looking to match, boolean query logic can be included in a concept topic definition to provide a level of specific filtering for concept matches. Boolean queries are enclosed within brackets in the concept topic definition. For example:
(note: this query should be on a single line in your concept topic file)
Display Quality<tab>Display_quality [smartphone OR iphone OR android]
NOT ["display booth" OR "display version"]<tab>1.2
The above query would match content that conceptually discusses "display quality" and specifically uses the word smartphone, iphone, or android. Content that additionally contains the phrase "display booth" or "display version" would be excluded from a match to this concept.
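
Because a definition like this one is easy to wrap by accident, it can help to assemble the line programmatically. The following is a minimal Python sketch, assuming you generate your concept topic file from code; format_topic_line is a hypothetical helper, not a Lexalytics function. It joins the columns with real tab characters and collapses any line breaks inside the query so the definition stays on a single line.

```python
def format_topic_line(name, query, weight=None):
    """Build one tab-delimited concept topic line (hypothetical helper)."""
    # Collapse line breaks and repeated whitespace inside the query so the
    # whole definition ends up on one physical line.
    columns = [name.strip(), " ".join(query.split())]
    if weight is not None:
        # The weight column is optional, so it is only written when given.
        columns.append(str(weight))
    return "\t".join(columns)


if __name__ == "__main__":
    wrapped_query = (
        "Display_quality [smartphone OR iphone OR android]\n"
        'NOT ["display booth" OR "display version"]'
    )
    print(format_topic_line("Display Quality", wrapped_query, 1.2))
```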

Weight

Your queries will be expanded out to a large set of related concepts. The breadth of this set affects how strongly documents are likely to match. A very specific, well-defined query ("rocket fuel") will tend to match documents very strongly, or not at all. A very broad query ("business") tends to match many documents to a limited degree. While a single document can exhaustively discuss most of the ideas related to rocket fuel, the same cannot be said of business. The weight column allows you to put queries of different specificity on the same scale. The weight is a real-valued number (that is, decimal places are allowed) by which all scores for the topic are multiplied. Our recommended best practice is to choose a target baseline for a good match (0.7, perhaps), find a handful of documents matching each topic, run them against the concept topic, and calculate the weight that causes the topic's match scores to average to that baseline.
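
The calibration step described above comes down to simple arithmetic: since every score is multiplied by the weight, the weight that brings the average score to the baseline is the baseline divided by the average unweighted score. A minimal Python sketch follows; calibrate_weight is a hypothetical name, and the scores in the example are made-up illustrative values, not engine output.

```python
from statistics import mean


def calibrate_weight(raw_scores, baseline=0.7):
    """Return the weight that makes the weighted scores average to baseline.

    raw_scores are the unweighted scores (weight 1.0) of documents that are
    known to be good matches for the topic.
    """
    if not raw_scores:
        raise ValueError("need at least one score to calibrate against")
    average = mean(raw_scores)
    if average <= 0:
        raise ValueError("average score must be positive")
    return baseline / average


if __name__ == "__main__":
    # Illustrative values only: a broad topic that matches weakly.
    broad_topic_scores = [0.2, 0.3, 0.25]
    weight = calibrate_weight(broad_topic_scores, baseline=0.7)
    print(f"suggested weight: {weight:.2f}")
```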

Term Weighting

In addition to adjusting the weight for the overall query, you can add term weights to individual terms within a query to emphasize stronger relevance to the concept. The syntax for term weighting is shown below applied to the first keyword "education" in the concept topic definition:

Education<tab>education/1.3, learning, pedagogy<tab>1.1
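
To make the term/weight notation concrete, here is a minimal Python sketch that reads the weights out of a simple comma-separated query such as the one above. parse_term_weights is a hypothetical helper. It assumes that an unweighted term is equivalent to a weight of 1.0, and it handles only plain term lists, not NOT, CONTEXT, or bracketed boolean filters.

```python
from typing import List, Tuple


def parse_term_weights(query: str) -> List[Tuple[str, float]]:
    """Split a comma-separated query into (term, weight) pairs."""
    parsed = []
    for chunk in query.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The weight, if present, follows the last slash in the term.
        term, slash, weight_text = chunk.rpartition("/")
        if slash and term.strip():
            try:
                parsed.append((term.strip(), float(weight_text)))
                continue
            except ValueError:
                pass  # Not a number after the slash: keep the chunk whole.
        # Assumption: terms without an explicit weight count as 1.0.
        parsed.append((chunk, 1.0))
    return parsed


if __name__ == "__main__":
    print(parse_term_weights("education/1.3, learning, pedagogy"))
```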

Best Practices

Although you can create long, complicated concept queries, we do not recommend the practice. Besides being an issue for maintenance, results generally degrade with complexity. A better solution is often to break very broad topics into smaller subtopics, and map these subtopics together in a consuming application. That is, rather than having a single query for all of 'Business', you may have better results pulling in 'human resources', 'regulatory compliance', etc. separately.

Ad hoc analysis of results is appropriate during rapid, informal development of concept queries. However, when you're looking for optimal performance and robust results, formal testing is encouraged. By annotating documents with the topics you expect each of them to match, then calculating precision and recall for each query, you can use good statistical data to guide decisions. With the annotations you can make small changes to query definitions and weightings and see whether your results improve. You can also identify concepts the Concept Matrix has difficulty with, and address those queries specially (either by breaking the topic into more specific subtopics, or by falling back to more traditional query-based topics). Precision and recall are calculated as follows:

Precision = # of correct matches / (# of correct matches + # of incorrect matches)

Recall = # of correct matches / (# of correct matches + # of matches missed)
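
The two formulas above can be computed for one topic from a set of annotated documents, as in the following minimal Python sketch. The function name and document ids are hypothetical, and returning 0.0 when a denominator would be zero is a choice made for the example rather than part of the definitions.

```python
def precision_recall(matched, expected):
    """Compute (precision, recall) for a single topic.

    matched:  ids of documents the concept topic actually matched
    expected: ids of documents annotated as belonging to the topic
    """
    matched, expected = set(matched), set(expected)
    correct = len(matched & expected)    # matched and annotated
    incorrect = len(matched - expected)  # matched but not annotated
    missed = len(expected - matched)     # annotated but not matched
    # Assumption: report 0.0 instead of dividing by zero.
    precision = correct / (correct + incorrect) if matched else 0.0
    recall = correct / (correct + missed) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    matched_docs = {"doc1", "doc2", "doc3", "doc4"}
    annotated_docs = {"doc1", "doc2", "doc5"}
    p, r = precision_recall(matched_docs, annotated_docs)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Rerunning a calculation like this after each change to a query definition or weight gives a direct view of whether the change helped.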