Concept Topics use the Lexalytics Concept Matrix to support "fuzzy" categorization. With a traditional Boolean query, you specify exactly which terms must or must not appear in a matching document. With the Concept Matrix, we can support conceptual matches that return documents related to your query, even if they use different terms. For example, a query for 'food' in a Boolean search engine would return only articles containing the specific word 'food', but a Concept Topic would also return results that discuss pizza, sandwiches, and other phrases we associate with food.
A concept topic is made up of three tab-delimited columns, the third of which is optional. The first column is the Topic Name. This is the string that will be reported in your result structures. The second column is the query. This defines exactly what you're looking for. The third (optional) column is the weight. The score returned by the engine is multiplied by this value before it is reported.
[Boolean filter queries]
Display Quality<tab>Display_quality [smartphone OR iphone OR android] NOT ["display booth" OR "display version"]<tab>1.2
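The three-column layout described above can be read with a few lines of Python. This is a minimal sketch, not part of any Lexalytics tooling: the names ConceptTopic and parse_concept_topic are hypothetical, <tab> is taken to stand for a literal tab character, and treating an omitted weight as 1.0 (score left unchanged) is an assumption.

```python
from typing import NamedTuple


class ConceptTopic(NamedTuple):
    """One concept topic definition (hypothetical helper type)."""
    name: str      # column 1: reported in result structures
    query: str     # column 2: what to look for
    weight: float  # column 3: multiplier applied to the score


def parse_concept_topic(line: str) -> ConceptTopic:
    """Split one tab-delimited definition into name, query, and weight."""
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) not in (2, 3):
        raise ValueError(
            f"expected 2 or 3 tab-delimited columns, got {len(columns)}"
        )
    name, query = columns[0].strip(), columns[1].strip()
    # Assumption: a missing or empty weight column behaves like 1.0.
    has_weight = len(columns) == 3 and columns[2].strip() != ""
    weight = float(columns[2]) if has_weight else 1.0
    return ConceptTopic(name, query, weight)


topic = parse_concept_topic("Education\teducation, learning, pedagogy\t1.1")
print(topic.name, topic.weight)
```

A check like this can catch definitions where a tab was accidentally replaced by spaces, since such a line no longer splits into two or three columns.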
Your queries will be expanded out to a large set of related concepts. The breadth of this set affects how strongly documents are likely to match. A very specific, well-defined query ("rocket fuel") will tend to match documents very strongly, or not at all. A very broad query ("business") tends to match many documents to a limited degree. While a single document can exhaustively discuss most of the ideas related to rocket fuel, the same cannot be said of business. The weight column allows you to put queries of different specificity on the same scale. The weight is a real-valued number (that is, decimal places are allowed) by which all scores for the topic are multiplied. Our recommended best practice is to choose a target baseline for a good match (0.7, perhaps), find a handful of documents that should match each topic, run them against the concept topic, and calculate the weight that brings the average score of those matches to that baseline.
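The calibration step above amounts to dividing the target baseline by the average unweighted score. The sketch below illustrates that arithmetic under stated assumptions: calibrate_weight is a hypothetical helper, the scores are made-up illustrative numbers rather than real engine output, and they are assumed to have been collected with the topic's weight set to 1.0.

```python
from statistics import mean


def calibrate_weight(raw_scores, baseline=0.7):
    """Return the weight that moves the average raw score to the baseline."""
    if not raw_scores:
        raise ValueError("need at least one score to calibrate against")
    average = mean(raw_scores)
    if average <= 0:
        raise ValueError("average score must be positive")
    # weight * average == baseline, so weight == baseline / average
    return baseline / average


# Made-up unweighted scores for documents known to match each topic.
rocket_fuel_scores = [0.9, 0.85, 0.95]   # specific query: strong matches
business_scores = [0.25, 0.5, 0.75]      # broad query: weaker matches

print(round(calibrate_weight(rocket_fuel_scores), 2))
print(round(calibrate_weight(business_scores), 2))
```

The specific query ends up with a weight below 1 and the broad query with a weight above 1, which is what puts the two on the same scale.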
In addition to adjusting the weight for the overall query, you can add weights to individual terms within a query to emphasize the terms most relevant to the concept. The syntax for term weighting is shown below, applied to the first keyword, "education", in the concept topic definition:
Education<tab>education/1.3, learning, pedagogy<tab>1.1
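To audit term weights across many definitions, the query column of a simple comma-separated query can be broken into terms and weights. This sketch infers the syntax from the single example above (a term followed by a slash and a number); parse_weighted_terms is a hypothetical helper, the default of 1.0 for unweighted terms is an assumption, and Boolean filter syntax is not handled.

```python
from typing import Dict


def parse_weighted_terms(query: str) -> Dict[str, float]:
    """Map each comma-separated term to its weight."""
    terms: Dict[str, float] = {}
    for part in query.split(","):
        part = part.strip()
        if not part:
            continue
        # Split on the last slash so "education/1.3" -> ("education", "1.3").
        term, separator, weight = part.rpartition("/")
        if separator and term:
            terms[term.strip()] = float(weight)
        else:
            # Assumption: a term without an explicit weight counts as 1.0.
            terms[part] = 1.0
    return terms


print(parse_weighted_terms("education/1.3, learning, pedagogy"))
```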
Although you can create long, complicated concept queries, we do not recommend the practice. Besides being an issue for maintenance, results generally degrade with complexity. A better solution is often to break very broad topics into smaller subtopics, and map these subtopics together in a consuming application. That is, rather than having a single query for all of 'Business', you may have better results pulling in 'human resources', 'regulatory compliance', etc. separately.
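Mapping subtopics together in a consuming application can be as simple as a lookup table. The following is a sketch under stated assumptions: the topic names, the SUBTOPIC_TO_PARENT table, and the roll_up function are hypothetical, the input is assumed to be a mapping of topic name to reported score for one document, and taking the maximum subtopic score as the parent score is one possible choice rather than a recommendation from the engine.

```python
# Hypothetical mapping maintained by the consuming application.
SUBTOPIC_TO_PARENT = {
    "Human Resources": "Business",
    "Regulatory Compliance": "Business",
}


def roll_up(topic_scores, mapping=SUBTOPIC_TO_PARENT):
    """Combine subtopic scores for one document into parent topic scores."""
    rolled = {}
    for topic, score in topic_scores.items():
        # Topics without a parent entry are reported under their own name.
        parent = mapping.get(topic, topic)
        # Assumption: the strongest subtopic match represents the parent.
        rolled[parent] = max(score, rolled.get(parent, score))
    return rolled


document_scores = {
    "Human Resources": 0.4,
    "Regulatory Compliance": 0.8,
    "Education": 0.6,
}
print(roll_up(document_scores))
```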
Ad hoc analysis of results is appropriate during rapid, informal development of concept queries. However, when you're looking for optimal performance and robust results, formal testing is encouraged. By annotating documents with the topics you expect each one to match, then calculating precision and recall for each query, you can use sound statistical data to guide decisions. With the annotations in place, you can make small changes to query definitions and weightings and see whether your results improve. You can also identify concepts the Concept Matrix has difficulty with, and handle those queries specially (either by breaking the topic into more specific subtopics, or by falling back to more traditional query-based topics). Precision and recall are calculated as follows:
Precision = # of correct matches / (# of correct matches + # of incorrect matches)
Recall = # of correct matches / (# of correct matches + # of matches missed)
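The two formulas can be computed directly from sets of document identifiers for a single topic. This is a minimal sketch: precision_recall is a hypothetical helper, the document IDs are made up, and returning 0.0 when a denominator would be zero is an assumed convention.

```python
def precision_recall(matched, annotated):
    """Compute precision and recall for one topic.

    matched:   set of document IDs the concept topic matched
    annotated: set of document IDs a human marked as belonging to the topic
    """
    correct = len(matched & annotated)     # matched and annotated
    incorrect = len(matched - annotated)   # matched but not annotated
    missed = len(annotated - matched)      # annotated but not matched
    # Assumption: report 0.0 rather than dividing by zero.
    precision = correct / (correct + incorrect) if matched else 0.0
    recall = correct / (correct + missed) if annotated else 0.0
    return precision, recall


matched = {"doc1", "doc2", "doc3", "doc4"}
annotated = {"doc1", "doc2", "doc5"}
precision, recall = precision_recall(matched, annotated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Rerunning the same calculation after each change to a query definition or weight shows whether the change helped, as long as the annotated set stays fixed between runs.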