Query Grammar

The query grammar used by Lexalytics supports the following elements.

Query operators

Note that operators MUST be capitalized; otherwise they will be treated as query terms. Query operators must also be preceded and followed by query terms or query phrases.

OR Operator

Inside a query, the OR operator may be used to retrieve documents containing either of two terms:
	onions OR cheese

AND Operator

Inside a query, the AND operator may be used to retrieve documents containing both specified terms:
	onions AND cheese

NEAR Operator

A NEAR operator is effectively an AND operator where you can control the distance between the words.
	onions NEAR cheese
means that the term cheese must exist within 10 tokens of onions. You can vary the distance the NEAR operator uses by adding a numeric suffix, such as NEAR/50, which means the terms must exist within 50 tokens of each other. This window can be between 0 and 99. If you do not specify the NEAR window, the default is NEAR/10.
NOTE: Salience releases prior to Salience 5.2 do not allow NEAR/0.
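
As a rough illustration of the window semantics described above, the following Python sketch checks whether two terms fall within a NEAR window. It is not Salience's implementation: the helper names tokenize and near, the simple regex tokenizer, and the choice to count distance as the number of tokens between the two terms are all assumptions made for this example, and stemming is ignored.

```python
import re

def tokenize(text):
    # Assumption: lowercase word tokens; the real tokenizer and stemmer differ.
    return re.findall(r"[a-z0-9']+", text.lower())

def near(text, left, right, window=10):
    """Return True if `left` and `right` occur within `window` tokens of each other."""
    if not 0 <= window <= 99:
        raise ValueError("NEAR window must be between 0 and 99")
    tokens = tokenize(text)
    left_pos = [i for i, t in enumerate(tokens) if t == left]
    right_pos = [i for i, t in enumerate(tokens) if t == right]
    # Assumption: distance = number of tokens between the two terms,
    # so a window of 0 means the terms are adjacent.
    return any(abs(i - j) - 1 <= window
               for i in left_pos for j in right_pos if i != j)

far_apart = "onions " + "filler " * 20 + "cheese"
print(near("onions and cheese", "onions", "cheese"))   # default window of 10
print(near(far_apart, "onions", "cheese"))             # 20 tokens in between
print(near(far_apart, "onions", "cheese", window=50))  # wider window
```

Under these assumptions, the second call fails with the default window and the third succeeds once the window is widened to 50.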

WITH Operator

A WITH operator requires that the two terms occur within the same sentence. It behaves like a NEAR operator, except that the match window is the sentence rather than a specified number of tokens.
	onions WITH cheese
means that the term cheese must exist within the same sentence as onions.

NOT Operator

The NOT operator excludes any documents containing the term which follows it:
	onions NOT celery
Note that a query must contain at least one non-excluded term.

NOTWITH Operator

The NOTWITH operator requires that the left hand side appear in a sentence that does not contain the right hand side.
	onions NOTWITH celery
means that onions must appear in a sentence that does not contain the word celery. This is different from onions NOT (onions WITH celery), as that query will reject a document that has onions and celery together in a sentence, even if onions also appears separately on its own. For example:
"I like onions. I especially like onions with celery."
This text would hit onions NOTWITH celery (since the first sentence has onions but not celery), while onions NOT (onions WITH celery) would not hit (because a sentence contains both onions and celery).
NOTE: Operator added in Salience 6.2.
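
The difference between NOTWITH and the NOT (... WITH ...) combination can be sketched in Python as follows. This is only an illustration of the sentence-level logic: the function names, the naive sentence splitter, and the exact-word matching are assumptions for this example and do not reflect how Salience segments or stems text.

```python
import re

def sentences(text):
    # Assumption: naive split on ., ! and ?; each sentence becomes a set of lowercase words.
    parts = re.split(r"[.!?]+", text.lower())
    return [set(re.findall(r"[a-z0-9']+", p)) for p in parts if p.strip()]

def with_(text, left, right):
    # left WITH right: some sentence contains both terms.
    return any(left in s and right in s for s in sentences(text))

def notwith(text, left, right):
    # left NOTWITH right: some sentence contains left but not right.
    return any(left in s and right not in s for s in sentences(text))

def not_with_combo(text, left, right):
    # left NOT (left WITH right): left occurs, and no sentence contains both terms.
    has_left = any(left in s for s in sentences(text))
    return has_left and not with_(text, left, right)

text = "I like onions. I especially like onions with celery."
print(notwith(text, "onions", "celery"))         # first sentence qualifies
print(not_with_combo(text, "onions", "celery"))  # rejected by the second sentence
```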

NOTNEAR Operator

The NOTNEAR operator requires that the left hand side appear in the document separated from the right hand side. By default, it cannot appear within 10 tokens of the right hand side, but the syntax NOTNEAR/N can be used to change this window.
	onions NOTNEAR/2 celery
means that onions must appear somewhere in the document at least two tokens away from celery. This is different than onions NOT (onions NEAR/2 celery), as that query will reject a document that has onions and celery together, even if onions separately appears on its own. For example:
"I like onions. I especially like onions with celery."
This text would hit onions NOTNEAR/2 celery (since the first sentence has onions but not celery), while onions NOT (onions NEAR/2 celery) would not hit (because onions and celery appear together somewhere in the document).
NOTE: Operator added in Salience 6.2.

EXCLUDE Operator

Two query terms of any type may be joined by an EXCLUDE operator. For example:
      York EXCLUDE "New York"
The effect differs from that of the NOT operator. The intent in the example above is to hit text that contains the word "York", excluding text in which "York" occurs only as part of "New York". Consider the following sample text:
I spent the day in York, visiting the magnificent cathedral. Then it was time to head back to London for my flight home to New York.
This text would generate the following results for the provided queries:
  • York NOT "New York": FALSE
  • York EXCLUDE "New York": TRUE
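
One way to picture the difference is to treat EXCLUDE as masking out the excluded phrase before looking for the term, while NOT rejects the whole document. The Python sketch below follows that reading; the function names and the position-masking approach are assumptions made for illustration, not a description of Salience internals.

```python
import re

def tokenize(text):
    # Assumption: lowercase word tokens, no stemming.
    return re.findall(r"[a-z0-9']+", text.lower())

def phrase_positions(tokens, phrase):
    # Start indexes at which the (possibly multi-word) phrase occurs.
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]

def exclude(text, term, excluded_phrase):
    # term EXCLUDE phrase: term occurs at least once outside the excluded phrase.
    tokens = tokenize(text)
    n = len(excluded_phrase.split())
    covered = set()
    for start in phrase_positions(tokens, excluded_phrase):
        covered.update(range(start, start + n))
    return any(i not in covered for i in phrase_positions(tokens, term))

def not_(text, term, excluded_phrase):
    # term NOT phrase: term occurs and the phrase occurs nowhere in the text.
    tokens = tokenize(text)
    return bool(phrase_positions(tokens, term)) and not phrase_positions(tokens, excluded_phrase)

sample = ("I spent the day in York, visiting the magnificent cathedral. "
          "Then it was time to head back to London for my flight home to New York.")
print(not_(sample, "york", "new york"))
print(exclude(sample, "york", "new york"))
```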

Parentheses

Queries can use parentheses to control the logic of the query:
	((onions OR cheese) AND celery) NOT horrible
	(onions OR cheese) NEAR (horrible OR disgusting)
Notes on use of parentheses:
  1. Every left parenthesis must have a corresponding right parenthesis.
  2. You can nest parentheses up to 10 levels deep.
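
The two rules above can be checked before a query is submitted. The following Python sketch is a hypothetical pre-check, not part of Salience; it assumes that parentheses inside double-quoted phrases should be ignored and that a backslash escapes a quote.

```python
def check_parentheses(query, max_depth=10):
    """Return None if parentheses are balanced and nested at most
    max_depth levels deep, otherwise a short error message."""
    depth = 0
    in_phrase = False
    prev = ""
    for ch in query:
        if ch == '"' and prev != "\\":
            # Entering or leaving a quoted phrase (assumption: \" is an escaped quote).
            in_phrase = not in_phrase
        elif not in_phrase:
            if ch == "(":
                depth += 1
                if depth > max_depth:
                    return f"nesting deeper than {max_depth} levels"
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return "right parenthesis without a matching left parenthesis"
        prev = ch
    if depth > 0:
        return "left parenthesis without a matching right parenthesis"
    return None

print(check_parentheses("((onions OR cheese) AND celery) NOT horrible"))
print(check_parentheses("(onions OR cheese"))
```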

Subqueries

Queries can be referenced by other queries using a special character (^):
	Subquery1	"Concept Matrix" OR "Concept Matrices"
        Subquery2	Collections AND Facets AND "Additional Language Support"
        MainQuery1	Salience AND "5.0" OR ^Subquery1
        MainQuery2	^Subquery1 NEAR/10 ^Subquery2
Notes on use of subqueries:
  1. A query must be defined above another query that uses it as a subquery.
  2. A query that is to be referenced by another query cannot contain whitespace in the query label (e.g. Subquery1, not Subquery 1)
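
The ordering rule for subqueries lends itself to a simple check over a query file. The Python sketch below is a hypothetical helper, assuming one query per line in label<tab>query form; it treats every ^ followed by non-whitespace characters as a subquery reference and does not account for escaped ^ characters inside quoted phrases.

```python
import re

def check_subqueries(lines):
    """Return a list of error messages for subquery references that are
    not defined on an earlier line. `lines` are 'label<TAB>query' strings."""
    errors = []
    defined = set()
    for line in lines:
        label, _, query = line.strip().partition("\t")
        # A reference ends at whitespace or a parenthesis, which is also why
        # a label containing whitespace could never be referenced in full.
        for ref in re.findall(r"\^([^\s()]+)", query):
            if ref not in defined:
                errors.append(f"{label}: ^{ref} is not defined above this query")
        defined.add(label)
    return errors

queries = [
    'Subquery1\t"Concept Matrix" OR "Concept Matrices"',
    'Subquery2\tCollections AND Facets AND "Additional Language Support"',
    'MainQuery1\tSalience AND "5.0" OR ^Subquery1',
    'MainQuery2\t^Subquery1 NEAR/10 ^Subquery2',
]
print(check_subqueries(queries))
```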

Metadata criteria

Certain metadata criteria can be included by enclosing accepted keywords within braces:
{<entity|document> <entity type> : <sentiment criteria>}
The syntax above allows for the first component of the metadata criteria to be either entity or document.
If the first component is entity, it may be followed by an entity type. This may be any of the entity types supported by the Salience entity extraction model, such as company, person, place, or product.
Optionally, a sentiment criteria component may be added. Sentiment criteria can be a comparison of document or entity sentiment to a single value, or a range.
Based on these specifications, the following metadata query phrases are valid:
"merger announcement" NEAR/5 {entity company}
"merger announcement" AND {document: sentiment > 0.2}
"merger announcement" AND {entity company: sentiment > 0.2}
"merger announcement" AND {entity company: 0 < sentiment < 0.25}
NOTE: Prior to Salience 6.2, only Named Entities were supported in metadata queries. Now User Entities and Named Entities will both be matched by a metadata query.
NOTE: The NEAR and WITH operators assume usage with text-level elements, so it is not valid to use the {document: sentiment} construction with these query operators. For example:
  • "merger announcement" NEAR/5 {entity company} : Valid
  • "merger announcement" NEAR/5 {document: sentiment > 0.2} : Invalid

Query terms

Single query terms

Single query terms are the simplest query element, consisting of a single word:
	broccoli
Notes on query terms:
  1. A single query term cannot be a word that appears in a stopword list or an operator.
  2. A single query term cannot contain punctuation or other special characters:
     ` ! @ # * $ % ^ ( ) _ = ~ + [ ] { } | " ' : ; . , < > ? / \

Phrase searches

Phrases may be enclosed in double quotes:
	"broccoli cheese"
Notes on phrase searches:
  1. Double quotes must begin and end a query phrase.
  2. When a single word is enclosed in quotes, it is not treated as a phrase search. It is treated like a single word, as if it were not in quotes. (Ex. "broccoli" = broccoli)
  3. The special characters listed above as invalid for single query terms may be used within query phrases provided they are escaped using a \ character.
  4. NOTE: Salience 5.2 and above allow @ and # to be used within quoted query phrases without escaping.
  5. For multi-word phrase searches, only the right-most word is stemmed. The query process will not stem all words within the multi-word phrase. (Ex. "driving on faster roads" will match "driving on faster road-" but will not match "driving on fast- roads")

Wildcards

A wildcard character may be used at the end of a single word query term, or at the end of a phrase. For example:
  • excit* : This would match excite, exciting, excitement, etc.
  • "health agen*" : This would match "health agenda" or "health agency" or "health agencies".
Note that a wildcard query must have a prefix of at least three letters:
  • d* : Invalid
  • do* : Invalid
  • dog* : Valid
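
The wildcard rules above can be expressed as a small validity check. The Python sketch below is hypothetical and makes two assumptions beyond the text: for a quoted phrase, the three-letter minimum is applied to the last word of the phrase, and the prefix length is counted in characters.

```python
def check_wildcard(term):
    """Return True if `term` has no wildcard, or has a single trailing *
    preceded by a prefix of at least three characters."""
    body = term.strip('"')          # drop surrounding quotes of a phrase
    if "*" not in body:
        return True                 # nothing to validate
    if not body.endswith("*") or body.count("*") != 1:
        return False                # wildcard is only allowed at the end
    words = body[:-1].split()
    prefix = words[-1] if words else ""
    return len(prefix) >= 3

for candidate in ["d*", "do*", "dog*", "excit*", '"health agen*"']:
    print(candidate, check_wildcard(candidate))
```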

Negation

By default, a query on dog will hit on both normal and negated mentions: "No dogs allowed" is about dogs. In some instances, you may want to restrict a token to match only negated or only nonnegated forms. For example, if you want to find hotels with pools, then "There was no pool" isn't a match. Prefixing a term with a + restricts it to only nonnegated matches, while a - restricts it to only negated matches. For example:
 +happy OR -sad
This query would hit the sentences "I am happy" and "I haven't been sad for a while", but would not hit "I'm not happy" or "I am sad."
The default is to match both negated and nonnegated if no + or - is specified. If you have a use case that requires mostly nonnegated matches you can set the following option instead of modifying every query token: SALIENCEOPTION_QUERYDEFAULTACCEPTSNEGATED.

Case-sensitivity

By default, query terms are handled in a case-insensitive manner. Case-sensitivity on a query term can be enforced using the ~ operator.
      ~Google NEAR/10 Microsoft
The above query will hit for the phrase "Both tech giants Microsoft and Google are investing heavily in mobile technologies" but will not hit for the phrase "who wins in search, microsoft bing or google?"
NOTE: The use of the tilde (~) operator in query syntax indicates case sensitivity. This differs from its use in other data directory files:
  • In pattern files, the tilde (~) operator enforces case insensitivity, not case sensitivity.
  • In HSD files for phrase-based sentiment, the tilde blacklists a phrase so it is not considered in sentiment calculations.
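
The effect of the ~ prefix in a query term can be pictured with a minimal Python comparison function. This is a sketch under simplifying assumptions: term_matches is a hypothetical name, it compares one query term with one document token, and it ignores stemming entirely.

```python
def term_matches(query_term, token):
    """Compare a single query term with a single document token."""
    if query_term.startswith("~"):
        # ~ prefix: the token must match with the same capitalization.
        return query_term[1:] == token
    # Default behavior: case-insensitive comparison.
    return query_term.lower() == token.lower()

print(term_matches("~Google", "Google"))
print(term_matches("~Google", "google"))
print(term_matches("Microsoft", "microsoft"))
```

In the example query above, only Google carries the ~ prefix, so microsoft in lowercase still matches while google does not.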

Stemming

By default, query terms are stemmed. As mentioned above, in the case of multi-word phrases, the last term in the phrase is the only term that is stemmed.
Stemming can be turned off for individual words via the stemwords.dat data file.
Stemming can be turned off for an entire query using the ! operator.
      !driving AND faster AND roads
NOTE: The use of the exclamation (!) operator in query topic definitions prevents stemming of query terms. This differs from its use in other data directory files: in pattern files, the exclamation (!) operator negates a pattern match component.

Globally, stemming can be turned off using the API option SALIENCEOPTION_QUERYTOPICSTEMMING.

Accents/Diacritics

If you want Salience to respect accent/diacritic differences in queries and content, so that the query "mere" does not hit the content "mère", turn the Ignore Accents option off.
If this option is left enabled (the default behavior), queries without any diacritics will match content that contains them. This can be particularly useful for social media, where speakers are often casual about the use of diacritics. If you include the accents in your query, Salience assumes you want them and will only match content with the same diacritics.

Hashtags/Mentions

Starting in Salience 6.2, hashtags and Twitter-style mentions are stemmed, e.g. #dogs is stemmed to #dog.
Additionally, starting in Salience 6.2, a query token will match hashtag and Twitter-style mentions of itself. The query:
	truck
will hit both @truck and #truck. Phrase searches do not have this behavior: querying on
	"truck"
only hits the literal phrase truck, not #truck or @truck. Finally, querying for the hashtag or mention form only finds that form:
	@truck
will not match #truck or truck, only @truck.

Stopwords

Stopwords, or words that will not be considered for query terms, can be added or removed through the stopwords.dat data file. Using a stopword as a query term generates an error when the query is used.

Query validation checklist

Salience will generate error messages on invalid query construction. To avoid a large number of the errors that can occur, use the following checklist to validate your queries.

  1. Query is in the format: query-title<tab>query-term(s)
  2. Query length less than 10000 characters
  3. Query cannot be empty
  4. Operators must be capitalized
  5. When specified, the window for a NEAR operator must be between 0 and 99
  6. Every operator must have a term on the left side and right side
  7. Parentheses and quotes must be balanced
  8. Query terms cannot contain invalid characters: ` ! @ # $ % ^ ( ) _ = ~ + [ ] { } | " ' : ; . , < > ? /
  9. The characters that are invalid for single query terms may be used in phrase terms, but must be escaped using \, with the exception of @ and #
  10. Query should not contain stopwords
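
Several items on this checklist can be checked mechanically before a query file is loaded. The Python sketch below is a hypothetical pre-validator covering items 1, 2, 3, 5, 6 (only at the start and end of the query), and the quote half of item 7. It is not the validation Salience performs. It assumes the length limit applies to the query text after the tab, and it does not check stopwords, invalid characters, or lowercase operators.

```python
import re

OPERATORS = {"OR", "AND", "NEAR", "WITH", "NOT", "NOTWITH", "NOTNEAR", "EXCLUDE"}

def validate(line):
    """Return a list of error messages for one 'query-title<TAB>query' line."""
    errors = []
    title, tab, query = line.partition("\t")
    if not tab or not title.strip():
        return ["expected query-title<tab>query-term(s)"]
    query = query.strip()
    if not query:
        return ["query is empty"]
    if len(query) >= 10000:
        errors.append("query must be shorter than 10000 characters")
    # NEAR/N windows must be within 0..99 (NOTNEAR is deliberately not matched).
    for n in re.findall(r"\bNEAR/(\d+)", query):
        if not 0 <= int(n) <= 99:
            errors.append(f"NEAR window {n} is outside 0..99")
    # Unescaped double quotes must come in pairs.
    if len(re.findall(r'(?<!\\)"', query)) % 2:
        errors.append("unbalanced double quotes")
    # Replace quoted phrases with a placeholder, then look at the first and last token.
    bare = re.sub(r'"(?:\\.|[^"\\])*"', "PHRASE", query)
    tokens = re.findall(r"[^\s()]+", bare)
    for i, tok in enumerate(tokens):
        if tok.split("/")[0] in OPERATORS and i in (0, len(tokens) - 1):
            errors.append(f"operator {tok} needs a term on both sides")
    return errors

print(validate("q1\tonions NEAR/50 cheese"))
print(validate("q2\tonions NEAR/100 cheese"))
print(validate("q3\tonions AND"))
```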