When we talk about "sentiment tuning" we're referring to the process of enhancing the out-of-the-box data files that drive sentiment analysis. These data files have been developed against a general spectrum of content, but the same phrase can take on a different sentiment orientation depending on the domain. For example, the terms "killer" and "sick" carry different sentiment in the healthcare industry than they do in relation to video games.
The main goal when tuning sentiment is to address detected issues with fixes that also help in a larger sense. Rather than applying spot fixes, aim for fixes that support the content domain, so that their cumulative benefit reduces the number of issues that need addressing over time.
We’ll start with a pure out-of-the-box installation and customize it to address issues noted in three sample tweets, explaining each step, how it helps align these tweets correctly, and how it helps in the overall sense.
If you are not already familiar with user directories and how they can be used within Salience, I recommend you read up on them before starting. For these tweets, I'm going to start with a folder, C:/CustomSentiment, as a place to put the content I am assessing and any custom files I create for tuning. Assuming I have a standard installation of Salience on Windows, I'll copy C:/Program Files/Lexalytics/data/user to C:/CustomSentiment/user. Throughout this process, I'll also refer to the Windows Salience Demo application, which I'm using to diagnose the sentiment phrases in the default configuration and to assess my tuning steps.
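If you would rather script the copy than do it by hand, here is a minimal sketch using Python's standard library. The helper name `copy_user_directory` is my own for illustration and is not part of Salience; the paths in the commented usage are the ones from this walkthrough and assume a standard Windows installation.

```python
import shutil
from pathlib import Path


def copy_user_directory(source: Path, destination: Path) -> Path:
    """Copy the default user directory tree to a separate working location.

    Hypothetical helper for illustration only; it refuses to overwrite an
    existing destination so earlier tuning work is not clobbered.
    """
    if destination.exists():
        raise FileExistsError(f"{destination} already exists; refusing to overwrite")
    # copytree creates any missing parent folders of the destination.
    shutil.copytree(source, destination)
    return destination


# Usage with the paths from this walkthrough (assumes a standard install):
# copy_user_directory(
#     Path("C:/Program Files/Lexalytics/data/user"),
#     Path("C:/CustomSentiment/user"),
# )
```

Working on a copy keeps the default data directory untouched, so it is always possible to compare tuned results against the out-of-the-box behavior.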
Let's take the example mentioned in the first paragraph. In general, we'd expect "killer" to have a negative connotation. But it can also be used as positive slang:
"The New ABC Corp Gaming laptop is a killer...."
CIOs: Educate and develop great IT workers with four killer coaching tips
In both cases, the default value of -0.3 for the phrase "killer" that exists in the default sentiment phrase dictionary (general.hsd) skews the analysis into negative sentiment. We can override the sentiment phrase by creating a new HSD file, and overlaying it on top of this default analysis.
Think of multiple HSD files as transparent layers, in which phrases that exist within the base HSD can be overridden by the same phrase with a different weight in a higher layer. If there is no override in a higher layer, the phrase and value from the lower layer are used.
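To make the layering idea concrete, here is a small conceptual sketch. It is not how Salience implements HSD loading internally; it only models the "higher layer wins" behavior with plain dictionaries. The -0.3 weight for "killer" is the default mentioned above, while the weight for "properly" and the positive override value are hypothetical numbers chosen for illustration.

```python
def resolve_layers(*layers: dict[str, float]) -> dict[str, float]:
    """Merge phrase weights so that later (higher) layers override earlier ones."""
    resolved: dict[str, float] = {}
    for layer in layers:
        # A phrase present in a higher layer replaces the lower-layer weight;
        # phrases absent from the higher layer keep their lower-layer value.
        resolved.update(layer)
    return resolved


# Base layer: "killer" uses the default weight quoted in this article;
# the "properly" weight is a made-up placeholder.
base_layer = {"killer": -0.3, "properly": 0.4}

# Custom layer: hypothetical positive override for the tech-tweet domain.
custom_layer = {"killer": 0.5}

effective = resolve_layers(base_layer, custom_layer)
print(effective)
```

Only "killer" changes in the effective view; "properly" falls through from the base layer because the custom layer says nothing about it.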
To adjust the sentiment for the phrase "killer", we need to create that additional layer to apply on top of the default sentiment analysis to tune it to our content.
Specify C:/CustomSentiment/user/salience/sentiment/customSentiment.hsd as an additional sentiment dictionary to use. This can also be done via API calls.
That's it. Now, the phrase "killer" has the positive sentiment that is appropriate in the content that we are processing. Note that this is only the appropriate sentiment within this particular content domain, specifically tech tweets, which are less likely to exhibit the negative usage of the same phrase. If you are analyzing a broader spectrum of content, the approach may differ to handle the variety of cases.
Let's look at three other specific tweets that require additional tuning steps.
“Love how I pay for @TechSupport & they still can't fix my computer properly. And yet can't find a solution.”
Initial document sentiment: 0.713
The result is driven by a single phrase found in the out-of-the-box sentiment dictionaries, “properly”, which has positive sentiment. In general that would be true. What we see in this tweet, however, are phrases that denote actions that would be positive in this domain (“fix my computer”, “find a solution”) but should be negated in this case because they didn’t happen, and that negation is what should drive the sentiment.
There are two actions that we need to take, and these are actions that will help later down the line as we continue tuning. First, we'll add more phrases to our custom sentiment phrase dictionary to capture sentiment in this domain of content.
fix my computer<tab>0.6
find a solution<tab>0.6
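The entries above are simply a phrase, a tab character, and a weight on each line. As a rough sketch of generating and checking such lines programmatically, the snippet below formats and re-parses them. The function names are my own and not part of any Salience API, and the parser only handles the plain phrase/weight form shown here.

```python
def format_hsd_entries(entries: dict[str, float]) -> str:
    """Render phrase/weight pairs as tab-delimited lines, one phrase per line."""
    return "".join(f"{phrase}\t{weight}\n" for phrase, weight in entries.items())


def parse_hsd_entries(text: str) -> dict[str, float]:
    """Read back tab-delimited phrase/weight lines, skipping blank lines."""
    entries: dict[str, float] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the last tab so the phrase itself may contain spaces.
        phrase, weight = line.rsplit("\t", 1)
        entries[phrase.strip()] = float(weight)
    return entries


# The two domain phrases and weights from this walkthrough.
domain_phrases = {"fix my computer": 0.6, "find a solution": 0.6}
hsd_text = format_hsd_entries(domain_phrases)
print(hsd_text, end="")

# To append to the custom dictionary used in this article (path from above):
# with open("C:/CustomSentiment/user/salience/sentiment/customSentiment.hsd",
#           "a", encoding="utf-8") as handle:
#     handle.write(hsd_text)
```

Round-tripping the text through the parser is a cheap way to catch a space typed where a tab was intended, which is an easy mistake to make in a text editor.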
There is an additional step that we need to take before these phrases become effective, because they contain parts of speech that are not considered part of sentiment-bearing phrases out of the box. To customize the patterns for a sentiment-bearing phrase, we need to override a data file called hsd.ptn. This should be done in a user directory, not the default data directory. Having a user directory will also help with further tuning later, since it gives us a place to make deeper adjustments for this customer or domain.
Specify C:/CustomSentiment/user as the user directory for my session. This can also be done via API calls.
That may seem like a lot of changes just to correct a single tweet. However, that’s mainly because the process has been broken down into individual steps for clarity, and it also lays the groundwork for everything to come.
Following these changes, the sentiment assessed for the tweet aligns to the negative expectation, and for appropriate reasons.
I did not make any adjustments to the phrase “properly”. Ideally we would want that to be negated or neutralized. However, I was able to achieve the expected sentiment through the application of domain-specific phrases, and we may encounter “properly” in other cases that could be thrown off by trying to adjust it here.
“@TechSupport yeah, i know you can't help me. just wish i wasn't sold a poc in the first place.”
Initial document sentiment: 0.7
As with the first tweet, all of the sentiment is being driven by a single phrase, “first place”, which in the general domain can be considered to convey positive sentiment. In this domain and in this context, it means nothing. We also see another case of an action that is expected to have positive sentiment (“help me”) but should be negated in this case.
There are two actions that we need to take, and they align with the actions taken previously. First, we want to ignore “first place” as a sentiment phrase.
The tilde indicates that this is a phrase to specifically ignore for sentiment. This is different from applying a sentiment weight of 0, which would still include the phrase in the calculation of sentiment and thus have a neutralizing effect on the sentiment found in other phrases.
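The difference between ignoring a phrase and weighting it at 0 is easier to see with numbers. The sketch below assumes, purely for illustration, that document sentiment is a simple average of the detected phrase weights; Salience's actual calculation is not described in this article and may well differ. The -0.6 value for a negated “help me” is also a hypothetical figure, and the ignore marker is modeled as `None` rather than as any particular file syntax.

```python
from typing import Optional

IGNORE = None  # stand-in for "skip this phrase entirely"


def mean_sentiment(detected: list[str], weights: dict[str, Optional[float]]) -> float:
    """Average the weights of detected phrases, leaving out ignored ones.

    Simple averaging is an assumption made for this sketch only.
    """
    scores = [
        weights[phrase]
        for phrase in detected
        if phrase in weights and weights[phrase] is not IGNORE
    ]
    return sum(scores) / len(scores) if scores else 0.0


detected = ["first place", "help me"]

# Weight of 0: the phrase still counts and dilutes the other phrase.
zero_weighted = {"first place": 0.0, "help me": -0.6}

# Ignored: the phrase is left out of the calculation altogether.
ignored = {"first place": IGNORE, "help me": -0.6}

print(mean_sentiment(detected, zero_weighted))
print(mean_sentiment(detected, ignored))
```

Under this toy averaging, the zero-weighted version pulls the result halfway toward neutral, while the ignored version leaves the negative phrase at full strength, which is the behavior we want for “first place” here.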
The phrase “help me” is also one that needs an adjustment to the standard HSD patterns in order for it to be detected.
Following these changes, the sentiment assessed for the tweet aligns to the negative expectation, and again for appropriate reasons that address the domain.
We could take the additional step of adding “sold a poc” as a negative phrase, assuming that what the author actually meant was “sold a pos”. However, that is a spot fix that has a low likelihood of benefiting overall sentiment analysis outside of this single occurrence.
“I put in 3 damn genuine Ink Cartriges.. the first 2 "Geniune" - the 3rd "Counterfit" but after I hit OK "SAID GENUINE" WTF”
Initial document sentiment: 0.4
Sentiment is being driven by “OK” and “genuine”, but the real sentiment is in the emphatic “damn” and in “WTF”, the latter of which is not detected as a sentiment phrase.
We want to ensure that “damn” has the correct alignment, which can be really tricky. The word can be used to intensify something that is good or bad, or as a negation of something good. I chose to make it negative, under the assumption that it’s going to be more likely to denote negative sentiment in this domain. I also want to ensure that WTF is a negative sentiment-bearing phrase.
The problem is, what part of speech is WTF? Out of the possibilities, I would consider it an “interjection”, which has a POS tag of UH. But looking at my POS-tagged text output, the out-of-the-box data files consider it a proper noun because of the capitalization. I want to force the POS tagging on this token, and ensure it’s included as a possible sentiment-bearing phrase type.
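As a conceptual illustration of what forcing a tag accomplishes, here is a toy sketch. It does not use Salience or reproduce its data file formats: the tagged tokens are hand-written rather than real tagger output, the override table is a plain dictionary, and the set of tags treated as sentiment-bearing is hypothetical. The only details taken from this article are that WTF comes back as a proper noun (NNP in Penn Treebank notation) and that the desired tag is UH.

```python
def apply_pos_overrides(
    tagged: list[tuple[str, str]], overrides: dict[str, str]
) -> list[tuple[str, str]]:
    """Replace the tag of any token listed in the override table."""
    return [(token, overrides.get(token, tag)) for token, tag in tagged]


# Hand-written toy tagging of the end of the tweet (not real tagger output).
tagged = [("SAID", "VBD"), ("GENUINE", "JJ"), ("WTF", "NNP")]

# Force WTF to be treated as an interjection.
overrides = {"WTF": "UH"}

# Hypothetical set of tags allowed to form sentiment-bearing phrases.
sentiment_bearing_tags = {"JJ", "UH"}

retagged = apply_pos_overrides(tagged, overrides)
candidates = [token for token, tag in retagged if tag in sentiment_bearing_tags]
print(retagged)
print(candidates)
```

Both halves matter: retagging alone does nothing if interjections are not allowed as sentiment-bearing phrases, and allowing interjections does nothing while the token is still tagged as a proper noun.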
With these changes in place, we’re detecting more of the negative sentiment. Again, we’re not changing the meaning of “genuine”, and we’re not going to chase “counterfit”, which is a misspelling. If that misspelling is very common in the content being analyzed, it could be added as an additional negative sentiment phrase. You could add further customizations to decrease the positive weight of “OK” or “genuine”, but that again depends on what is seen more commonly in this domain and how much weight these phrases should carry.
- The customization of hsd.ptn allows for a broader range of phrases to be identified as possibly sentiment-bearing. More importantly, it brings in parts of speech that are more self-directed, through the use of personal and possessive pronouns, which are more likely to occur in this type of content.
- The customized HSD starts to address the things that are positive or (when negated) negative in this specific domain, actions that are relevant to a customer in this domain.
- There are additional customizations to sentiment phrases that could be made to address specific misspellings seen in these tweets, but these are more likely to be spot fixes, not globally beneficial.
- There are additional customizations to sentiment that could be made to fix these specific cases, but could lead to false results in other cases.
- The customization to customlexicon.dat allows for tailoring of language to the domain.