Using Salience via PowerShell (part 3): Text Files

August 5th, 2011 by Matt King

Today’s assignment: Convert some docx files to txt and then time how long it takes to process them, getting document sentiment and entities. Use PowerShell.

So first, let's convert the Word documents to text files:

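# Start Word through COM (assuming $word is not already set up in this session).
$word = New-Object -ComObject Word.Application

function Save-AsText($fn) {
  $doc = $word.Documents.Open($fn.FullName)
  # Swap only the extension, so a 'docx' elsewhere in the path is left alone.
  $txtName = [System.IO.Path]::ChangeExtension($fn.FullName, 'txt')
  $doc.SaveAs([ref] $txtName, [ref] 2)   # 2 = wdFormatText
  $doc.Close()
  echo $txtName
}

$c = Get-ChildItem -Recurse -Include *.docx
foreach ($fn in $c) {
    # PowerShell functions take space-separated arguments, not parentheses.
    Save-AsText $fn
}

# Shut down the background Word process when done.
$word.Quit()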
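
A note on deriving the .txt name: a plain string replace rewrites every occurrence of 'docx' in the path, not just the extension. Here is a minimal sketch of the difference, using a made-up path (C:\docx-archive\report.docx is hypothetical and does not need to exist):

```powershell
# Hypothetical path, chosen because 'docx' also appears in a folder name.
$path = 'C:\docx-archive\report.docx'

# String.Replace rewrites every occurrence of 'docx', including the folder.
$viaReplace = $path.Replace('docx', 'txt')

# ChangeExtension only touches the part after the last dot.
$viaChangeExtension = [System.IO.Path]::ChangeExtension($path, 'txt')

Write-Host $viaReplace
Write-Host $viaChangeExtension
```

If none of your folder or file names contain 'docx', both approaches give the same result.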

Now that we’ve got our text files, we can use Measure-Command to time each document and Measure-Object to summarize the timings:

Add-Type -Path 'C:\Program Files\Lexalytics\Salience\bin\SalienceEngineFour.NET.dll'
$se = New-Object Lexalytics.SalienceEngine(
             'C:\Program Files\Lexalytics\license.dat',
             'C:\Program Files\Lexalytics\data')
$timings = @()
$c = Get-ChildItem -Recurse -Include *.txt
foreach ($fn in $c) {
   $cnt = 0
   $s = 0
   # Measure-Command returns a TimeSpan covering the whole script block.
   $m = Measure-Command {
     $rc = $se.PrepareTextFromFile($fn.FullName)
     if ($rc -eq 0) {
       $cnt = ($se.GetEntities(0, 0, 0, 0, 50, 5) | Measure-Object).Count
       $s = $se.GetDocumentSentiment(0).fScore
     }
   }
   # Measure-Command discards output from its script block, so report
   # failures out here and leave them out of the timings.
   if ($rc -ne 0) {
     Write-Host "Failed to prepare text with code $rc on $fn"
     continue
   }
   $timings += $m.TotalMilliseconds
   Write-Host $fn $cnt $s $m.TotalMilliseconds
}

$timings | Measure-Object -Minimum -Maximum -Average -Sum

And you’ll end up with a summary at the end like this:

Count    : 100
Average  : 511.2
Sum      : 51120
Maximum  : 999
Minimum  : 63

That works out to an average of about 511 milliseconds per document across the 100 documents processed.
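
If you want to try the timing pattern without Salience installed, here is a minimal sketch that uses Start-Sleep as a stand-in workload. The delays are made up for illustration, so the numbers you see say nothing about Salience itself:

```powershell
# Stand-in workload: Start-Sleep takes the place of the Salience calls.
# The delay values are arbitrary and only for illustration.
$delays = 20, 40, 60
$timings = @()

foreach ($d in $delays) {
    # Measure-Command returns a TimeSpan for the script block.
    $m = Measure-Command { Start-Sleep -Milliseconds $d }
    $timings += $m.TotalMilliseconds
}

# Same summary call as in the post.
$stats = $timings | Measure-Object -Minimum -Maximum -Average -Sum
$stats

# Average is just Sum / Count, the same way 51120 / 100 gives 511.2.
$stats.Sum / $stats.Count
```

Exact timings will vary from run to run, which is one reason to look at the minimum and maximum alongside the average.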
