Tuesday, January 17, 2017

powershell: filtering for unique lines of csv

Scenario: You have a large csv file (several hundred MB or more) of username-to-computername mappings.  The data contains a lot of duplicates because it represents activity over a period of time.  The rows are already sorted by time, so how do you get the most recent activity per computer while ignoring the rest?
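For illustration, everything below assumes the file looks something like this (made-up names; first column is the username, second is the computername, newest rows first so the first row seen for a computer is its most recent activity):

username,computername
jdoe,WS-0042
asmith,WS-0017
jdoe,WS-0042
rbrown,WS-0017

The goal is one output row per computername: WS-0042 maps to jdoe, WS-0017 to asmith.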

Pipeline method with commandlets:

import-csv .\data.csv | select-object -Unique computername | ConvertTo-Csv -NoTypeInformation | out-file .\filtered-data.csv

This ran for hours, hit several hundred MB of RAM usage, and eventually had to be cancelled because it was taking too long.  The catch with the unique filtering on select is that every line had to go through a full csv conversion just to expose the attribute I wanted to filter on.
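In hindsight, a middle ground would be to keep the object pipeline but swap Select-Object -Unique for a hashtable lookup, which also keeps the whole row (username included) rather than just the computername.  A sketch I haven't benchmarked, using the same file and column names as above:

$seen = @{}
import-csv .\data.csv | foreach-object {
  if (-not $seen.ContainsKey($_.computername)) {
    $seen[$_.computername] = $true
    $_  # first (most recent) row seen for this computer; pass it down the pipeline
  }
} | export-csv .\filtered-data.csv -NoTypeInformation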


Hackish method with hash table:

# hash table keyed by computername; the value is the username from the first row seen for that computer
$ht = new-object hashtable

function selective-add {
  [CmdletBinding()]
  param ( [Parameter(Mandatory=$True,ValueFromPipeline=$True)]$line )
  process {
    # split the raw csv line: [0] = username, [1] = computername
    $data = $line.split(',')
    # first occurrence per computername wins; later duplicates are dropped
    if (-not $ht.Contains($data[1])) {
      $ht.Add($data[1], $data[0])
    }
  }
}

# skip the header row so it doesn't end up in the hash table as data
get-content .\data.csv | select-object -skip 1 | selective-add
# note: output columns come out as computername,username (key first, no header)
$ht.keys | % { add-content -path filtered-data.csv -value $("{0},{1}" -f $_, $ht.Item($_)) }

This only took about 15 minutes; however, it sucked up twice as much RAM as the previous method in a very short period of time.
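If the RAM had been the dealbreaker, the next thing I would try (untested sketch, same made-up file names and column order as above) is streaming the file with .NET instead of get-content, and writing the output once at the end instead of one add-content per key:

$ht = @{}
# ReadLines wants a full path; PowerShell's current location isn't always the process's working directory
$path = (Resolve-Path .\data.csv).Path
$first = $true
foreach ($line in [System.IO.File]::ReadLines($path)) {
  if ($first) { $first = $false; continue }  # skip the header row
  $data = $line.Split(',')
  if (-not $ht.ContainsKey($data[1])) {
    $ht[$data[1]] = $data[0]
  }
}
# build all the output lines in memory, then write the file in one shot
$lines = foreach ($key in $ht.Keys) { '{0},{1}' -f $key, $ht[$key] }
set-content -path .\filtered-data.csv -value $lines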