Tags

,

The Problem

I am trying to write and test R script against some data from customer. But the data is too big, it would take a lot of time to load the data and run the script. So it would be to extract a small fraction from the original data.


The Solution
First extract the first line from the original csv file, write to destination file.
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt


Notice that by default Out-File cmdlet or redirection command >> uses system default encoding when write to a file. Most application by default uses utf-8 or utf-16 to read data. Hence we use -Encoding utf8 here.


Then we randomly select 1000 lines from all lines except the first line.
Then we read all lines except the first line: Get-Content big.csv | where {$_.readcount -gt 1 }


Then randomly select 1000 lines and append them to the destination file.
Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt


The Complete Script
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt; Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt


Related Script: Get default system encoding
[System.Text.Encoding]::Default
[System.Text.Encoding]::Default.EncodingName


Resource
PSTip: Get-Random
Get-Random Cmdlet

via Blogger

Advertisements