I am trying to export approximately 100 million indexed events to txt files, but I'm having a hard time figuring out how to do so. Since I can't get the Elasticsearch input plugin to work in Logstash (secured cluster; open ticket on GitHub and in the Logstash forums), I am using PowerShell to connect to my cluster and pull the data. I've figured out the query I need to run to pull just what I want, but I am stuck at only being able to pull 10,000 events.
How do I get all of what I want OR is there a better way to accomplish this?
I figured out how to do this with the scroll API; however, at 100 million records it's going to take almost 4 straight days to pull it all out. I tried using the slice parameter in my query to break it up and run 16 exports in parallel, but it throws an error saying it's not a legitimate parameter. Below are the PowerShell bits I am using... any ideas on how to incorporate slice, or what else I can do to speed this up?
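For reference, my understanding from the sliced scroll docs is that the slice object goes in the JSON body of the initial search (the one that sets scroll=10m), not in the URL and not in the scroll continuation calls. Here is roughly what I think that body should look like against my same range query (slice 0 of 16 as placeholder values), though I haven't gotten it working myself yet:
#Initial query body for one slice (id 0 of 16); each id would be its own export with its own scroll context
$query = '{ "slice": { "id": 0, "max": 16 }, "sort": [ "_doc" ], "query" : { "range" : { "edge.timestamp_start": { "gte": "1559329200", "lte": "1561939200" } } } }'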
If anybody ends up using the full script below for their own setup, you'll need to modify a few lines, especially if you are not using SSL on the API.
#Force PowerShell to use TLS 1.2
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
#Get username/password for kibana user
$cred = Get-Credential
#Specify index to query - wildcard ending implied
$is = Read-Host "Enter index to search"
#Set base elasticsearch URL
$esbase = "https://FQDN:9200"
#Set headers
$Options = @{
    Headers     = @{
        'kbn-version' = "7.2.0"
    }
    ContentType = "application/json"
    Method      = "POST"
}
#This is just an example query; it filters on a date field whose values are in epoch seconds.
$query = '{ "sort": [ "_doc" ], "query" : { "range" : { "edge.timestamp_start": { "gte": "1559329200", "lte": "1561939200" } } } }'
#Initial query specifying parameters and index to search
$isr = Invoke-RestMethod @Options -Uri "$esbase/$is*/_search?scroll=10m&size=10000" -Body $query -Credential $cred -Verbose
#Start loop to process the initial 10k results and subsequent results.
do {
    #Grab the current time to be used in the filename
    $date = Get-Date -Format MMddyyyy_hhmmss
    #Start loop to dump results into a text file
    foreach ($hit in $isr.hits.hits) {
        #Narrow file output data to only the raw message field. In my case, it's already JSON formatted.
        $out = $hit._source.message
        #Output the message field to a text file, appending the existing file so as not to create 10k tiny files.
        $out | Out-File "C:\Temp\June$date.txt" -Encoding utf8 -Append
    }
    #Compress the resulting txt file. In my use case, each txt file was 16MB; compressed, it was 1.5MB.
    Compress-Archive -Path "C:\Temp\June$date.txt" -Update -DestinationPath "C:\Temp\June$date.zip" -CompressionLevel Fastest
    #Remove the original txt file after compression is complete.
    Remove-Item "C:\Temp\June$date.txt" -Confirm:$false
    #Grab the scroll id from the previous successful query to feed into the next query.
    $sid = $isr._scroll_id
    #Set the scroll keep-alive to 5 minutes and insert the scroll id.
    $query = '{ "scroll" : "5m", "scroll_id" : "' + $sid + '" }'
    #Gather the next 10k results
    $isr = Invoke-RestMethod @Options -Uri "$esbase/_search/scroll" -Body $query -Credential $cred -Verbose
#Continue the loop until the last query returns 0 results.
} until ($isr.hits.hits.count -eq 0)
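And here is a rough, untested sketch of how the sliced version might be run in parallel with PowerShell background jobs. The slice count of 16, the per-slice file naming, and the assumption that $cred can be passed into a local job are mine; the query, headers, and scroll loop are lifted from the script above (compression omitted for brevity).
#Each slice is an independent scroll context; the slice object only goes in the body of the initial search.
$maxSlices = 16
$jobs = foreach ($sliceId in 0..($maxSlices - 1)) {
    Start-Job -ArgumentList $esbase, $is, $cred, $sliceId, $maxSlices -ScriptBlock {
        param($esbase, $is, $cred, $sliceId, $maxSlices)
        #The TLS setting and request options do not carry into the job process, so set them again here.
        [Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
        $Options = @{
            Headers     = @{ 'kbn-version' = "7.2.0" }
            ContentType = "application/json"
            Method      = "POST"
        }
        #Initial sliced search; after this it is the same scroll loop as the script above.
        $query = '{ "slice": { "id": ' + $sliceId + ', "max": ' + $maxSlices + ' }, "sort": [ "_doc" ], "query" : { "range" : { "edge.timestamp_start": { "gte": "1559329200", "lte": "1561939200" } } } }'
        $isr = Invoke-RestMethod @Options -Uri "$esbase/$is*/_search?scroll=10m&size=10000" -Body $query -Credential $cred
        do {
            $date = Get-Date -Format MMddyyyy_hhmmss
            foreach ($hit in $isr.hits.hits) {
                #Same message-only output as above, but one file per slice per batch
                $hit._source.message | Out-File "C:\Temp\June_slice${sliceId}_$date.txt" -Encoding utf8 -Append
            }
            $sid = $isr._scroll_id
            $query = '{ "scroll" : "5m", "scroll_id" : "' + $sid + '" }'
            $isr = Invoke-RestMethod @Options -Uri "$esbase/_search/scroll" -Body $query -Credential $cred
        } until ($isr.hits.hits.count -eq 0)
    }
}
#Wait for all 16 slices to finish, then clean up the job objects.
$jobs | Wait-Job | Remove-Job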
You would think it would, but it throws an error that I've posted a topic about in the Logstash forums as well as on the plugin's GitHub page; neither has received any attention.