Exporting Indexed Data

I am trying to export approximately 100 million indexed events to txt files but am having a hard time finding out how to do so. Since I can't get the Elasticsearch input plugin to work in Logstash (secured cluster, open tickets on GitHub and in the Logstash forums), I am using PowerShell to connect to my cluster and pull the data. I've figured out the query I need to run to pull just what I want, but I am stuck at only being able to pull 10,000 events.
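For context, a single search request is capped by the index's max_result_window setting, which defaults to 10,000, which is why a plain query stops at 10k hits. A minimal illustration of the cap, assuming the same $Options, $esbase, $is, and $cred variables defined in the script further down:

#Requesting more than 10,000 hits in one page is rejected with a
#"Result window is too large" error unless max_result_window is raised,
#hence the scroll API approach below.
Invoke-RestMethod @Options -Uri "$esbase/$is*/_search?size=20000" -Credential $cred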

How do I get all of what I want OR is there a better way to accomplish this?

I figured out how to do this with the scroll API; however, at 100 million records, it's going to take almost 4 straight days to pull it all out. I tried using the slice parameter in my query to break it up and run 16 exports in parallel, but it throws an error saying it's not a legitimate parameter. Below are the PowerShell bits I am using...any ideas on how to incorporate slice (a sketch follows the script below) or what else I can do to speed this up?

If anybody ends up using this for their own setup, you'll need to modify a few lines, especially if you are not using SSL on the API.

#Force PowerShell to use TLS 1.2
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::TLS12

#Get username/password for kibana user
$cred = Get-Credential

#Specify index to query - wildcard ending implied
$is = Read-Host "Enter index to search"

#Set base elasticsearch URL
$esbase = "https://FQDN:9200"

#Set headers
$Options = @{
  headers = @{
    'kbn-version' = "7.2.0"
  }
  ContentType = "application/json"
  Method = "POST"
}
#This can be used as an example; it queries a date field whose values are in epoch seconds.
$query = '{ "sort": [ "_doc" ], "query" : { "range" : { "edge.timestamp_start": { "gte": "1559329200", "lte": "1561939200" } } } }'

#Initial query specifying parameters and index to search
$isr = Invoke-RestMethod @Options -Uri "$esbase/$is*/_search?scroll=10m&size=10000" -Body $query -Credential $cred -Verbose

#Start loop to process the initial 10k results and subsequent results.
do {

  #Grab the current time (24-hour clock) to be used in the filename
  $date = Get-Date -Format MMddyyyy_HHmmss

  #Start loop to dump results into a text file
  foreach ($hit in $isr.hits.hits) {

    #Narrow file output data to only the raw message field.  In my case, it's already JSON formatted.
    $out = $hit._source.message

    #Output message field to a text file, appending to the existing file so as not to create 10k tiny files.
    $out | Out-File C:\Temp\June$Date.txt -Encoding utf8 -Append
  }

  #Compress the resulting txt file.  In my use case, each txt file was 16MB; compressed, it was 1.5MB.
  Compress-Archive -Path C:\Temp\June$Date.txt -Update -DestinationPath C:\Temp\June$Date.zip -CompressionLevel Fastest

  #Remove original txt file after compression is complete.
  Remove-Item C:\Temp\June$Date.txt -Confirm:$false

  #Grab the scroll id from the previous successful query to feed into the next query.
  $sid = $isr._scroll_id

  #Set the scroll keep-alive to 5 minutes and insert the scroll id
  $query = ' {"scroll" : "5m", "scroll_id" : "'+$sid+'" }'
  
  #Gather the next 10k results
  $isr = Invoke-RestMethod @Options -Uri "$esbase/_search/scroll" -Body $query -Credential $cred -Verbose

#Continue loop until the last query returns 0 results.
} until ($isr.hits.hits.count -eq 0)
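On the slice question: sliced scroll is a body-level parameter of the initial search request rather than a URL parameter, which may be why it was rejected. A minimal sketch of one slice's initial request, reusing the variables above; the slice id of 0 and max of 16 are illustrative, and each of the 16 parallel exports would use its own id from 0 to 15 while the scroll-continuation loop stays the same:

#Hypothetical slice id for this worker; each parallel export gets a different id (0-15).
$sliceId = 0
$maxSlices = 16

#slice sits alongside sort/query in the body of the initial search only
$query = '{ "slice": { "id": ' + $sliceId + ', "max": ' + $maxSlices + ' }, "sort": [ "_doc" ], "query" : { "range" : { "edge.timestamp_start": { "gte": "1559329200", "lte": "1561939200" } } } }'

#Initial sliced query; the do/until loop above can then be reused unchanged for this slice.
$isr = Invoke-RestMethod @Options -Uri "$esbase/$is*/_search?scroll=10m&size=10000" -Body $query -Credential $cred -Verbose

Each slice could then run in its own PowerShell window or background job so the 16 exports proceed in parallel; including the slice id in the output filename would keep the files from colliding.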

You could try elasticdump with the bulk option; I use it quite often: https://www.npmjs.com/package/elasticdump
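For reference, a rough sketch of the kind of invocation that would match this use case, run from a shell or PowerShell prompt; the index name, credentials, and output path are placeholders, and the exact flags available can vary between elasticdump versions:

elasticdump --input=https://user:password@FQDN:9200/myindex --output=C:\Temp\myindex.json --type=data --limit=10000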

Hi,
The latest Logstash documentation says "The Logstash Elasticsearch plugins (output, input, filter and monitoring) support authentication and encryption over HTTP."
Wouldn't that work with your secured Elasticsearch?

You would think it would, but it throws an error that I've posted a topic about in the Logstash forums as well as on the plugin's GitHub page; neither has received any attention.

I saw that; unfortunately, I run on a Windows platform and it looks like elasticdump is designed for Linux.
