I'm trying to import some very large JSON files (up to 80 GB per file) into Elasticsearch and have tried a couple of different approaches, but neither gives me an efficient working solution:
- Using the Bulk API - I received heap memory errors (even with export ES_HEAP_SIZE=4g), so I decided to write a bash script to break the JSON file up before sending it to Elasticsearch; see the marked solution here for more info. This works, but it is extremely slow for files this large. To continue with this approach I'd need to improve the bash script or write a program that splits the files more efficiently (a rough sketch of the chunking logic I have in mind is at the end of this post).
- Using Logstash - I have also tried indexing/updating via Logstash, with export LS_HEAP_SIZE=4g. It works for my medium-sized files (~4 GB) but not for the larger (80 GB) ones. When I try to send the largest files I get this error:
HTTP content length exceeded 104857600 bytes
My .conf is simply:
input {
  stdin {
    type => "stdin-type"
  }
  file {
    path => ["C:/path/file.json"]
    start_position => "beginning"
  }
}
filter {
  json {
    source => "message"
  }
}
output {
  elasticsearch {
    hosts => "localhost"
    index => "indexname"
    document_type => "subject"
    document_id => "%{id}"
    action => "update"
  }
}
Prior to running this .conf for the large files, I run a .conf that indexes every subject's id using action => "index". For all files I've aggregated the documents under their appropriate id; for the largest files there can be thousands of documents under a given id.
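To make that concrete, here is roughly what one id's entry would look like if the updates were written as bulk actions instead. This is a schematic Python sketch only: the field name "events" and the shape of the aggregated documents are made up for illustration, while the index, type, and id settings mirror the .conf above.

# Schematic sketch only: emit Bulk API update actions for documents that
# have been aggregated under a subject id. The field name "events" and the
# shape of the aggregated data are invented purely for illustration.
import json

def bulk_update_lines(aggregated, index="indexname", doc_type="subject"):
    # aggregated: dict mapping a subject id -> list of its documents
    for subject_id, docs in aggregated.items():
        # action/metadata line, mirroring document_id => "%{id}" and
        # action => "update" in the Logstash output above
        yield json.dumps({"update": {"_index": index, "_type": doc_type,
                                     "_id": subject_id}}) + "\n"
        # partial document to merge into the existing subject
        yield json.dumps({"doc": {"events": docs}}) + "\n"

# made-up usage example
sample = {"123": [{"value": 1}, {"value": 2}]}
for line in bulk_update_lines(sample):
    print(line, end="")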
From what I'm seeing online, it seems unrealistic to try to import a single ~80 GB file in one go. Can I use the filters in Logstash to break the file up into smaller chunks prior to indexing/updating? If not, could you suggest a way to improve my bash script, or a program to use for breaking these files up efficiently? One last note: I'm using Windows.
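For reference, the chunking logic I have in mind for the Bulk API route (first bullet above) looks roughly like this. It's a minimal Python sketch rather than my actual bash script, and it assumes the file is already in newline-delimited bulk format with every action line followed by a document line; the path, host, and chunk size are placeholders.

# Minimal sketch: stream a bulk-format (NDJSON) file and POST it to the
# Bulk API in chunks that stay under the size limit from the error above
# (104857600 bytes). Assumes action/metadata and document lines are already
# paired in the file; the path, host, and chunk size are placeholders.
import requests

BULK_URL = "http://localhost:9200/_bulk"
MAX_CHUNK_BYTES = 50 * 1024 * 1024  # stay well under ~100 MB per request
HEADERS = {"Content-Type": "application/x-ndjson"}

def send_chunk(lines):
    if not lines:
        return
    body = "".join(lines)  # each line already ends with "\n"
    resp = requests.post(BULK_URL, data=body.encode("utf-8"), headers=HEADERS)
    resp.raise_for_status()

def bulk_upload(path):
    chunk, size = [], 0
    with open(path, "r", encoding="utf-8") as f:
        for action_line in f:
            doc_line = next(f, "")  # assumes every action has a document line
            pair_size = len(action_line.encode("utf-8")) + len(doc_line.encode("utf-8"))
            if chunk and size + pair_size > MAX_CHUNK_BYTES:
                send_chunk(chunk)
                chunk, size = [], 0
            chunk.extend([action_line, doc_line])
            size += pair_size
    send_chunk(chunk)  # flush the final partial chunk

bulk_upload("C:/path/file_bulk.json")

The byte cap (rather than a fixed line count) is meant to keep each request under the limit even when individual documents vary a lot in size.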