How can I send a large JSON file (6 GB) to Elasticsearch using the Bulk API?

Hello everybody!

I have been having problems for a few days now when trying to send a large JSON file (approx. 6 GB) to Elasticsearch using the Bulk API. Before posting this question I did a lot of reading and saw that there are two ways to send data to Elasticsearch: the Bulk API or Logstash. In fact, Logstash uses the Bulk API behind the scenes. I know that when you want to send large files to Elasticsearch you have to take the HTTP limitation into consideration, which is approx. 2 GB, because the data is first loaded into memory and then sent to Elasticsearch. Consequently, I split the large JSON file into smaller files of 350 MB (100,000 lines) each, using:
split -l 100000 -a 6 large_data.json /home/.../Chunk_Files.
Afterwards, I tried to send each one using the curl command:
curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@Chunk_Filesaaaaaa.json", but I get nothing in the terminal (no success, no error), and nothing appears in Elasticsearch either. I have to mention that my file contains 100,000 lines of this form: {"_index":"filename-log-2020.04","_type":"logevent","_id":"blabla","_score":1,"_source":....
If you have any idea where I am going wrong, or can suggest other alternatives that may work, I would be very happy! I am sure there are experienced people here who know a solution to my problem.

Thanks in advance!

The limit in Elasticsearch is 100 MB, but the generally recommended bulk size is in the single-digit MB range, so I would recommend you try a couple of thousand documents per bulk request rather than 100k. Also make sure your file follows the bulk API format. The format you showed does not seem to be the correct one.
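For reference, the bulk request body is newline-delimited JSON where each document takes two lines: an action/metadata line, then the document source on its own line. A minimal sketch, reusing the index and id from the line you showed (the second line stands in for whatever is inside _source; field1/field2 are just placeholders):

{ "index" : { "_index" : "filename-log-2020.04", "_id" : "blabla" } }
{ "field1" : "value1", "field2" : "value2" }

Parameters such as _score are not accepted in the action line, and the request body has to end with a newline.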

Hi @Christian_Dahlqvist!

Thank you for your reply! With the bulk size you recommend, my large JSON file would turn into a total of about 683 files of roughly 9 MB each, which is quite a lot. I don't want to imagine a 50 GB file (there would be many more chunks and I would need a script to send them all). Do you maybe know another method that would be more elegant and simpler? Maybe using Logstash?
As regards the Bulk API format, I had a look at the link you posted and added { "index" : before the statement I posted above. So now I send lines which begin with:
{ "index" : {"_index":"filename-log 2020.04","_type":"logevent","_id":"blabla","_score":1,"_source":...., but I get the following error:
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Action/metadata line [1] contains an unknown parameter [_score]"}],"type":"illegal_argument_exception","reason":"Action/metadata line [1] contains an unknown parameter [_score]"},"status":400.
Could you please tell me what changes I should make in order for this to work? It is strange, because these are logs from an index pattern, and I believed I could send them back without any changes.

The data has to follow the bulk format exactly, which includes a correctly formatted header line and the source on a separate line. This will require a script or the use of e.g. Logstash. Logstash will also insert your data in parallel, so it is probably the easiest way forward.
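If you do end up scripting it, here is a minimal sketch of the idea, assuming each line of your export is one hit with _index, _id and _source fields as in the lines you posted, and that jq and curl are available (hits.json, the batch size and the bulk_chunk_ prefix are just placeholder names):

#!/bin/bash
# Assumes each line of hits.json looks like
# {"_index":"...","_type":"...","_id":"...","_score":1,"_source":{...}}
INPUT=hits.json
BATCH=2000   # documents per bulk request

# Emit two lines per hit: the action/metadata line and the raw _source.
jq -c '{index: {_index: ._index, _id: ._id}}, ._source' "$INPUT" > bulk_body.ndjson

# Each document is exactly two lines, so split on an even line count.
split -l $((BATCH * 2)) -a 6 bulk_body.ndjson bulk_chunk_

for f in bulk_chunk_*; do
  curl -s -H "Content-Type: application/x-ndjson" \
       -XPOST "localhost:9200/_bulk" --data-binary "@$f" | jq '.errors'
done

The jq '.errors' at the end just prints true or false per batch so you can spot chunks that failed.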

Finally, I have succeeded in sending the whole JSON file to Elasticsearch using Logstash. There was no need to split the file into chunks. Many thanks to the people in this discuss.elastic.co post, which inspired me a lot; their logs are very similar to mine. I want to post my config file here in order to help other people who may run into similar problems:

input {
  file {
    path => ["/home/...../file.json"]
    start_position => "beginning"
    # sincedb_path takes a plain string path
    sincedb_path => "/home/...../sincedb"
    codec => "json"
  }
}

filter {
  # _id, _index, _type and _source clash with Elasticsearch metadata fields,
  # so rename the originals before indexing
  mutate {
    rename => { "_id" => "idoriginal" }
    rename => { "_index" => "indexoriginal" }
    rename => { "_type" => "typeoriginal" }
    rename => { "_source" => "sourceoriginal" }
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "jsonlogs-%{+YYYY.MM.dd}"
  }
  
  stdout {
    codec => rubydebug
  }
}
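For anyone reproducing this, a config like the one above can be started with something along these lines (the path and config file name are just placeholders):

bin/logstash -f /path/to/json_to_es.conf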

If at some point after starting Logstash your disk becomes more than 90% full (because of the large amount of data you are sending to Elasticsearch) and you get an error like the one in this discuss.elastic.co post, you have to go to Dev Tools > Console and execute the request below (replace jsonlogs-2020.04.20 with your own index name):

PUT /jsonlogs-2020.04.20/_settings
{
  "index.blocks.read_only_allow_delete": null
}
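The same setting can also be changed from the terminal with curl, for example (again, substitute your own index name):

curl -XPUT -H "Content-Type: application/json" "localhost:9200/jsonlogs-2020.04.20/_settings" -d '{ "index.blocks.read_only_allow_delete": null }'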

I hope this will be useful to everyone who has had, or will have, problems similar to mine. Best wishes!
