I have about 1TB of data split into many smaller .json files in newline-delimited JSON (NDJSON) format. The sizes of the individual .json files vary between 500MB and 20GB.
The files are too big to load using the Bulk API. Of course I could split the files into smaller .json files and then load them using the Bulk API.
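For reference, this is roughly what I mean by "split and bulk load", sketched with the official Python Elasticsearch client; the index name, host, and chunk size are just placeholders:

```python
# Rough sketch of the "split and bulk load" approach, using the Python client's
# streaming_bulk helper so the whole file never has to sit in memory.
import json
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["localhost:9200"])  # placeholder host

def ndjson_actions(path, index="my-index"):
    # Read the NDJSON file line by line and turn each line into a bulk action.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                # "_type" is still needed on Elasticsearch 6.x; drop it on 7+.
                yield {"_index": index, "_type": "doc", "_source": json.loads(line)}

def load_file(path):
    # streaming_bulk sends the documents to the Bulk API in chunks.
    for ok, result in streaming_bulk(es, ndjson_actions(path), chunk_size=1000):
        if not ok:
            print("Failed:", result)

load_file("part-000.json")  # placeholder file name
```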
Is there a more elegant and more efficient way to do this?
This is very helpful, I will test this.
Another question: what about using Logstash to load the .json files into Elasticsearch? Is this fast, or is it more efficient to use the Bulk API directly?
Logstash also uses the Bulk API behind the scenes.
I'd say it can be easier to use Logstash, as you basically just have to configure it rather than write your own code.
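As a rough sketch only (paths, hosts, and index name are placeholders, not tested against your setup), a minimal pipeline for NDJSON files could look like this:

```
input {
  file {
    path => "/data/ndjson/*.json"      # placeholder path / glob pattern
    start_position => "beginning"
    sincedb_path => "/dev/null"        # reprocess from the start on every run
    codec => "json"                    # each line is one JSON document
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]        # placeholder host
    index => "my-index"                # placeholder index name
  }
}
```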
I've now set up Logstash. I could successfully load a single 20GB .json file with it.
However, when I just pass Logstash the path of the directory containing all my .json files and a *.json filename pattern, to tell it to load all of them (~1TB in total, max file size = 20GB), it starts, but after loading about a million documents I get the following error:
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid4539.hprof ...
Heap dump file created [825928181 bytes in 3.132 secs]
[2018-04-16T15:49:00,509][ERROR][org.logstash.Logstash ] java.lang.OutOfMemoryError: Java heap space
I already increased the Java max heap size in jvm.options from 1GB to 8GB, which lowered the CPU utilization a bit, but I get the same error.
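For reference, the heap settings in config/jvm.options now look roughly like this (I set both the minimum and maximum, which is the usual recommendation):

```
# config/jvm.options -- heap raised from the 1GB default
-Xms8g
-Xmx8g
```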
I also tried to reduce the number of Logstash workers from 10 to 4 (I have 10 cores) to slow Logstash down a bit, but I still got the same error.
Is it somehow possible to determine which .json file Logstash was working on when it crashed? Maybe one of the files is corrupt?
You are going to have a number of problems with memory unless we can find mechanisms to limit the number of documents in flight and minimise other memory usage in the file input.
But this is very dependent on the size of each NDJSON line. What are the smallest, largest, and median line sizes in bytes across the files you are trying to ingest?
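Something like this untested sketch would give those numbers; the glob pattern is a placeholder, and the median is estimated from a sample so you don't have to keep one number per line for ~1TB of data:

```python
# Report min, max, mean and an approximate median line size (bytes) across NDJSON files.
import glob
import random
import statistics

smallest, largest, total, count = float("inf"), 0, 0, 0
sample = []

for path in glob.glob("/data/ndjson/*.json"):   # placeholder pattern
    with open(path, "rb") as f:
        for line in f:
            n = len(line)
            smallest = min(smallest, n)
            largest = max(largest, n)
            total += n
            count += 1
            if random.random() < 0.001:          # keep ~0.1% of lines for the median estimate
                sample.append(n)

print("lines:", count)
print("smallest:", smallest, "largest:", largest, "mean:", total // max(count, 1))
print("median (sampled):", statistics.median(sample) if sample else "n/a")
```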
Some quick-win options (see the example settings after this list):
Reduce the batch size to, say, 10. Bulk indexing will not be as efficient, but more on that in a bit.
Reduce the worker count to 2, which means you will have 2 * 10 = 20 documents in flight.
In the file input, set max_open_files to 1 and close_older to 5 (seconds). This will help reduce the size of the temporary arrays created while iterating over the discovered files.
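Concretely, that would look roughly like this (a sketch only; the path and the rest of the pipeline are placeholders):

```
# config/logstash.yml
pipeline.batch.size: 10
pipeline.workers: 2
```

```
input {
  file {
    path => "/data/ndjson/*.json"   # placeholder path / glob pattern
    max_open_files => 1
    close_older => 5                # seconds
    start_position => "beginning"
    codec => "json"
  }
}
```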
When deserializing JSON, processing Logstash Event data structures, and then serializing the Events into a bulk indexing request, we end up with many copies of each string in memory. For example:
The file input reads 32K bytes into a buffer, extracts a line from the buffer and gives it to the codec.
The codec receives the line (the same object), but as it decodes it into keys and values it creates new objects in a Hash- or Map-like structure. For a short while, the original line string and the newly decoded objects are in memory at the same time (Java GC delays, IIRC).
The codec puts the Event into an in-memory blocking queue. For the version you have (6.2.3), the size of this queue is set to twice the number of in-flight Events (pipeline.batch.size * pipeline.workers * 2, i.e. 40 with the settings above).
Workers will remove a batch of Events from the queue and begin processing; meanwhile, the file input puts another 40 Events into the queue, so you now have 80 Event data structures in memory.
The ES output is thread-safe, so it processes all worker requests in parallel if the workers arrive at the output at much the same time. The batch of 20 Events is serialized into the bulk-index data structure, so now there is a JSON payload of the 20 Events plus the original 20 Event data structures. I also suspect that the actual HTTP transmission to ES might make another copy of the JSON.
So you can see that, with very large events and the standard settings, a lot of memory can be used up.
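As a back-of-envelope illustration only (the 1MB average line size and the number of copies are assumptions, not measurements from your data):

```python
def approx_event_memory_bytes(batch_size, workers, avg_event_bytes, copies=3):
    # Queue is sized at twice the in-flight events (batch_size * workers * 2);
    # while workers hold a drained batch the queue refills, so roughly twice
    # that many Events can exist at once. "copies" stands in for the raw line,
    # the decoded Event, and the serialized bulk payload.
    queue_size = batch_size * workers * 2
    events_in_memory = 2 * queue_size
    return events_in_memory * avg_event_bytes * copies

# With the reduced settings above (batch 10, 2 workers) and an assumed 1MB line:
print(approx_event_memory_bytes(10, 2, 1_000_000) / 1e9, "GB")    # ~0.24 GB
# With the default batch size (125) and 10 workers it is roughly 60x more:
print(approx_event_memory_bytes(125, 10, 1_000_000) / 1e9, "GB")  # ~15 GB
```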
Thanks a lot for this response.
I applied your suggestions, but I still get the same error. Also, this slows down the indexing process tremendously, so it would be great to find a solution with better performance.
Meanwhile, I profiled the Java heap using VisualVM, as described in https://www.elastic.co/guide/en/logstash/current/tuning-logstash.html#profiling-the-heap, to better understand what's happening.
Here is the VisualVM measurement:
... At some point, for some reason, the Java heap seems to explode.
Another observation I just made is that it always fails at the same point (after loading 1,331,027 documents), both with the standard configuration and with the configuration that guyboertje proposed. Does this indicate that something might be wrong with one of my .json files?
How can I check which .json file Logstash was processing when it failed?
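For now, the best idea I have is to scan the files myself for lines that are abnormally large or that are not valid JSON, something along these lines (the glob pattern is a placeholder; Python 3.6+):

```python
# Hunt for a problematic file: report, per file, the longest line and any line
# that fails to parse as JSON.
import glob
import json

for path in sorted(glob.glob("/data/ndjson/*.json")):   # placeholder pattern
    longest = 0
    bad_lines = 0
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, 1):
            if not line.strip():
                continue
            longest = max(longest, len(line))
            try:
                json.loads(line)
            except ValueError:
                bad_lines += 1
                print(f"{path}:{lineno} is not valid JSON")
    print(f"{path}: longest line = {longest} bytes, invalid lines = {bad_lines}")
```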
I've now split my .json files so that each contains at most 150,000 documents. This brings the maximum file size down to 3GB; most files are now <1GB.
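(For reference, I did the splitting with a small script along these lines; the paths are placeholders.)

```python
# Split each NDJSON file into parts of at most MAX_DOCS lines (documents).
import glob
import os

MAX_DOCS = 150_000

for path in glob.glob("/data/ndjson/*.json"):            # placeholder source dir
    base = os.path.splitext(os.path.basename(path))[0]
    part, count, out = 0, 0, None
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if out is None or count >= MAX_DOCS:
                if out:
                    out.close()
                out = open(f"/data/split/{base}-{part:04d}.json",  # placeholder target dir
                           "w", encoding="utf-8")
                part += 1
                count = 0
            out.write(line)
            count += 1
    if out:
        out.close()
```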
Still, I get the same Java heap space error, even when running with the configuration that guyboertje proposed.