My setup:
I have a very large newline-delimited JSON file, on the order of TBs. I want to ingest this file into my Elasticsearch server as quickly as possible. I have plenty of hardware and a cluster set up using Kubernetes.
I have tried many changes, such as modifying the batch size, number of workers, number of shards, memory, CPU, JVM heap and some other parameters. The maximum ingestion rate I am getting is around 0.1 million documents per second. I want to scale it to at least 10 million docs per second.
Does Logstash properly parallelise the ingestion, or am I missing something? I can split the file into multiple parts, but how do I configure Logstash to ingest those smaller files in parallel?
A couple of things need to be clear: the ingestion rate you get will depend far more on the speed of the source disk and the specs of your Elasticsearch nodes than on Logstash settings.
I would say that in the majority of ingestion issues the bottleneck is Elasticsearch, which cannot process the events fast enough.
What are the specs of your Elasticsearch nodes? CPUs, RAM, heap and, especially, the disk type. Have you already read the documentation on how to tune Elasticsearch for indexing speed?
0.1 million events per second is 100k e/s, which is a pretty good event rate; 10 million is pretty heavy and very hard to reach without a lot of resources on the Elasticsearch side.
The filters and outputs will run in parallel, but not the file input, so if you have one big file, Logstash will read it sequentially.
If you split the big file into multiple smaller files, you would need to use multiple file inputs, each of them reading from some files or file patterns (see the sketch below). But keep in mind that if they are on the same source disk this can have the opposite effect, as you would have more concurrent I/O requests, which could make things slower.
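If it helps, here is a minimal pipeline sketch along those lines, assuming the big file has been split into parts under a hypothetical /mnt/logstash/split/ directory. Each file input plugin runs in its own thread, so the two globs below are read in parallel, while pipeline.workers controls the parallelism of the filters and outputs. The paths, host and index name are just placeholders.

input {
  # each file input runs in its own thread, so these globs are read in parallel
  file {
    path => "/mnt/logstash/split/part-0*.json"
    mode => "read"
    sincedb_path => "/dev/null"
    codec => "json"
  }
  file {
    path => "/mnt/logstash/split/part-1*.json"
    mode => "read"
    sincedb_path => "/dev/null"
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "my-index"
  }
}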
Is this a one-time ingestion or something that needs to be done in real time at this same scale?
I have tuned Elasticsearch and have tried multiple tuning options the documentation mentions. This is mostly going to be a one-time thing, and I just want to quickly upload my input data to the server.
The benchmark of 10 million docs/s I mentioned was with the esrally benchmarking tool. But it is not standard and a bit messy to use, so I tried Logstash for this purpose.
I have multiple nodes set up in a Kubernetes cluster with 200 cores and TBs of RAM, with very fast disks as well. Therefore there is not going to be a system bottleneck anywhere (I was able to achieve the above rate using esrally anyway).
Can you provide a more general solution to this problem (not necessarily with Logstash)? My target is just to ingest a very large file into the server.
What options have you changed? In your other question about a template, your template didn't change the default refresh_interval, which can heavily impact performance.
What refresh_interval did you set for your index? Also, did you create a mapping for your data, or is it using dynamic mapping? This also impacts performance.
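As a reference, for a one-time bulk load it is common to disable refresh and replicas while indexing and restore them afterwards. A minimal sketch, assuming a hypothetical index called my-index:

PUT my-index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}

Once the load is finished, set refresh_interval back to something like 1s (or remove the setting to get the default) and restore the number of replicas.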
Please share the specs of your Elasticsearch nodes, the elasticsearch.yml and the tuning options you changed; without knowing what you already did, it is hard to provide any feedback.
Also, how many TBs are we talking about for this file? Tens of TBs? Hundreds of TBs?
Another thing: is this mount path a network share?
mountPath: /mnt/logstash/
This is not a trivial issue; there are many variables: disk speed, tuning configuration, network speed, etc.
I do not use Logstash to read files anymore; I normally put all my data into Kafka clusters and have multiple Logstash instances reading from the Kafka topics. But to get your data into Kafka you would also need some tool to read the file and put it on the topics.
I am not sure that is justified for a one-time ingestion, or even whether it would change anything.
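Since you asked for something more general than Logstash: for a one-time load you can also talk to the bulk API directly. Below is a rough sketch using the official Python client's parallel_bulk helper to stream the newline-delimited JSON file; the host, index name, file path, thread_count and chunk_size are all assumptions you would need to tune for your cluster.

import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

def actions(path, index):
    # Each line of the file is one JSON document; wrap it as a bulk action.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield {"_index": index, "_source": json.loads(line)}

# parallel_bulk sends several bulk requests concurrently; iterate the results
# so that failures are reported instead of silently dropped.
for ok, info in parallel_bulk(
    es,
    actions("/mnt/logstash/big-file.json", "my-index"),
    thread_count=8,
    chunk_size=5000,
):
    if not ok:
        print("failed:", info)

If you split the file, you can run several copies of a script like this, each pointed at its own part, the same way you would run multiple file inputs in Logstash.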