I have Logstash set up to read my CSV files and send them to ES. The CSV files are ~6.5 GB each, with ~4.7 million records per file and 400 columns per record. I am able to load about 2 million records/hr. Currently I have 100 CSV files with about 470 million records in total.

My Elasticsearch cluster currently consists of a single node with 30 shards. I have disabled indexing and replicas.

Would I benefit from another instance of Logstash on another server? Logstash is currently running on a 4-core Xeon processor. I set the -w flag to 4 and have also assigned additional memory to the JVM. My concern is: will I overwhelm ES by adding another Logstash instance, or will I at least be able to double my throughput? Logstash is running on a cloud server, so I can always scale up resources as well.

I have noticed that with the 4 cores I am seeing 429 errors. It does not seem additional processors would make much of a difference if I am already seeing 429 errors with 4 processors. My ES cluster is currently hosted in Elastic Cloud.
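A minimal sketch of this kind of Logstash 5.x setup, where the config file name and heap size are assumptions rather than the actual values in use:

```sh
# Heap would be raised by editing -Xms/-Xmx in Logstash's config/jvm.options
# before starting; the pipeline file name here is illustrative.
bin/logstash -f csv_pipeline.conf -w 4
```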
If you are seeing 429 errors, you are probably already overwhelming the cluster. What is the size of your Elastic Cloud cluster, and which versions of Logstash and Elasticsearch are you using?
Elasticsearch 5 and Logstash 5.2. I'm on a single-node cluster with 32 GB of RAM and a 760 GB disk, hosted in Elastic Cloud.
It seems like Elasticsearch is currently the bottleneck. As you are indexing into a large number of shards, increasing Logstash's internal batch size (-b) may help. Indexing into fewer shards at a time may also help, e.g. by splitting the load across a number of indices with fewer shards each (say, 10 indices of 3 shards, each covering 10% of the files), as in the sketch below.
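A minimal sketch of the batch-size change, assuming the same illustrative pipeline file as above; the value of 500 is a starting point to tune (the Logstash 5.x default is 125), not a recommendation:

```sh
# Larger bulk batches per worker, still 4 workers. Splitting the load across
# several smaller indices would then be a matter of pointing each run (or the
# elasticsearch output's index setting) at a different index.
bin/logstash -f csv_pipeline.conf -w 4 -b 500
```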
You can also increase the size of the box, which will give you access to more CPU.
Another way to increase throughput is to make Elasticsearch do less work per document by optimising mappings, e.g. specifying types for all the fields rather than relying on dynamic mappings, which tend to index all text fields both as keyword and text.
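A hedged sketch of creating one of the smaller indices with an explicit mapping, using Elasticsearch 5.x syntax; the endpoint, index name, type name and field names are placeholders, and each of the 400 columns would get a single concrete type instead of the text-plus-keyword pair that dynamic mapping produces:

```sh
curl -XPUT 'https://<your-cluster>.found.io:9243/records-001' -u elastic -d '
{
  "settings": { "number_of_shards": 3, "number_of_replicas": 0 },
  "mappings": {
    "record": {
      "properties": {
        "customer_id": { "type": "keyword" },
        "amount":      { "type": "float"   },
        "created_at":  { "type": "date"    },
        "notes":       { "type": "text"    }
      }
    }
  }
}'
```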
Thank you for your suggestions. I think I will need to increase the size of the box.
That is generally the easiest option, especially if you can do so just temporarily while you bulk load all these files. Optimising mappings takes more effort, but can also help reduce the amount of disk space your data takes up once indexed, as illustrated in this blog post, which is quite useful even though it is getting old.