We are using Logstash to read log data from Redis and index it into Elasticsearch 1.7.9. We are in the process of upgrading Elasticsearch to version 6.5.1, so we have set up an ES 6.5 cluster of 11 nodes (3 master, 6 data and 2 ingest) with a configuration similar to that of the ES 1.7 cluster.
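The node roles on the new cluster are declared roughly like this (a sketch of the relevant elasticsearch.yml lines only; everything else is carried over from the old cluster's settings):

```
# elasticsearch.yml on a dedicated master node
node.master: true
node.data: false
node.ingest: false

# on a data node
node.master: false
node.data: true
node.ingest: false

# on an ingest node
node.master: false
node.data: false
node.ingest: true
```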
We have two output sections in our Logstash conf file, one per cluster.
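The output block looks roughly like this (hosts and index names below are placeholders, not our real values):

```
output {
  # existing 1.7.9 cluster
  elasticsearch {
    hosts => ["http://old-es-node:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
  # new 6.5.1 cluster
  elasticsearch {
    hosts => ["http://new-es-node:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```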
We are trying to index the same data into both the old and the new cluster, thinking that it would not impact the current ES 1.7.9 cluster. As soon as we started indexing data to both clusters, we began experiencing slow indexing (3 minutes to 30 minutes delay) on both ES 1.7 and ES 6.5.
We also noticed TOO_MANY_REQUESTS (429) errors from Elasticsearch 6.5, for which Logstash kept retrying the requests indefinitely; that was our first suspect for the slow indexing. We later increased the thread_pool bulk queue size to fix the 429 errors, but it did not improve the indexing speed.
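The change we made was along these lines in elasticsearch.yml (a sketch; note that in ES 6.x the bulk thread pool is named write, and thread_pool.bulk.queue_size only works as a deprecated alias):

```
# elasticsearch.yml on the data nodes
thread_pool.write.queue_size: 600
```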
When we index only to the one Elasticsearch cluster (1.7.9), the indexing speed returns to normal. We have not tried indexing only to the new cluster, as all our monitoring is set up on the old cluster.
Here are our cluster configurations:
Logstash (6.5.1) is running with the default configuration.
Elasticsearch (per-index settings, sketched below):
5 primary shards and 1 replica per index
index refresh interval set to 5s
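Roughly, those settings correspond to an index template like this (a sketch; logs-* is a placeholder for our index pattern):

```
PUT _template/logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "refresh_interval": "5s"
  }
}
```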
As we are new to the Elastic Stack, we are not sure where exactly to look. Any hint/clue would be much appreciated.
How many indices are you actively indexing into? How many worker threads are you using? Have a look at this blog post for a discussion around bulk rejections.
When you have two Elasticsearch outputs, each batch will be sent to the outputs in sequence, which means that processing is likely to take longer, especially if one of the clusters is slower than the other or is experiencing issues with bulk rejections as you describe.
It was one index with around 40 document types in Elasticsearch 1.7, which turned into 40 different indices in Elasticsearch 6.5; those are what we are currently indexing into.
With respect to Logstash, we are running with the default pipeline workers configuration, which means 8 worker threads per Logstash instance since our machines have 8 CPU cores (we have a total of 6 Logstash instances configured).
With respect to Elasticsearch, we have a fixed thread pool of size 16 with a queue size of 600.
As mentioned earlier, we still see a delay in indexing even after fixing the TOO_MANY_REQUESTS (429) errors, so we really don't think that is causing the issue. As you mentioned, it could be due to the slowness of the new cluster.
We have an average of 1,000-1,500 events per second coming from various sources into Redis, from where the 6 Logstash instances pick the data up and index it into Elasticsearch. Does the existing configuration hold good for the aforementioned event rate?
All other configurations are similar to those of the old Elasticsearch 1.7 cluster, so we were not sure what to suspect and tune.
Should I try increasing the index refresh interval and/or decreasing the number of shards per index? These two settings seem to directly affect indexing speed. Also, is there any documentation available for benchmarking Elasticsearch performance?
If you were indexing into a single index with 5 primary shards and are now indexing into 40 indices each with 5 primary shards (200 shards), that probably explains the slowness, as you will be writing very small batches per shard and bulk request. I would recommend either consolidating back to a single index (with a field indicating document type that you can filter on), if the mappings allow it, or reducing the number of primary shards per index to 1. You may also want to make sure each index covers a longer time period so you do not end up with a very large number of small shards, which can be very inefficient.
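For the second option, a template along these lines would do it (a sketch; logs-* is a placeholder for your index pattern):

```
PUT _template/logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
```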
We cannot do that, as we have fields with the same name but different data types. We would end up with the same data type conflict problem we faced in ES 1.7.
We have done that (reduced each index to 1 primary shard) and also deleted all the old indices created with 5 primary shards to make sure the results were not skewed. Still, we noticed the delay in indexing.
Our suspicion then moved to Logstash, since we were running with the default configuration on all 6 instances. We started tuning the parameters and saw a good improvement in performance with the changes below:
Pipeline batch size: from 125 (default) to 1000, then to 1600 (stable)
JVM heap size: from 1 GB (default) to 2 GB, then to 4 GB (stable)
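In concrete terms, the changes look roughly like this (a sketch of logstash.yml and jvm.options; pipeline.workers was left at its default of one worker per CPU core):

```
# logstash.yml
pipeline.batch.size: 1600   # default is 125; we tried 1000 first

# jvm.options
-Xms4g
-Xmx4g
```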
This solved our problem. We are now able to process an average of 5k EPS with no delay at all.