We are using ES 6.2.4 and the contemporary Logstash version. We have a setup where we have Logstash read from a Kafka topic and write to ES. Some of the messages are larger than Kafka default message size so we increased the consumer buffer size and since then the messages were going in fine. We ran a load test loading all the log files into Kafka in x hours and it worked fine i.e., no lines dropped. We then increased the load to load all log files into Kafka in x/2 hours (double the load) and we see about 1/3 of lines dropped. None were dropped in Kafka (we measured the offsets before and after load test). The ES boxes had high CPU and disk utilization (over 80%) and Logstash logs had messages noting ES nodes are reachable or pool is empty. Similarly ES also had messages that it could not reach some nodes in the cluster. At the end of the run one heavily loaded index was yellow with one unassigned shard. The reason (allocation/explain) was "explanation": "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active". The rest of the votes were either allow or worse_balance or not allow from the node that has the primary shard.
Now all this is fine under load (if shards are getting constantly allocated) there isn't a chance to re balance some shard (or the shard in question) and due to GC or something the nodes could not be reached. These errors although with lower frequency would show even in our load tests (with lower load as mentioned above) that were successful. But now we are loosing data!.
Is it seen in the past? Are there some parameters we can change to avoid loss either in Logstash or ES? In the Logstash documentation I saw the ES output plugin tries forever to load data into ES but then there were other timing settings. Here is a snippet from Logstash documentation:
The following errors are retried infinitely:
Network errors (inability to connect)
429 (Too many requests) and
503 (Service unavailable) errors
The retry and give up settings all show deprecated. I saw a message in the past stating data loss at Losing messages during high traffic rate but this was a while ago.
Any tips or suggestions to avoid this situation are welcome. If we are reaching capacity limits how exactly do we determine the capacity limit so we don't get into this situation?