Note: You might want to experiment with the output.elasticsearch.bulk_max_size and output.elasticsearch.worker settings. The ideal values for bulk_max_size and worker differ from cluster to cluster.
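For reference, both settings live under the Elasticsearch output in filebeat.yml. The values below are only illustrative starting points (the hostnames are hypothetical), not recommendations:

```yaml
# filebeat.yml -- illustrative values only; tune per cluster
output.elasticsearch:
  hosts: ["es-node-1:9200", "es-node-2:9200", "es-node-3:9200"]  # hypothetical hosts
  worker: 4              # concurrent bulk requests per configured host
  bulk_max_size: 1600    # maximum number of events per bulk request
```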
What happens now if you have 2 or 3 Filebeats pointed at your 3 node cluster?
I've tried 12 and 16 workers, but didn't see any difference. I'll play around with the settings again.
From the filebeat logs, it looks like the queue fills up after about 2 minutes, stays full for about 2-3 minutes (which is when the packet drop happens), then empties completely (GC, maybe?) before filling up again.
I've been trying different combinations of queue.mem.events, queue.mem.flush.min_events and output.elasticsearch.bulk_max_size, but have not seen any significant improvement yet.
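One thing worth checking when trying those combinations is that the settings are consistent with each other. As a rough rule of thumb (hedged, not an official formula), the memory queue should hold at least worker × bulk_max_size events so every worker can be fed a full batch, and flush.min_events should match bulk_max_size so one flush fills one bulk request. A sketch of one internally consistent combination, with example values only:

```yaml
# filebeat.yml -- sketch of one consistent combination; values are examples
queue.mem:
  events: 12800            # >= worker * bulk_max_size (8 * 1600) so all workers get full batches
  flush.min_events: 1600   # match bulk_max_size so a flush fills exactly one bulk request
  flush.timeout: 1s        # flush partial batches after 1s under light load
output.elasticsearch:
  worker: 8
  bulk_max_size: 1600
```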
When going through the Elasticsearch logs, I did see a few JvmGcMonitorService warnings on one of the three nodes (the one most recently added to the cluster). It only appeared about 5 times in the last week, and only on this one node. No searching was happening on the cluster at the time, only indexing. Not sure if this is relevant.
[INFO ] [o.e.m.j.JvmGcMonitorService] [node-3] [gc] [young] [214426] [26447] duration [854ms], collections [1]/[1.5s], total [854ms]/[7.6m], memory [19.6gb]->[3.1gb]/[30gb], all_pools {[young] [16.5gb]->[16mb]/[0b]}{[old] [3gb]->[3gb]/[30gb]}{[survivor] [81mb]->[64.2mb]/[0b]}
[WARN ] [o.e.m.j.JvmGcMonitorService] [node-3] [gc] [young] [214426] overhead, spent [854ms] collecting in the last [1.5s]
Each node has a 30GB JVM heap. I currently have a total of 2750 shards, but I'm indexing into 8 shards at a time (a 40GB index with the max primary shard size set to 5GB).
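That shard count follows from the sizing: 40GB of index data divided by a 5GB primary shard cap gives 8 primary shards. If the 5GB cap is enforced through ILM rollover (available since Elasticsearch 7.13), the relevant setting would look roughly like this; the policy name is hypothetical and this is only a sketch of that one action:

```
PUT _ilm/policy/filebeat-rollover
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "5gb"
          }
        }
      }
    }
  }
}
```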