We have an application that 3 times a day indexes 100K-200K documents per node in 2 minutes, which causes huge spikes in GC time and pressure on the cluster.
I tried 1MB bulks (I will try raising that), resizing the bulk threadpool, and changing the refresh interval to 10s.
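A minimal sketch of that kind of setup with the Python client (the index name, payload, and exact values are placeholders, and the calls use the older elasticsearch-py `body=` style that matches the ES 2.x era of this thread):

```python
from elasticsearch import Elasticsearch, helpers

# Placeholder cluster address and index name.
es = Elasticsearch(["http://localhost:9200"])
INDEX = "docs"

# Relax the refresh interval for the duration of the load so every bulk
# request does not immediately force new segments to become searchable.
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "10s"}})

def actions(documents):
    for doc in documents:
        yield {"_index": INDEX, "_type": "doc", "_source": doc}

documents = ({"field": i} for i in range(100000))     # placeholder payload
helpers.bulk(es, actions(documents), chunk_size=500)  # keep each request modest in size

# Restore the default refresh behaviour once the load is finished.
es.indices.put_settings(index=INDEX, body={"index": {"refresh_interval": "1s"}})
```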
Ultimately you need to pay the "cost" of indexing, so other than spreading the load over a longer time, you could add more nodes during the indexing process.
It is better if the ingestion happens smoothly, e.g. stretching the 2 minutes into 20 minutes, or consuming the documents from Kafka at a steady rate. In any case, a profile would help you find the bottleneck: during heavy ingestion, CPU, memory, and I/O can all be heavily consumed, and a simple vmstat run can point to the likely bottleneck.
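As an illustration of the pacing idea, a small sketch with the Python client; the 20-minute window, batch size, and document count are made-up numbers:

```python
import time
from elasticsearch import Elasticsearch, helpers

# Placeholder cluster address; the pacing targets below are illustrative only.
es = Elasticsearch(["http://localhost:9200"])

def throttled_bulk(action_iter, total_docs=200000, batch_size=1000, window_seconds=20 * 60):
    # Spread the batches roughly evenly across the target window.
    pause = window_seconds / max(total_docs // batch_size, 1)
    batch = []
    for action in action_iter:
        batch.append(action)
        if len(batch) >= batch_size:
            helpers.bulk(es, batch)
            batch = []
            time.sleep(pause)  # crude pacing; a token bucket would be smoother
    if batch:
        helpers.bulk(es, batch)
```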
Besides that, I think G1 could also be tried. In my test with all default settings, ingesting 2M docs into ES 2.2, G1 stopped the world for a total of 3s 318ms (362 pauses) while CMS stopped for 8s 24ms (1190 pauses). So if you see a lot of full GCs, or even concurrent mode failures, G1 (enabled with -XX:+UseG1GC) could be an option when JDK 8 is used.
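For anyone who wants to reproduce that kind of comparison, here is a small sketch that sums the stop-the-world pauses from a HotSpot GC log. It assumes the node was started with -XX:+PrintGCApplicationStoppedTime (JDK 8 style logging); unified logging on newer JDKs uses a different format.

```python
import re
import sys

# Matches JDK 8 HotSpot safepoint log lines such as:
# "Total time for which application threads were stopped: 0.0031540 seconds, ..."
PAUSE_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def summarize(path):
    total, count = 0.0, 0
    with open(path) as log:
        for line in log:
            match = PAUSE_RE.search(line)
            if match:
                total += float(match.group(1))
                count += 1
    return total, count

if __name__ == "__main__":
    total, count = summarize(sys.argv[1])  # e.g. python gc_pauses.py gc.log
    print("total stop-the-world time: %.3f s over %d pauses" % (total, count))
```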
Problem solved. For other users who encounter this problem:
Spreading the bulks over a longer timeframe clearly helped.
Important note -
Our cluster had 4 nodes with 4 CPU cores each, and every node held 10 shards (including replicas).
We added 2 CPU cores to each node, which mitigated the GCs significantly. It is worth noting that the CPU never appeared to be working very hard, so this came as a surprise.
My guess is that indexing happens per shard, so the CPUs had to context-switch a lot, which delayed the concurrent searches. Alternatively, we probably could have reindexed the cluster into a significantly smaller number of shards, which would likely have helped too, since the CPU was nowhere near its limit.
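For completeness, a rough sketch of that alternative with the Python client: copy an over-sharded index into a new one with fewer primaries via scan + bulk. The index names and shard counts are placeholders; pick numbers that fit your own data volume.

```python
from elasticsearch import Elasticsearch, helpers

# Placeholder cluster address and index names.
es = Elasticsearch(["http://localhost:9200"])
SOURCE, TARGET = "docs_v1", "docs_v2"

# Fewer primary shards means less per-shard indexing overhead on small nodes.
es.indices.create(index=TARGET, body={
    "settings": {"number_of_shards": 2, "number_of_replicas": 1}
})

def copy_actions():
    for hit in helpers.scan(es, index=SOURCE):
        yield {
            "_index": TARGET,
            "_type": hit.get("_type", "doc"),
            "_id": hit["_id"],
            "_source": hit["_source"],
        }

helpers.bulk(es, copy_actions())
```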