I have an Elasticsearch 5.1 cluster of about 200 million documents spread across 30 indices (150 shards total), running on 10 nodes. I index about 2,000 documents per minute and serve around 6,000 queries per minute.
I have an automated job that calls the snapshot API to back up the entire cluster once per hour. About 30 seconds after the snapshot starts (according to /_cat/snapshots), I see a wave of timeouts on both indexing and query requests. Roughly half the requests exceed my 3-second client timeout, while the rest complete normally (around 50 ms). The unresponsiveness typically lasts around 30 seconds, but can last anywhere from 5 seconds to a minute. During this window, the tasks API shows hundreds of read and write requests piling up across all nodes, with running times between 2 and 9 seconds.
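For reference, the hourly job boils down to something like the following (the repository name `hourly_backups` and the snapshot naming scheme are placeholders, not the exact names I use):

```shell
# Trigger a full-cluster snapshot, returning immediately rather than
# blocking until it finishes.
curl -XPUT "http://localhost:9200/_snapshot/hourly_backups/snap_$(date +%Y%m%d%H)?wait_for_completion=false"

# This is how I watch snapshot progress:
curl -XGET 'http://localhost:9200/_cat/snapshots/hourly_backups?v'

# And this is how I see the piled-up requests during the stall,
# filtered to search and bulk actions:
curl -XGET 'http://localhost:9200/_tasks?detailed=true&actions=*search*,*bulk*&pretty'
```

Nothing else runs on the cluster at that time, so the stall correlates cleanly with the snapshot start.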
Has anyone seen this behavior and is there a way to address it? Thanks.