We are running a 6-node Elasticsearch cluster with a single actively used index (other indices exist, but only this one receives traffic) that is 3.5TB in size (51 shards, 25 of them primary).
We have an NFS share mounted on all nodes and registered as a snapshot repository.
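For reference, the repository was registered along these lines (the repository name and mount point below are placeholders, not our actual values):

```
# elasticsearch.yml on every node: the shared mount must be whitelisted
path.repo: ["/mnt/es_snapshots"]

# Register the shared filesystem repository
curl -X PUT "localhost:9200/_snapshot/nfs_backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/es_snapshots"
  }
}'
```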
We triggered a snapshot (our first one), but query performance degraded severely while it was running, so we canceled it.
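Roughly how we started and then aborted it, using the standard snapshot APIs (repository and snapshot names are placeholders):

```
# Start the snapshot in the background
curl -X PUT "localhost:9200/_snapshot/nfs_backup/snapshot_1?wait_for_completion=false"

# Cancel it: deleting an in-progress snapshot aborts it
curl -X DELETE "localhost:9200/_snapshot/nfs_backup/snapshot_1"
```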
My guess is that this is expected, but what can we do to avoid or at least reduce this search performance drop?
The snapshot process should be throttled enough not to affect querying that much. Unless the throttling settings or the number of threads in the snapshot thread pool were changed, there shouldn't be a visible impact.

That said, even with throttling a snapshot adds some disk I/O and network load, plus a little memory overhead, so if any nodes in the cluster were already running very close to their resource limits, a snapshot could be the last straw that tips them into overload. I have seen that a few times with S3 repositories, which used to buffer in memory, but it's not common for a shared file system repository. So I am a bit puzzled here. @tlrx, any other thoughts?
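As a quick sketch of what to double-check (again with a placeholder repository name), you can inspect the repository settings and the snapshot thread pool, and lower the per-node throttle by re-registering the repository with an explicit `max_snapshot_bytes_per_sec`:

```
# Show the repository config; if unset, the snapshot throttle typically defaults to 40mb/s per node
curl -X GET "localhost:9200/_snapshot/nfs_backup"

# Show the snapshot thread pool on each node
curl -X GET "localhost:9200/_cat/thread_pool/snapshot?v&h=node_name,name,size,active,queue"

# Re-register the repository with a lower throttle (example value only)
curl -X PUT "localhost:9200/_snapshot/nfs_backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/es_snapshots",
    "max_snapshot_bytes_per_sec": "20mb"
  }
}'
```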
@Christian_Dahlqvist, the IO wait on the nodes looks reasonable during the snapshot - mostly below 20%, with occasional peaks that stayed below 70%.
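(Measured at the OS level on each node, with something along the lines of:)

```
# Per-device utilisation and CPU iowait, sampled every 5 seconds
iostat -x 5
```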
@Igor_Motov, that's what I expected, thanks for the clarification. The cluster was far from its limits at the time of the snapshot, so I'm not sure why this happened.
Anyway, a snapshot is in progress right now and the performance looks normal.
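We're keeping an eye on its progress with the status API (names are placeholders again):

```
# Per-shard progress of the running snapshot
curl -X GET "localhost:9200/_snapshot/nfs_backup/snapshot_2/_status"

# High-level state of all snapshots in the repository
curl -X GET "localhost:9200/_snapshot/nfs_backup/_all"
```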