Query performance drop while doing a snapshot

We are running a 6-node Elasticsearch cluster with what is effectively a single index (there are other indices, but only this one is actively used), 3.5TB in size (51 shards, 25 of them primaries).

We have an NFS share mounted on all nodes and registered as a snapshot repository.
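(In case it matters: the repository was registered roughly like the following; the repository name and body below are illustrative rather than the exact ones we used.)

PUT /_snapshot/es-v5
{
  "type": "fs",
  "settings": {
    "location": "/mnt/elasticsearch/snapshot/e1/es-v5"
  }
}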

We triggered a snapshot (the first one), but query performance degraded severely while it was running, so we canceled it.
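(For reference, we started and then aborted the snapshot with calls along these lines; the repository and snapshot names are placeholders:)

PUT /_snapshot/es-v5/snapshot_1?wait_for_completion=false
DELETE /_snapshot/es-v5/snapshot_1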

My guess is that this is expected, but what can we do to avoid or at least reduce this search performance drop?

Thank you!

That's not supposed to be happening. Did you change any default settings?

Do you have any particular settings in mind?

Yes, but share what you have: node settings and index settings, please.

Here's the output from the GET /_nodes/stats and GET /index_name/_settings API calls.

If those are not the settings you had in mind please tell me the exact calls you want to see the output from.

Thanks!

(Note that the node, cluster and index names have been sanitized for privacy purposes.)

Can you share your elasticsearch.yml file?

Here you go:

# cat /etc/elasticsearch/elasticsearch.yml | grep -vE '^#'
cluster.name: elastic
node.name: node-3
path.data: /data/elastic,/elastic
path.logs: /var/log/elasticsearch
path.repo: ["/mnt/elasticsearch/snapshot/e1/es-v5"]
network.host: _eth1_,_local_
discovery.zen.ping.unicast.hosts: ["172.31.31.10","172.31.31.21","172.31.31.23","172.31.31.24","172.31.31.25"]
discovery.zen.minimum_master_nodes: 4

The rest of the nodes are configured similarly (the node name differs, for example).

Just to clarify - all nodes are at version 5.6.1.

I don't see anything obvious.

Could you run the hot_threads API while the snapshot is running? Maybe we can find something.
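Something along these lines should do (the parameters are just a starting point, adjust as needed):

GET /_nodes/hot_threads?threads=10&interval=500ms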

Cc @Igor_Motov

What type of storage do you have? What do disk I/O and iowait look like when you are snapshotting compared to when you are not?
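For example, something like this on each data node, both during a snapshot and outside one, would give a useful comparison (plain OS tooling, nothing Elasticsearch specific):

# extended per-device statistics every 5 seconds
iostat -x 5
# iowait shows up in the "wa" column
vmstat 5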

I can't start a snapshot right now. I will post an update when we are able to run another snapshot (most probably tomorrow).

The snapshot process should be throttled enough not to affect querying that much. Unless the throttling settings were changed, or the number of snapshot threads in the thread pool settings was changed, there shouldn't be a visible impact. We also need to consider that, even with throttling, a snapshot definitely adds some disk I/O, network and a little bit of memory load, so if any nodes in the cluster were already very high in resource utilization, it is possible for a snapshot to be the last drop that triggers an overload. I have seen that a few times with the S3 repository, which used to use memory for buffering, but it's not common for a shared file system repository. So I am a bit puzzled here. @tlrx any other thoughts?
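For what it's worth, ruling those two out would look roughly like this; the repository name below is a placeholder, and the first call only shows max_snapshot_bytes_per_sec if it was explicitly set (the default is 40mb per second):

GET /_snapshot/es-v5
GET /_cat/thread_pool/snapshot?v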

@dadoonet Here's the hot_threads API response.

@Christian_Dahlqvist, the I/O wait on the nodes looks reasonable during the snapshot: most of the time below 20%, with some peaks, but still below 70%.

@Igor_Motov, that's what I expected, thanks for the clarification. The cluster was far from its limits at the time of the snapshot, so I'm not sure why this happened.

Anyway, a snapshot is in progress right now and the performance looks normal.

I'll keep monitoring the service.
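I'm checking progress with calls along these lines (the snapshot name here is illustrative):

GET /_snapshot/es-v5/snapshot_1/_status
GET /_cat/snapshots/es-v5?v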

Thanks!
