Influence of Master Nodes and Large Cluster State on Search

Hey,

We are running an Elasticsearch cluster in production. The setup is:

  • 10 data nodes - 8 CPUs and 64 GB RAM, half of it heap
  • 3 master nodes - 4 CPUs and 16 GB RAM, half of it heap

The cluster holds a large number of indices, around 18,000 with 1 replica each (36,000 shards in total).

We have a problem where one in every few search requests comes back slow (investigating with the Profile API led us to conclude it's not the search itself, as the profiling metrics look fine, but something else in the infrastructure).
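
For reference, we profile the slow queries with something like the request below (the index name and query here are just placeholders, not our real ones):

# "profile": true returns a per-shard timing breakdown alongside the hits.
GET /my-index/_search
{
  "profile": true,
  "query": {
    "match": { "title": "example" }
  }
}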

Our suspicion is that the problem is the large number of indices in the cluster, or something similar related to the master nodes. After increasing the CPU and memory of the master nodes we got much better search performance in our app.

Looking at the Elastic docs, we found this:
Elasticsearch is a peer to peer based system, in which nodes communicate with one another directly. The high-throughput APIs (index, delete, search) do not normally interact with the master node. The responsibility of the master node is to maintain the global cluster state and reassign shards when nodes join or leave the cluster. Each time the cluster state is changed, the new state is published to all nodes in the cluster as described above.

It seems the master nodes are not involved in the high-throughput APIs, so how can our problem, and the improvement in search after beefing up the master nodes, be explained?
We would be happy to understand this, as it is not clear from the docs.

Thanks

Are you making sure you are not sending requests to the dedicated master nodes? Which version of Elasticsearch are you using?
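
A quick way to check is to list the node roles and confirm which nodes your clients are actually hitting (the column selection below is just one convenient view):

# Master-only nodes have "m" but no "d" under node.role;
# the elected master is marked with "*" in the master column.
GET _cat/nodes?v&h=name,node.role,master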

Might it be something else that keeps the master nodes busy, e.g. use of dynamic mappings that causes large numbers of cluster state updates?
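
If that turns out to be the case, one option is to lock the mappings down so that new fields are rejected instead of being added dynamically (each dynamic mapping update is a cluster state change the master has to publish). The index name below is only an illustration:

# Reject documents containing unmapped fields instead of adding them dynamically.
PUT /my-index/_mapping
{
  "dynamic": "strict"
}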

That's far too many shards for the number of nodes you have, roughly 3,600 per data node!

Yes, that's true, but I still want to understand how it is supposed to affect search latency, and the reason for the odd behaviour of getting a slow response once every couple of requests.

We also see this behaviour of a slow response once every couple of requests when sending the requests to non-master nodes.

We are using Version 7.8.0.

I understand the master nodes are busy, but I still don't get why that should affect search latency as long as the cluster is healthy.

What is the output from the _cluster/stats?pretty&human API?

{
"_nodes" : {
"total" : 16,
"successful" : 16,
"failed" : 0
},
"cluster_name" : "elastic-app-3",
"cluster_uuid" : "lHGYdcS3SZSommD90SjMsw",
"timestamp" : 1621333544411,
"status" : "green",
"indices" : {
"count" : 18675,
"shards" : {
"total" : 37350,
"primaries" : 18675,
"replication" : 1.0,
"index" : {
"shards" : {
"min" : 2,
"max" : 2,
"avg" : 2.0
},
"primaries" : {
"min" : 1,
"max" : 1,
"avg" : 1.0
},
"replication" : {
"min" : 1.0,
"max" : 1.0,
"avg" : 1.0
}
}
},
"docs" : {
"count" : 237942647,
"deleted" : 20796261
},
"store" : {
"size" : "3.3tb",
"size_in_bytes" : 3630614023017
},
"fielddata" : {
"memory_size" : "2.4mb",
"memory_size_in_bytes" : 2555664,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "17.8mb",
"memory_size_in_bytes" : 18700027,
"total_count" : 2159081,
"hit_count" : 127302,
"miss_count" : 2031779,
"cache_size" : 7108,
"cache_count" : 9693,
"evictions" : 2585
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 224177,
"memory" : "2.3gb",
"memory_in_bytes" : 2556240876,
"terms_memory" : "1.6gb",
"terms_memory_in_bytes" : 1778862952,
"stored_fields_memory" : "110.9mb",
"stored_fields_memory_in_bytes" : 116314216,
"term_vectors_memory" : "117mb",
"term_vectors_memory_in_bytes" : 122740800,
"norms_memory" : "300.4mb",
"norms_memory_in_bytes" : 315089664,
"points_memory" : "0b",
"points_memory_in_bytes" : 0,
"doc_values_memory" : "212.8mb",
"doc_values_memory_in_bytes" : 223233244,
"index_writer_memory" : "20.6gb",
"index_writer_memory_in_bytes" : 22121363846,
"version_map_memory" : "133.7mb",
"version_map_memory_in_bytes" : 140211008,
"fixed_bit_set" : "0b",
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : -1,
"file_sizes" : { }
},
"mappings" : {
"field_types" : [
{
"name" : "boolean",
"count" : 18675,
"index_count" : 18675
},
{
"name" : "date",
"count" : 37350,
"index_count" : 18675
},
{
"name" : "flattened",
"count" : 18675,
"index_count" : 18675
},
{
"name" : "integer",
"count" : 18675,
"index_count" : 18675
},
{
"name" : "keyword",
"count" : 186750,
"index_count" : 18675
},
{
"name" : "object",
"count" : 74700,
"index_count" : 18675
},
{
"name" : "text",
"count" : 504225,
"index_count" : 18675
},
{
"name" : "token_count",
"count" : 18675,
"index_count" : 18675
}
]
},
"analysis" : {
"char_filter_types" : ,
"tokenizer_types" : [
{
"name" : "char_group",
"count" : 37350,
"index_count" : 18675
},
{
"name" : "edge_ngram",
"count" : 18675,
"index_count" : 18675
}
],
"filter_types" : [
{
"name" : "edge_ngram",
"count" : 18675,
"index_count" : 18675
},
{
"name" : "length",
"count" : 18675,
"index_count" : 18675
},
{
"name" : "shingle",
"count" : 18675,
"index_count" : 18675
},
{
"name" : "stemmer",
"count" : 18675,
"index_count" : 18675
},
{
"name" : "stop",
"count" : 18675,
"index_count" : 18675
}
],
"analyzer_types" : [
{
"name" : "custom",
"count" : 280125,
"index_count" : 18675
}
],
"built_in_char_filters" : ,
"built_in_tokenizers" : [
{
"name" : "standard",
"count" : 93375,
"index_count" : 18675
},
{
"name" : "whitespace",
"count" : 74700,
"index_count" : 18675
}
],
"built_in_filters" : [
{
"name" : "asciifolding",
"count" : 280125,
"index_count" : 18675
},
{
"name" : "lowercase",
"count" : 280125,
"index_count" : 18675
},
{
"name" : "remove_duplicates",
"count" : 37350,
"index_count" : 18675
},
{
"name" : "word_delimiter_graph",
"count" : 74700,
"index_count" : 18675
}
],
"built_in_analyzers" : [
{
"name" : "keyword",
"count" : 18675,
"index_count" : 18675
},
{
"name" : "standard",
"count" : 18675,
"index_count" : 18675
}
]
}
},
"nodes" : {
"count" : {
"total" : 16,
"coordinating_only" : 0,
"data" : 10,
"ingest" : 10,
"master" : 3,
"ml" : 16,
"remote_cluster_client" : 16,
"transform" : 10,
"voting_only" : 0
},
"versions" : [
"7.8.0"
],
"os" : {
"available_processors" : 116,
"allocated_processors" : 116,
"names" : [
{
"name" : "Linux",
"count" : 16
}
],
"pretty_names" : [
{
"pretty_name" : "CentOS Linux 7 (Core)",
"count" : 16
}
],
"mem" : {
"total" : "751gb",
"total_in_bytes" : 806380109824,
"free" : "88.3gb",
"free_in_bytes" : 94815694848,
"used" : "662.6gb",
"used_in_bytes" : 711564414976,
"free_percent" : 12,
"used_percent" : 88
}
},
"process" : {
"cpu" : {
"percent" : 341
},
"open_file_descriptors" : {
"min" : 630,
"max" : 29896,
"avg" : 18554
}
},
"jvm" : {
"max_uptime" : "68.7d",
"max_uptime_in_millis" : 5943619353,
"versions" : [
{
"version" : "14.0.1",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "14.0.1+7",
"vm_vendor" : "AdoptOpenJDK",
"bundled_jdk" : true,
"using_bundled_jdk" : true,
"count" : 16
}
],
"mem" : {
"heap_used" : "242.7gb",
"heap_used_in_bytes" : 260628786424,
"heap_max" : "408gb",
"heap_max_in_bytes" : 438086664192
},
"threads" : 2688
},
"fs" : {
"total" : "22.9tb",
"total_in_bytes" : 25194268372992,
"free" : "19.5tb",
"free_in_bytes" : 21524646838272,
"available" : "19.5tb",
"available_in_bytes" : 21524646838272
},
"plugins" : [
{
"name" : "repository-azure",
"version" : "7.8.0",
"elasticsearch_version" : "7.8.0",
"java_version" : "1.8",
"description" : "The Azure Repository plugin adds support for Azure storage repositories.",
"classname" : "org.elasticsearch.repositories.azure.AzureRepositoryPlugin",
"extended_plugins" : ,
"has_native_controller" : false
},
{
"name" : "prometheus-exporter",
"version" : "7.8.0.0",
"elasticsearch_version" : "7.8.0",
"java_version" : "1.8",
"description" : "Export Elasticsearch metrics to Prometheus",
"classname" : "org.elasticsearch.plugin.prometheus.PrometheusExporterPlugin",
"extended_plugins" : ,
"has_native_controller" : false
},
{
"name" : "repository-s3",
"version" : "7.8.0",
"elasticsearch_version" : "7.8.0",
"java_version" : "1.8",
"description" : "The S3 repository plugin adds S3 repositories",
"classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
"extended_plugins" : ,
"has_native_controller" : false
}
],
"network_types" : {
"transport_types" : {
"security4" : 16
},
"http_types" : {
"security4" : 16
}
},
"discovery_types" : {
"zen" : 16
},
"packaging_types" : [
{
"flavor" : "default",
"type" : "docker",
"count" : 16
}
],
"ingest" : {
"number_of_pipelines" : 2,
"processor_stats" : {
"gsub" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time" : "0s",
"time_in_millis" : 0
},
"script" : {
"count" : 0,
"failed" : 0,
"current" : 0,
"time" : "0s",
"time_in_millis" : 0
}
}
}
}
}

That's 3,735 shards per data node, with an average shard size well under 1 GB (3.3 TB across 37,350 shards works out to roughly 90 MB per shard).
You are wasting far too much heap managing those shards, which is very likely the bulk of the problem you are seeing.
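
As a sketch of where to start, check how small the shards really are and then consolidate groups of small indices into fewer, larger ones (the index names below are placeholders, assuming the small indices share a compatible mapping):

# List shards sorted by on-disk size.
GET _cat/shards?v&h=index,shard,prirep,store&s=store

# Copy a batch of small indices into one consolidated index.
POST _reindex
{
  "source": { "index": ["small-index-000001", "small-index-000002"] },
  "dest": { "index": "consolidated-index" }
}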

The only thing I can think of is that, with more resources, your master nodes can better handle the allocation tables for the sheer number of shards in your cluster. Frankly though, focusing on that is missing the forest for the trees; if you don't reduce the shard count you are just asking for continued problems.
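
If you want to confirm whether the masters are actually the bottleneck, it may help to watch the pending cluster state tasks and the elected master's hot threads while a slow search is happening:

# Cluster state updates queued on the elected master.
GET _cluster/pending_tasks

# CPU sampling of the elected master node.
GET _nodes/_master/hot_threads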
