Hi,
In my Elasticsearch 6.8 cluster I've had big problems handling the huge number of indices and the shards that come with them.
We use a frontend application from a third-party vendor, so we have very limited control over the settings in the templates for our indices.
Every day we create about 10 indices with 3 shards each. Some indices are as large as 50 GB, whilst the smallest ones are just a few hundred MB. Unfortunately we cannot use different shard settings for different indices without interfering with the application itself.
I want to add that in the beginning we used 10 shards per index and 1 replica, later on 5 shards, and for the past week we have used 3 shards per index (following instructions from the vendor of our third-party application).
That is why the number of shards is so large.
Regards,
Tony
We only index data into the daily indices, so after 24 hours the only thing we do with that data is run user-driven searches.
My intention is that all data from the last month should be quick to search, while slower queries are acceptable for data older than one month. Therefore I've frozen the indices older than one month.
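To make the freezing step concrete, here is a minimal sketch of how the indices older than one month could be picked out and turned into freeze requests. The index naming scheme (`<prefix>-YYYY.MM.DD`) and the helper names are assumptions for illustration only; the real names created by the vendor application are not shown in this post. The sketch only builds the request paths for the `_freeze` endpoint that exists from 6.6 onwards and does not contact a cluster.

```python
from datetime import date, timedelta


def indices_to_freeze(index_names, today, max_age_days=30):
    """Return the names whose date suffix is older than max_age_days.

    Assumes a hypothetical "<prefix>-YYYY.MM.DD" naming scheme.
    """
    cutoff = today - timedelta(days=max_age_days)
    old = []
    for name in index_names:
        try:
            year, month, day = name.rsplit("-", 1)[1].split(".")
            index_day = date(int(year), int(month), int(day))
        except (IndexError, ValueError):
            continue  # skip names that do not match the assumed pattern
        if index_day < cutoff:
            old.append(name)
    return old


def freeze_requests(index_names):
    """Build (method, path) pairs for the index freeze API."""
    return [("POST", f"/{name}/_freeze") for name in index_names]


if __name__ == "__main__":
    names = ["logs-2021.01.01", "logs-2021.04.10", "not-a-daily-index"]
    for method, path in freeze_requests(indices_to_freeze(names, date(2021, 4, 13))):
        print(method, path)
```

As far as I understand it, searches skip frozen indices by default, so queries that should cover the older data need `ignore_throttled=false` on the search request.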
Because of the huge number of indices and shards we still need plenty of hot data nodes, especially if we want to keep at least one month of indices on those nodes.
Our cluster is oversized in every respect except one: handling the number of shards per node.
We need six months of live, searchable data, so I can't close indices younger than that.
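To show where the shard count comes from, here is a rough back-of-the-envelope calculation using only the approximate figures from this post (about 10 indices per day, 1 replica, and 10, 5 or 3 primaries per index). Taking six months as 180 days is my simplification, and the retained totals assume the same shard setting was used for the whole window, which is not the case for us since the setting changed over time.

```python
# Approximate figures from the post; RETENTION_DAYS is a simplifying assumption.
INDICES_PER_DAY = 10   # "about 10 indices" created per day
REPLICAS = 1           # one replica copy per primary
RETENTION_DAYS = 180   # six months taken as roughly 180 days


def shards_per_day(primaries, replicas=REPLICAS):
    """Total shards (primaries plus replica copies) created per day."""
    return INDICES_PER_DAY * primaries * (1 + replicas)


def shards_retained(primaries, days=RETENTION_DAYS):
    """Shards kept open if every day in the window used this setting."""
    return shards_per_day(primaries) * days


if __name__ == "__main__":
    for primaries in (10, 5, 3):
        print(
            f"{primaries} primaries: {shards_per_day(primaries)} shards/day, "
            f"{shards_retained(primaries)} shards over {RETENTION_DAYS} days"
        )
```

Even with 3 primaries per index this works out to 60 new shards per day, so the total only shrinks slowly as the older 10-shard and 5-shard indices age out.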
So my questions are:
Is it normal for the cluster state to go RED for a short period when indices are frozen? My guess is that the index has to be closed to free up memory and is then reopened as frozen.
Why do we get timeouts when relocating shards?
Why do we have so much garbage collection overhead on our master node (about 300 ms every second)? And is this related to the question above?
Some cluster stats and settings are below.
{
"_nodes" : {
"total" : 36,
"successful" : 36,
"failed" : 0
},
"cluster_name" : "elasticsearch",
"cluster_uuid" : "a48YZD-jS8y3ZA7oRFgV6A",
"timestamp" : 1618309513714,
"status" : "green",
"indices" : {
"count" : 1088,
"shards" : {
"total" : 12664,
"primaries" : 6332,
"replication" : 1.0,
"index" : {
"shards" : {
"min" : 2,
"max" : 20,
"avg" : 11.639705882352942
},
"primaries" : {
"min" : 1,
"max" : 10,
"avg" : 5.819852941176471
},
"replication" : {
"min" : 1.0,
"max" : 1.0,
"avg" : 1.0
}
}
},
"docs" : {
"count" : 12078058746,
"deleted" : 1016462
},
"store" : {
"size" : "10.2tb",
"size_in_bytes" : 11238806411712
},
"fielddata" : {
"memory_size" : "70.6mb",
"memory_size_in_bytes" : 74087576,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "3.3gb",
"memory_size_in_bytes" : 3607510315,
"total_count" : 17263618,
"hit_count" : 3685677,
"miss_count" : 13577941,
"cache_size" : 81824,
"cache_count" : 581471,
"evictions" : 499647
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 46154,
"memory" : "9.8gb",
"memory_in_bytes" : 10590345368,
"terms_memory" : "7.9gb",
"terms_memory_in_bytes" : 8538526673,
"stored_fields_memory" : "828mb",
"stored_fields_memory_in_bytes" : 868315608,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "16kb",
"norms_memory_in_bytes" : 16384,
"points_memory" : "641.4mb",
"points_memory_in_bytes" : 672602535,
"doc_values_memory" : "487.2mb",
"doc_values_memory_in_bytes" : 510884168,
"index_writer_memory" : "696.2mb",
"index_writer_memory_in_bytes" : 730110774,
"version_map_memory" : "15.3mb",
"version_map_memory_in_bytes" : 16110759,
"fixed_bit_set" : "20.7mb",
"fixed_bit_set_memory_in_bytes" : 21723760,
"max_unsafe_auto_id_timestamp" : 1618272003239,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 36,
"data" : 31,
"coordinating_only" : 0,
"master" : 3,
"ingest" : 2
},
"versions" : [
"6.8.7"
],
"os" : {
"available_processors" : 304,
"allocated_processors" : 304,
"names" : [
{
"name" : "Linux",
"count" : 36
}
],
"pretty_names" : [
{
"pretty_name" : "Red Hat",
"count" : 2
},
{
"pretty_name" : "OpenShift",
"count" : 34
}
],
"mem" : {
"total" : "2tb",
"total_in_bytes" : 2210329694208,
"free" : "106.8gb",
"free_in_bytes" : 114770812928,
"used" : "1.9tb",
"used_in_bytes" : 2095558881280,
"free_percent" : 5,
"used_percent" : 95
}
},
"process" : {
"cpu" : {
"percent" : 57
},
"open_file_descriptors" : {
"min" : 1081,
"max" : 1772,
"avg" : 1623
}
},
"jvm" : {
"max_uptime" : "131.9d",
"max_uptime_in_millis" : 11397246426,
"versions" : [
{
"version" : "1.8.0_272",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.272-b10",
"vm_vendor" : "Red Hat, Inc.",
"count" : 25
},
{
"version" : "1.8.0_262",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.262-b10",
"vm_vendor" : "Oracle Corporation",
"count" : 5
},
{
"version" : "1.8.0_282",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.282-b08",
"vm_vendor" : "Red Hat, Inc.",
"count" : 2
},
{
"version" : "1.8.0_275",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.275-b01",
"vm_vendor" : "Red Hat, Inc.",
"count" : 4
}
],
"mem" : {
"heap_used" : "385.5gb",
"heap_used_in_bytes" : 413943007160,
"heap_max" : "1022gb",
"heap_max_in_bytes" : 1097457205248
},
"threads" : 6663
},
"fs" : {
"total" : "62.7tb",
"total_in_bytes" : 68988481654784,
"free" : "49.5tb",
"free_in_bytes" : 54455427350528,
"available" : "49.5tb",
"available_in_bytes" : 54455427350528
},
"plugins" : [
{
"name" : "prometheus-exporter",
"version" : "6.8.7.0",
"elasticsearch_version" : "6.8.7",
"java_version" : "1.8",
"description" : "Export Elasticsearch metrics to Prometheus",
"classname" : "org.elasticsearch.plugin.prometheus.PrometheusExporterPlugin",
"extended_plugins" : [ ],
"has_native_controller" : false
}
],
"network_types" : {
"transport_types" : {
"security4" : 36
},
"http_types" : {
"security4" : 36
}
}
}
}
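For anyone who wants the per-node picture, here is a small sketch that derives a few ratios from the cluster stats output above. It assumes shards are spread evenly across the 31 data nodes, which is only an approximation, and it counts frozen and regular shards together since the stats do not separate them. The heap figures cover all 36 nodes, not just the data nodes.

```python
# Values copied from the cluster stats output above.
TOTAL_SHARDS = 12664
DATA_NODES = 31
STORE_BYTES = 11238806411712
HEAP_USED_BYTES = 413943007160
HEAP_MAX_BYTES = 1097457205248

GIB = 1024 ** 3

# Assumes an even spread of shards over the data nodes.
shards_per_data_node = TOTAL_SHARDS / DATA_NODES

# Average on-disk size per shard copy (primaries and replicas together).
avg_shard_size_gib = STORE_BYTES / TOTAL_SHARDS / GIB

# Share of the configured heap in use across all nodes.
heap_used_fraction = HEAP_USED_BYTES / HEAP_MAX_BYTES

if __name__ == "__main__":
    print(f"shards per data node: {shards_per_data_node:.1f}")
    print(f"average shard size:   {avg_shard_size_gib:.2f} GiB")
    print(f"heap in use:          {heap_used_fraction:.1%}")
```

So we have roughly 400 shards per data node with an average shard size under 1 GiB, which is the many-small-shards situation described earlier in this post.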
Cluster settings:
{
"persistent" : {
"cluster" : {
"routing" : {
"allocation" : {
"awareness" : {
"attributes" : "physical"
},
"enable" : "all",
"node_initial_primaries_recoveries" : "10"
}
}
},
"indices" : {
"recovery" : {
"max_bytes_per_sec" : "80Mb"
}
},
"xpack" : {
"monitoring" : {
"collection" : {
"enabled" : "true"
}
}
}
},
"transient" : {
"cluster" : {
"routing" : {
"allocation" : {
"node_concurrent_recoveries" : "6",
"exclude" : {
"_ip" : ""
}
}
}
},
"indices" : {
"recovery" : {
"max_bytes_per_sec" : "1Gb"
}
}
}
}
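Since both a persistent and a transient value are set for `indices.recovery.max_bytes_per_sec`, here is a small sketch of which values are actually in effect. It relies on transient cluster settings taking precedence over persistent ones; the `flatten` helper is just something written for this illustration, not part of any Elasticsearch client.

```python
def flatten(settings, prefix=""):
    """Turn nested settings into dotted keys, e.g. 'indices.recovery.max_bytes_per_sec'."""
    flat = {}
    for key, value in settings.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Copied from the cluster settings output above.
persistent = {
    "cluster": {"routing": {"allocation": {
        "awareness": {"attributes": "physical"},
        "enable": "all",
        "node_initial_primaries_recoveries": "10",
    }}},
    "indices": {"recovery": {"max_bytes_per_sec": "80Mb"}},
    "xpack": {"monitoring": {"collection": {"enabled": "true"}}},
}
transient = {
    "cluster": {"routing": {"allocation": {
        "node_concurrent_recoveries": "6",
        "exclude": {"_ip": ""},
    }}},
    "indices": {"recovery": {"max_bytes_per_sec": "1Gb"}},
}

# Later dict wins, so transient values override persistent ones.
effective = {**flatten(persistent), **flatten(transient)}

if __name__ == "__main__":
    for key in sorted(effective):
        print(f"{key} = {effective[key]!r}")
```

In other words, recoveries currently run with the transient limit of 1Gb per second and 6 concurrent recoveries per node, and the recovery limit would fall back to the persistent 80Mb after a full cluster restart, since transient settings do not survive one.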
Regards,
Tony