ES cluster goes into red frequently

Current version of ES = 5.1.1

When the cluster goes into the red state, below are the logs:

Master server logs
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [elsdata09][10.10.0.102:9200][cluster:monitor/nodes/stats[n]] disconnected

Data host logs
2019-10-31T04:57:26,790 _boss][T#50] [W] rg.ela.clu.act.sha.ShardStateAction - [UID=] - [test-index][9] no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[test-index][9]], allocation id [8ty2aNbFQgS_SQ5-PA4KDQ], primary term [116], message [failed to perform indices:data/write/bulk[s] on replica [test-index][9], node[I-c1LgUZQQKG6B2NYb70Wg], [R], s[STARTED], a[id=8ty2aNbFQgS_SQ5-PA4KDQ]], failure [RemoteTransportException[[elsdata07][10.10.0.100:9200][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [test-index][9], node[I-c1LgUZQQKG6B2NYb70Wg], [P], s[STARTED], a[id=8ty2aNbFQgS_SQ5-PA4KDQ], state is [STARTED]]; ]]
2019-10-31T04:57:29,419 nect]][T#59] [W] org.ela.dis.zen.UnicastZenPing - [UID=] - [22] failed send ping to {#zen_unicast_65#}{INV4EcdPSh2kNR2IKKYvVA}{elsmaster02}{10.10.0.51:9200}
java.lang.IllegalStateException: handshake failed with {#zen_unicast_65#}{INV4EcdPSh2kNR2IKKYvVA}{elsmaster02}{10.10.0.51:9200}
at org.elasticsearch.transport.TransportService.handshake(TransportService.java:370) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportService.connectToNodeLightAndHandshake(TransportService.java:345) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportService.connectToNodeLightAndHandshake(TransportService.java:319) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.discovery.zen.UnicastZenPing$2.run(UnicastZenPing.java:473) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

Caused by: org.elasticsearch.transport.NodeDisconnectedException: [10.10.0.51:9200][internal:transport/handshake] disconnected
2019-10-31T09:12:20,825 teTask][T#1] [W] org.ela.dis.zen.ZenDiscovery - [UID=] - master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
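The "tried [3] times, each with maximum [30s] timeout" in that message corresponds to the Zen fault-detection defaults; for reference, these are the settings involved (a sketch of the default values, which we do not appear to have overridden):

discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3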

=================================================================================

I could see that this index holds the biggest size among the other indices in the cluster:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
red open test-index YA33n2zsS_GAlNsWvGhMKA 20 1 363067962 242803982 1006.6gb 508.1gb

and below are the reasons the shards are unassigned:
test-index 19 p UNASSIGNED ALLOCATION_FAILED
test-index 19 r UNASSIGNED NODE_LEFT
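To dig further into the ALLOCATION_FAILED reason I can query the allocation explain API, roughly like this (a sketch, assuming the HTTP endpoint is reachable on localhost:9200):

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -d '{
  "index": "test-index",
  "shard": 19,
  "primary": true
}'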

But all the nodes remain in the cluster when verified with _cat/nodes, and when I execute _cluster/reroute?retry_failed=true the shards get allocated and the cluster becomes green (the call is sketched below).
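A sketch of that retry call (host/port assumed):

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'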

The refresh interval for this index is 30s and there is heavy indexing on the cluster.

discovery.zen.minimum_master_nodes is set to 2, and there are 8 data nodes in the cluster.

Please help, as the cluster is frequently going red.

Hi,

Just to get some more info: how many master-eligible nodes do you have in the cluster? The only number I can see is that minimum_master_nodes is set to 2.

Some more config snippets could help as well. Could you e.g. paste the whole discovery section of your config?

@A_B
discovery.zen.ping.unicast.hosts: ["elsmaster01:9200", "elsmaster02:9200", "elsmaster03:9200"]

discovery.zen.minimum_master_nodes: 2
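(For reference: with 3 master-eligible nodes the recommended value is floor(3 / 2) + 1 = 2, so this setting matches the usual guideline.)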

Nodes connect internally on port 9300, not 9200 as you have configured.

The error message indicates that 9200 is configured somewhere for inter-node communication. Can you share your config?

http.port: 9300
transport.tcp.port: 9200
node.ingest: false

discovery.zen.ping.unicast.hosts: ["elsmaster01:9200", "elsmaster02:9200", "elsmaster03:9200"]
discovery.zen.minimum_master_nodes: 2

thread_pool.bulk.queue_size: 300
indices.fielddata.cache.size: 25%
bootstrap.memory_lock: true
http.cors.enabled: false

Below is for the master nodes, and vice versa for the data nodes:
node.master: true
node.data: false

OK. You have for some reason swapped the port numbers...
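For comparison, the conventional layout would look something like this (a sketch only; if you swap the ports back, the unicast hosts need to point at the transport port as well):

http.port: 9200
transport.tcp.port: 9300
discovery.zen.ping.unicast.hosts: ["elsmaster01:9300", "elsmaster02:9300", "elsmaster03:9300"]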

@Christian_Dahlqvist
Is this issue because of heavy indexing on the cluster or because of the one large index?
Or do I need to add/adjust any parameters related to discovery.zen?

Do you have anything in the logs about long GC? How much heap do you have configured?

@Christian_Dahlqvist RAM on each data node is 128G and the heap size is 30G.
We are not logging GC status to a file; the parameter below is not set:
-Xloggc:${loggc}
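If we want GC visibility, a sketch of the JDK 8 GC-logging flags for jvm.options (the log path is a placeholder):

-Xloggc:/var/log/elasticsearch/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=32
-XX:GCLogFileSize=64m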

@Christian_Dahlqvist
Grafana shows the below young-to-old GC ratio, which is in seconds vs. msec.

How much data do you have in the cluster? What is the full output of the cluster stats API?
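Something like this would do (a sketch; host/port is an assumption on my side):

curl -s 'localhost:9200/_cluster/stats?human&pretty'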

@Christian_Dahlqvist

output:
"status" : "green",
"indices" : {
"count" : 38,
"shards" : {
"total" : 818,
"primaries" : 409,
"replication" : 1.0,
"index" : {
"shards" : {
"min" : 2,
"max" : 50,
"avg" : 21.526315789473685
},
"primaries" : {
"min" : 1,
"max" : 25,
"avg" : 10.763157894736842
},
"replication" : {
"min" : 1.0,
"max" : 1.0,
"avg" : 1.0
}
}
},
"docs" : {
"count" : 699723935,
"deleted" : 419252483
},
"store" : {
"size" : "1.7tb",
"size_in_bytes" : 1947954897796,
"throttle_time" : "0s",
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"total_count" : 11069543,
"hit_count" : 2096614,
"miss_count" : 8972929,
"cache_size" : 0,
"cache_count" : 47989,
"evictions" : 47989
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 10627,
"memory" : "1.7gb",
"memory_in_bytes" : 1840137781,
"terms_memory" : "891.3mb",
"terms_memory_in_bytes" : 934695361,
"stored_fields_memory" : "103.5mb",
"stored_fields_memory_in_bytes" : 108586152,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "35.2mb",
"norms_memory_in_bytes" : 36994816,
"points_memory" : "366.6mb",
"points_memory_in_bytes" : 384416280,
"doc_values_memory" : "358mb",
"doc_values_memory_in_bytes" : 375445172,
"index_writer_memory" : "3.5gb",
"index_writer_memory_in_bytes" : 3780304009,
"version_map_memory" : "270.1kb",
"version_map_memory_in_bytes" : 276584,
"fixed_bit_set" : "252.9mb",
"fixed_bit_set_memory_in_bytes" : 265277832,
"max_unsafe_auto_id_timestamp" : -1,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 11,
"data" : 8,
"coordinating_only" : 0,
"master" : 3,
"ingest" : 0
},
"versions" : [
"5.1.1"
],
"os" : {
"available_processors" : 268,
"allocated_processors" : 268,
"names" : [
{
"name" : "Linux",
"count" : 11
}
],
"mem" : {
"total" : "1tb",
"total_in_bytes" : 1102945484800,
"free" : "39.5gb",
"free_in_bytes" : 42438520832,
"used" : "987.6gb",
"used_in_bytes" : 1060506963968,
"free_percent" : 4,
"used_percent" : 96
}
},
"process" : {
"cpu" : {
"percent" : 217
},
"open_file_descriptors" : {
"min" : 554,
"max" : 2103,
"avg" : 1615
}
},
"jvm" : {
"max_uptime" : "37.2d",
"max_uptime_in_millis" : 3222214849,
"versions" : [
{
"version" : "1.8.0_212",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.212-b03",
"vm_vendor" : "AdoptOpenJDK",
"count" : 11
}
],
"mem" : {
"heap_used" : "105gb",
"heap_used_in_bytes" : 112805252816,
"heap_max" : "244.2gb",
"heap_max_in_bytes" : 262254886912
},
"threads" : 3045
},
"fs" : {
"total" : "14.2tb",
"total_in_bytes" : 15668896378880,
"free" : "12.4tb",
"free_in_bytes" : 13656508588032,
"available" : "12.4tb",
"available_in_bytes" : 13656508588032,
"spins" : "true"
},
"plugins" : [
{
"name" : "elasticsearch-monitoring",
"version" : "5.1.1",
"description" : "Elasticsearch Monitoring Plugin",
"classname" : "com.qualys.elasticsearch.plugins.monitoring.MonitoringPlugin"
},
{
"name" : "elasticsearch-http-basic",
"version" : "5.1.1",
"description" : "Elasticsearch Http Basic Auth Plugin",
"classname" : "com.qualys.elasticsearch.plugins.http.HttpBasicAuthPlugin"
}
],
"network_types" : {
"transport_types" : {
"netty4" : 11
},
"http_types" : {
"netty4" : 11
}
}
}
}

@Christian_Dahlqvist @DavidTurner could you please help?

I do not see anything obviously wrong, but you are running an old version together with some plugins I have not seen before, so I am not sure to what extent that could contribute.

Please don't ping me like that. I'm not paying attention to this thread, at least partly because it's about a version that is so far past the end of its supported life.

A couple of data nodes went out of the cluster now with the below error:

2019-11-01T12:58:08,965 eric][T#423] [W] res.suppressed - [UID=] - path: /_cluster/health, params: {pretty=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null

org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:161) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.admin.indices.cache.clear.TransportClearIndicesCacheAction.checkGlobalBlock(TransportClearIndicesCacheAction.java:133) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.admin.indices.cache.clear.TransportClearIndicesCacheAction.checkGlobalBlock(TransportClearIndicesCacheAction.java:48) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.<init>(TransportBroadcastByNodeAction.java:256) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:234) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:79) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:173) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.ingest.IngestProxyActionFilter.apply(IngestProxyActionFilter.java:79) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:171) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:145) ~[elasticsearch-5.1.1.jar:5.1.1]

I would recommend uninstalling the custom plugins to see how that affects the situation.
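Roughly like this (a sketch; the plugin names are taken from the cluster stats above, and each node needs a restart after removal):

bin/elasticsearch-plugin remove elasticsearch-monitoring
bin/elasticsearch-plugin remove elasticsearch-http-basic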

Acknowledged! Thanks.

Sure, I will do that. Also, do you suspect an issue with the capacity of the master/data nodes?

Specs:
Master nodes (each)
RAM - 8G
HEAPSIZE - 3.9G
CPU - 4

Data node (each)
RAM - 128 G
HEAPSIZE - 30G
CPU - 32