ES cluster goes into red frequently

Current version of ES = 5.1.1

When the cluster goes into the red state, below are the logs:

Master server logs
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [elsdata09][10.10.0.102:9200][cluster:monitor/nodes/stats[n]] disconnected

Data host logs
2019-10-31T04:57:26,790 _boss][T#50] [W] rg.ela.clu.act.sha.ShardStateAction - [UID=] - [test-index][9] no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[test-index][9]], allocation id [8ty2aNbFQgS_SQ5-PA4KDQ], primary term [116], message [failed to perform indices:data/write/bulk[s] on replica [test-index][9], node[I-c1LgUZQQKG6B2NYb70Wg], [R], s[STARTED], a[id=8ty2aNbFQgS_SQ5-PA4KDQ]], failure [RemoteTransportException[[elsdata07][10.10.0.100:9200][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[active primary shard cannot be a replication target before relocation hand off [test-index][9], node[I-c1LgUZQQKG6B2NYb70Wg], [P], s[STARTED], a[id=8ty2aNbFQgS_SQ5-PA4KDQ], state is [STARTED]]; ]]
2019-10-31T04:57:29,419 nect]][T#59] [W] org.ela.dis.zen.UnicastZenPing - [UID=] - [22] failed send ping to {#zen_unicast_65#}{INV4EcdPSh2kNR2IKKYvVA}{elsmaster02}{10.10.0.51:9200}
java.lang.IllegalStateException: handshake failed with {#zen_unicast_65#}{INV4EcdPSh2kNR2IKKYvVA}{elsmaster02}{10.10.0.51:9200}
at org.elasticsearch.transport.TransportService.handshake(TransportService.java:370) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportService.connectToNodeLightAndHandshake(TransportService.java:345) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.transport.TransportService.connectToNodeLightAndHandshake(TransportService.java:319) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.discovery.zen.UnicastZenPing$2.run(UnicastZenPing.java:473) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

Caused by: org.elasticsearch.transport.NodeDisconnectedException: [10.10.0.51:9200][internal:transport/handshake] disconnected
2019-10-31T09:12:20,825 teTask][T#1] [W] org.ela.dis.zen.ZenDiscovery - [UID=] - master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
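The "tried [3] times, each with maximum [30s] timeout" in that message corresponds to the Zen fault-detection defaults; for reference, these are the settings involved (a sketch of the default values, which we do not appear to have overridden):

discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3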

=================================================================================

I could see that this index holds the biggest size among the other indices in the cluster:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
red open test-index YA33n2zsS_GAlNsWvGhMKA 20 1 363067962 242803982 1006.6gb 508.1gb

and below are the reasons the shards are unassigned:
test-index 19 p UNASSIGNED ALLOCATION_FAILED
test-index 19 r UNASSIGNED NODE_LEFT
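To dig further into the ALLOCATION_FAILED reason I can query the allocation explain API, roughly like this (a sketch, assuming the HTTP endpoint is reachable on localhost:9200):

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -d '{
  "index": "test-index",
  "shard": 19,
  "primary": true
}'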

But all the nodes remain in the cluster when verified with _cat/nodes, and when I execute _cluster/reroute?retry_failed=true the shards get allocated and the cluster becomes green (the call is sketched below).
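A sketch of that retry call (host/port assumed):

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true&pretty'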

The refresh interval for this index is 30s and there is heavy indexing on the cluster.

discovery.zen.minimum_master_nodes is set to 2, and there are 8 data nodes in the cluster.

Please help, as the cluster is frequently going red.

Hi,

Just to get some more info: how many master-eligible nodes do you have in the cluster? The only number I can see is that minimum_master_nodes is set to 2.

Some more config snippets could help as well. Could you e.g. paste the whole discovery section of your config?

@A_B
discovery.zen.ping.unicast.hosts: ["elsmaster01:9200", "elsmaster02:9200", "elsmaster03:9200"]

discovery.zen.minimum_master_nodes: 2
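(For reference: with 3 master-eligible nodes the recommended value is floor(3 / 2) + 1 = 2, so this setting matches the usual guideline.)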

Nodes connect internally on port 9300, not 9200 as you have configured.

The error message indicates that 9200 is configured somewhere for inter-node communication. Can you share your config?

http.port: 9300
transport.tcp.port: 9200
node.ingest: false

discovery.zen.ping.unicast.hosts: ["elsmaster01:9200", "elsmaster02:9200", "elsmaster03:9200"]
discovery.zen.minimum_master_nodes: 2

thread_pool.bulk.queue_size: 300
indices.fielddata.cache.size: 25%
bootstrap.memory_lock: true
http.cors.enabled: false

Below is for the master nodes, and vice versa for the data nodes:
node.master: true
node.data: false

OK. You have for some reason swapped the port numbers...
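For comparison, the conventional layout would look something like this (a sketch only; if you swap the ports back, the unicast hosts need to point at the transport port as well):

http.port: 9200
transport.tcp.port: 9300
discovery.zen.ping.unicast.hosts: ["elsmaster01:9300", "elsmaster02:9300", "elsmaster03:9300"]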

@Christian_Dahlqvist
Is this issue because of heavy indexing on the cluster or because of the one large index?
Or do I need to add/adjust any parameters related to discovery.zen?

Do you have anything in the logs about long GC? How much heap do you have configured?

@Christian_Dahlqvist RAM on each data node is 128G and the heap size is 30G.
We are not logging GC status to a file; the parameter below is not set:
-Xloggc:${loggc}
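If we want GC visibility, a sketch of the JDK 8 GC-logging flags for jvm.options (the log path is a placeholder):

-Xloggc:/var/log/elasticsearch/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=32
-XX:GCLogFileSize=64m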

@Christian_Dahlqvist
Grafana shows the below young-to-old GC ratio, which is in seconds vs. msec.

How much data do you have in the cluster? What is the full output of the cluster stats API?
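Something like this would do (a sketch; host/port is an assumption on my side):

curl -s 'localhost:9200/_cluster/stats?human&pretty'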

@Christian_Dahlqvist

output:
"status" : "green",
"indices" : {
"count" : 38,
"shards" : {
"total" : 818,
"primaries" : 409,
"replication" : 1.0,
"index" : {
"shards" : {
"min" : 2,
"max" : 50,
"avg" : 21.526315789473685
},
"primaries" : {
"min" : 1,
"max" : 25,
"avg" : 10.763157894736842
},
"replication" : {
"min" : 1.0,
"max" : 1.0,
"avg" : 1.0
}
}
},
"docs" : {
"count" : 699723935,
"deleted" : 419252483
},
"store" : {
"size" : "1.7tb",
"size_in_bytes" : 1947954897796,
"throttle_time" : "0s",
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"query_cache" : {
"memory_size" : "0b",
"memory_size_in_bytes" : 0,
"total_count" : 11069543,
"hit_count" : 2096614,
"miss_count" : 8972929,
"cache_size" : 0,
"cache_count" : 47989,
"evictions" : 47989
},
"completion" : {
"size" : "0b",
"size_in_bytes" : 0
},
"segments" : {
"count" : 10627,
"memory" : "1.7gb",
"memory_in_bytes" : 1840137781,
"terms_memory" : "891.3mb",
"terms_memory_in_bytes" : 934695361,
"stored_fields_memory" : "103.5mb",
"stored_fields_memory_in_bytes" : 108586152,
"term_vectors_memory" : "0b",
"term_vectors_memory_in_bytes" : 0,
"norms_memory" : "35.2mb",
"norms_memory_in_bytes" : 36994816,
"points_memory" : "366.6mb",
"points_memory_in_bytes" : 384416280,
"doc_values_memory" : "358mb",
"doc_values_memory_in_bytes" : 375445172,
"index_writer_memory" : "3.5gb",
"index_writer_memory_in_bytes" : 3780304009,
"version_map_memory" : "270.1kb",
"version_map_memory_in_bytes" : 276584,
"fixed_bit_set" : "252.9mb",
"fixed_bit_set_memory_in_bytes" : 265277832,
"max_unsafe_auto_id_timestamp" : -1,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 11,
"data" : 8,
"coordinating_only" : 0,
"master" : 3,
"ingest" : 0
},
"versions" : [
"5.1.1"
],
"os" : {
"available_processors" : 268,
"allocated_processors" : 268,
"names" : [
{
"name" : "Linux",
"count" : 11
}
],
"mem" : {
"total" : "1tb",
"total_in_bytes" : 1102945484800,
"free" : "39.5gb",
"free_in_bytes" : 42438520832,
"used" : "987.6gb",
"used_in_bytes" : 1060506963968,
"free_percent" : 4,
"used_percent" : 96
}
},
"process" : {
"cpu" : {
"percent" : 217
},
"open_file_descriptors" : {
"min" : 554,
"max" : 2103,
"avg" : 1615
}
},
"jvm" : {
"max_uptime" : "37.2d",
"max_uptime_in_millis" : 3222214849,
"versions" : [
{
"version" : "1.8.0_212",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.212-b03",
"vm_vendor" : "AdoptOpenJDK",
"count" : 11
}
],
"mem" : {
"heap_used" : "105gb",
"heap_used_in_bytes" : 112805252816,
"heap_max" : "244.2gb",
"heap_max_in_bytes" : 262254886912
},
"threads" : 3045
},
"fs" : {
"total" : "14.2tb",
"total_in_bytes" : 15668896378880,
"free" : "12.4tb",
"free_in_bytes" : 13656508588032,
"available" : "12.4tb",
"available_in_bytes" : 13656508588032,
"spins" : "true"
},
"plugins" : [
{
"name" : "elasticsearch-monitoring",
"version" : "5.1.1",
"description" : "Elasticsearch Monitoring Plugin",
"classname" : "com.qualys.elasticsearch.plugins.monitoring.MonitoringPlugin"
},
{
"name" : "elasticsearch-http-basic",
"version" : "5.1.1",
"description" : "Elasticsearch Http Basic Auth Plugin",
"classname" : "com.qualys.elasticsearch.plugins.http.HttpBasicAuthPlugin"
}
],
"network_types" : {
"transport_types" : {
"netty4" : 11
},
"http_types" : {
"netty4" : 11
}
}
}
}

@Christian_Dahlqvist @DavidTurner could you please help?

I do not see anything obviously wrong, but you are running an old version together with some plugins I have not seen before, so I am not sure to what extent that could contribute.

Please don't ping me like that. I'm not paying attention to this thread, at least partly because it's about a version that is so far past the end of its supported life.

A couple of data nodes went out of the cluster now with the below error:

2019-11-01T12:58:08,965 eric][T#423] [W] res.suppressed - [UID=] - path: /_cluster/health, params: {pretty=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null

org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:161) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.admin.indices.cache.clear.TransportClearIndicesCacheAction.checkGlobalBlock(TransportClearIndicesCacheAction.java:133) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.admin.indices.cache.clear.TransportClearIndicesCacheAction.checkGlobalBlock(TransportClearIndicesCacheAction.java:48) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.<init>(TransportBroadcastByNodeAction.java:256) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:234) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction.doExecute(TransportBroadcastByNodeAction.java:79) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:173) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.ingest.IngestProxyActionFilter.apply(IngestProxyActionFilter.java:79) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:171) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:145) ~[elasticsearch-5.1.1.jar:5.1.1]

I would recommend uninstalling the custom plugins to see how that affects the situation.
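Roughly like this (a sketch; the plugin names are taken from the cluster stats above, and each node needs a restart after removal):

bin/elasticsearch-plugin remove elasticsearch-monitoring
bin/elasticsearch-plugin remove elasticsearch-http-basic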

Acknowledged! Thanks.

Sure, I will do that. Also, do you suspect an issue with the capacity of the master/data nodes?

Specs:
Master nodes (each)
RAM - 8G
HEAPSIZE - 3.9G
CPU - 4

Data node (each)
RAM - 128 G
HEAPSIZE - 30G
CPU - 32