ES cluster behavior is abnormal

Hello @Christian_Dahlqvist

Dear Christian,
In our production cluster, some of the indices look abnormal: an index first shows GREEN state and then, after some time, shows RED state. How is this possible?

ex.
green open imiconnect_inf_mo-2020-02-19 mp0-ITIHSg-1fbcP8aSzIQ 1 1 433229 0 455mb 229.1mb

red open imiconnect_inf_mo-2020-02-19 mp0-ITIHSg-1fbcP8aSzIQ 1 1

If I delete that index, some other indices go from GREEN to RED state.

I am completely confused by this. Could you please help us?

Please do not ping people not already involved in the thread. This forum is manned by volunteers.

I would guess that you have a problem with either cluster configuration and/or the underlying hardware/storage.

How large is your cluster? What type of hardware is it deployed on? What type of storage are you using? Which Elasticsearch version are you using? How are the nodes configured? Are there any error messages or clues in the logs?

1 Like

Hello Christian

Thanks for your quick reply,
Kindly check the info below.
How large is your cluster?

We have a 7-node cluster: 2 coordinating nodes, 3 master nodes, and 2 data nodes.

{
"cluster_name": "IMIConnectProductionCluster",
"status": "red",
"timed_out": false,
"number_of_nodes": 7,
"number_of_data_nodes": 2,
"active_primary_shards": 1349,
"active_shards": 2698,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 120,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 95.7416607523066
}
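For reference, the `active_shards_percent_as_number` in that output can be reproduced from the other counters. A minimal Python sketch, assuming the percentage is simply active shards over active plus unassigned plus initializing plus relocating shards:

```python
# Reproduce active_shards_percent_as_number from the health output above.
# Assumption: percent = active / (active + unassigned + initializing + relocating).
health = {
    "active_shards": 2698,
    "unassigned_shards": 120,
    "initializing_shards": 0,
    "relocating_shards": 0,
}

total = (health["active_shards"] + health["unassigned_shards"]
         + health["initializing_shards"] + health["relocating_shards"])
percent = 100.0 * health["active_shards"] / total
print(round(percent, 10))  # ~95.7416607523, matching the API output
```

The 120 unassigned shards out of 2818 are what hold the cluster below 100% (and, since the status is red, at least one of them is a primary).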

What type of hardware is it deployed on?

What type of storage are you using?

Which Elasticsearch version are you using?
ES Version : 5.6.4

How are the nodes configured?
The ES cluster has 2 coordinating, 3 master, and 2 data nodes

Are there any error messages or clues in the logs?
org.elasticsearch.transport.RemoteTransportException: [es-master-1][10.0.123.137:9300][indices:admin/delete]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (delete-index [[imiconnect_inf_mo-2020-02-19/mp0-ITIHSg-1fbcP8aSzIQ]]) within 30s
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.6.4.jar:5.6.4]

Hello Christian

We have deployed the ES cluster across two data centres.

Data centre-A: Coord-1, Master-1, Master-2, Datanode-1
Data centre-B: Coord-2, Master-3, Datanode-2

What does the Elasticsearch.yml file for the master nodes look like? How far apart are the two data centres? What is the latency between them? What type of hardware and storage is used?

Elasticsearch.yml
cluster.name: "IMIConnectProductionCluster"
node.name: "es-master-1"
node.master: true
node.data: false
path.data: "/apps/ES/data"
path.logs: "/apps/ES/logs/"
discovery.zen.ping.unicast.hosts: ["localhost-1","localhost-2","localhost-3"]
discovery.zen.minimum_master_nodes: 2
network.host: localhost-1
http.port: 9200
gateway.recover_after_nodes: 1
bootstrap.system_call_filter: false

There is very little latency between them:
-sh-4.1$ ping 192.168.67.24
PING 192.168.67.24 (192.168.67.24) 56(84) bytes of data.
64 bytes from 192.168.67.24: icmp_seq=1 ttl=61 time=2.44 ms
64 bytes from 192.168.67.24: icmp_seq=2 ttl=61 time=2.45 ms

Hardware:
-sh-4.1$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Stepping: 1
CPU MHz: 2700.000
BogoMIPS: 5400.00
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,1

Is there anything in the logs of the data nodes around the index that went to red status?

Master node logs:
[2020-03-20T07:58:26,662][WARN ][o.e.c.a.s.ShardStateAction] [es-master-1] [imiconnect_voice_trans-2020-03-20][0] received shard failed for shard id [[imiconnect_voice_trans-2020-03-20][0]], allocation id [Z9xZfm09R1SMND0vui3EIw], primary term [1], message [mark copy as stale]

Data node logs:
Caused by: java.lang.IllegalStateException: try to recover [imiconnect_inf_mt-2020-03-19][0] from primary shard with sync id but number of docs differ: 508203 (es-data-2, primary) vs 508181(es-data-1)

Master logs:

[2020-03-20T08:42:28,627][WARN ][o.e.c.s.ClusterService ] [es-master-1] cluster state update task [put-mapping[ump_cs_agg_hr-2020-03-20-08]] took [36.2s] above the warn threshold of 30s
[2020-03-20T08:42:41,842][INFO ][o.e.m.j.JvmGcMonitorService] [es-master-1] [gc][94927] overhead, spent [350ms] collecting in the last [1s]
[2020-03-20T08:43:12,998][INFO ][o.e.c.m.MetaDataMappingService] [es-master-1] [ump_notifications_agg_hr/x-gGZJPoRouuSRAVKj6cUg] create_mapping [ump_notifications_agg_hr-2020-03-20-08]
[2020-03-20T08:43:13,757][WARN ][o.e.c.s.ClusterService ] [es-master-1] cluster state update task [put-mapping[ump_notifications_agg_hr-2020-03-20-08]] took [33.7s] above the warn threshold of 30s

Hello @Christian_Dahlqvist, @DavidTurner

I deleted the index (imiconnect_apnp_trans_log-2020-02-26) because it was in RED state.

Then this index (imiconnect_chat_messages_logs-2020-01-30) went to RED state; earlier it was GREEN. Can you please tell us what is happening?

This is a very urgent production issue; please help me fix it.

Kindly check these logs.

[2020-03-20T08:55:38,463][INFO ][o.e.c.m.MetaDataDeleteIndexService] [es-master-1] [imiconnect_apnp_trans_log-2020-02-26/CVNriLFrRt6jEe7V7YPOVQ] deleting index

[2020-03-20T08:55:44,211][INFO ][o.e.g.LocalAllocateDangledIndices] [es-master-1] auto importing dangled indices [[imiconnect_chat_messages_logs-2020-01-30/N4vcNm-mQy6ZWaBjxPfGvg]/OPEN] from [{es-coord-1}{ShbcqaRsSLKbJTNPybfRXA}{Qou2A3JbT-KcuxAHo8_v6Q}{10.0.123.136}{10.0.123.136:9300}]

As I stated earlier this forum is manned by volunteers. Do not ping people not already involved in the thread. This also means there are no SLAs and not even a guarantee of response.

I would recommend the following:

  • Make sure your VMs are not overprovisioned so Elasticsearch has access to the cores allocated
  • Make sure your VMs do not use memory ballooning so Elasticsearch actually has access to the RAM it thinks it has
  • Your cluster state updates seem to be slow, which could be caused by the factors above. You also have more shards than recommended, so I would recommend reducing this
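To put the shard-count point in rough numbers, here is a quick sketch based on the health output earlier in the thread. The ~20-shards-per-GB-of-heap figure is the commonly cited Elastic guideline, and the 16 GB heap is only an assumed example, not something stated in this thread:

```python
# Rough shard-per-node math for the cluster described above (2 data nodes).
active_shards = 2698
unassigned = 120
data_nodes = 2

shards_per_node = (active_shards + unassigned) / data_nodes
print(shards_per_node)  # 1409.0 shards per data node once everything is assigned

# Commonly cited guideline: at most ~20 shards per GB of JVM heap.
# Assuming e.g. a 16 GB heap, that would suggest staying below ~320 shards per node.
max_recommended = 20 * 16
print(shards_per_node > max_recommended)  # True -> far above the guideline
```

At roughly 1400 shards per data node, every cluster state update has to account for a very large shard table, which is consistent with the slow `put-mapping` and `delete-index` tasks in the logs.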
1 Like

When active_shards_percent_as_number reaches 100%, it means the cluster has come back to GREEN state.

{
"cluster_name": "IMIConnectProductionCluster",
"status": "red",
"timed_out": false,
"number_of_nodes": 7,
"number_of_data_nodes": 2,
"active_primary_shards": 1349,
"active_shards": 2698,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 120,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 95.7416607523066
}

Should I change any configuration other than the VM settings?

I do not know what the cause is yet, so I cannot tell.

Earlier we had two data nodes in zone-A.
Then we added another node in zone-B.
After that we removed the second node from zone-A.

At that time there was no issue with the cluster; health was green.
This issue started about one week ago.

Can you guess anything from this?

And why, if I delete one index, does another index immediately come to RED state?
I am very shocked by this situation.

You are also running an old version, so I would recommend upgrading as newer versions have improved resiliency significantly.

What you are seeing is not normal, but I am not sure what is wrong with your cluster setup.

Hi,
I observed that the file permissions in the indices directories differ between the two data nodes.
Is this okay, or did we make a mistake somewhere?

Datanode-1

Datanode-2
[screenshot of the indices directory listing]

Can you please help us on this.

If I delete this index (index-2020-03-17), then this index (index-2020-03-01) is immediately created automatically in RED state.

A few indices that were not present in the cluster earlier are being created automatically in RED state.

Logs:

[2020-03-20T13:38:49,613][INFO ][o.e.c.m.MetaDataDeleteIndexService] [es-master-1] [imiconnect_chat_messages_logs-2020-01-29/B7vyDb0zSiWbnB9ZrXwQGg] deleting index

[2020-03-20T13:38:50,662][INFO ][o.e.g.LocalAllocateDangledIndices] [es-master-1] auto importing dangled indices [[imiconnect_chat_messages_logs-2020-02-02/2WTBuIS6RNWY3tTujzoTtA]/OPEN] from [{es-coord-1}{ShbcqaRsSLKbJTNPybfRXA}{Qou2A3JbT-KcuxAHo8_v6Q}{10.0.123.136}{10.0.123.136:9300}]

Any guess on this?

That does look suspect.
