ES cluster behavior is abnormal

Hello @Christian_Dahlqvist

Dear Christian,
In our production cluster, some of the indices look abnormal: an index first shows GREEN state and then, after some time, shows RED state. How is this possible?

ex.
green open imiconnect_inf_mo-2020-02-19 mp0-ITIHSg-1fbcP8aSzIQ 1 1 433229 0 455mb 229.1mb

red open imiconnect_inf_mo-2020-02-19 mp0-ITIHSg-1fbcP8aSzIQ 1 1

If I delete that index, some other indices go from GREEN to RED state.

I am completely confused by this. Could you please help us?

Please do not ping people not already involved in the thread. This forum is manned by volunteers.

I would guess that you have a problem with either cluster configuration and/or the underlying hardware/storage.

How large is your cluster? What type of hardware is it deployed on? What type of storage are you using? Which Elasticsearch version are you using? How are the nodes configured? Are there any error messages or clues in the logs?

1 Like

Hello Christian

Thanks for your quick reply,
Kindly check the info below.
How large is your cluster?

We have a 7-node cluster: 2 coordinating nodes, 3 master nodes, and 2 data nodes.

{
"cluster_name": "IMIConnectProductionCluster",
"status": "red",
"timed_out": false,
"number_of_nodes": 7,
"number_of_data_nodes": 2,
"active_primary_shards": 1349,
"active_shards": 2698,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 120,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 95.7416607523066
}
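For reference, the `active_shards_percent_as_number` in that output can be reproduced from the other counters. A minimal Python sketch, assuming the percentage is simply active shards over active plus unassigned plus initializing plus relocating shards:

```python
# Reproduce active_shards_percent_as_number from the health output above.
# Assumption: percent = active / (active + unassigned + initializing + relocating).
health = {
    "active_shards": 2698,
    "unassigned_shards": 120,
    "initializing_shards": 0,
    "relocating_shards": 0,
}

total = (health["active_shards"] + health["unassigned_shards"]
         + health["initializing_shards"] + health["relocating_shards"])
percent = 100.0 * health["active_shards"] / total
print(round(percent, 10))  # ~95.7416607523, matching the API output
```

The 120 unassigned shards out of 2818 are what hold the cluster below 100% (and, since the status is red, at least one of them is a primary).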

What type of hardware is it deployed on?

What type of storage are you using?

Which Elasticsearch version are you using?
ES Version : 5.6.4

How are the nodes configured?
The ES cluster has 2 coordinating, 3 master, and 2 data nodes

Are there any error messages or clues in the logs?
org.elasticsearch.transport.RemoteTransportException: [es-master-1][10.0.123.137:9300][indices:admin/delete]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (delete-index [[imiconnect_inf_mo-2020-02-19/mp0-ITIHSg-1fbcP8aSzIQ]]) within 30s
at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.6.4.jar:5.6.4]

Hello Christian

We have deployed the ES cluster across two data centres.

Data centre-A: Coord-1, Master-1, Master-2, Datanode-1
Data centre-B: Coord-2, Master-3, Datanode-2

What does the Elasticsearch.yml file for the master nodes look like? How far apart are the two data centres? What is the latency between them? What type of hardware and storage is used?

Elasticsearch.yml
cluster.name: "IMIConnectProductionCluster"
node.name: "es-master-1"
node.master: true
node.data: false
path.data: "/apps/ES/data"
path.logs: "/apps/ES/logs/"
discovery.zen.ping.unicast.hosts: ["localhost-1","localhost-2","localhost-3"]
discovery.zen.minimum_master_nodes: 2
network.host: localhost-1
http.port: 9200
gateway.recover_after_nodes: 1
bootstrap.system_call_filter: false

There is very little latency between them:
-sh-4.1$ ping 192.168.67.24
PING 192.168.67.24 (192.168.67.24) 56(84) bytes of data.
64 bytes from 192.168.67.24: icmp_seq=1 ttl=61 time=2.44 ms
64 bytes from 192.168.67.24: icmp_seq=2 ttl=61 time=2.45 ms

Hardware:
-sh-4.1$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 37
Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Stepping: 1
CPU MHz: 2700.000
BogoMIPS: 5400.00
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0,1

Is there anything in the logs of the data nodes around the index that went to red status?

Master node logs:
[2020-03-20T07:58:26,662][WARN ][o.e.c.a.s.ShardStateAction] [es-master-1] [imiconnect_voice_trans-2020-03-20][0] received shard failed for shard id [[imiconnect_voice_trans-2020-03-20][0]], allocation id [Z9xZfm09R1SMND0vui3EIw], primary term [1], message [mark copy as stale]

Data node logs:
Caused by: java.lang.IllegalStateException: try to recover [imiconnect_inf_mt-2020-03-19][0] from primary shard with sync id but number of docs differ: 508203 (es-data-2, primary) vs 508181(es-data-1)

Master logs:

[2020-03-20T08:42:28,627][WARN ][o.e.c.s.ClusterService ] [es-master-1] cluster state update task [put-mapping[ump_cs_agg_hr-2020-03-20-08]] took [36.2s] above the warn threshold of 30s
[2020-03-20T08:42:41,842][INFO ][o.e.m.j.JvmGcMonitorService] [es-master-1] [gc][94927] overhead, spent [350ms] collecting in the last [1s]
[2020-03-20T08:43:12,998][INFO ][o.e.c.m.MetaDataMappingService] [es-master-1] [ump_notifications_agg_hr/x-gGZJPoRouuSRAVKj6cUg] create_mapping [ump_notifications_agg_hr-2020-03-20-08]
[2020-03-20T08:43:13,757][WARN ][o.e.c.s.ClusterService ] [es-master-1] cluster state update task [put-mapping[ump_notifications_agg_hr-2020-03-20-08]] took [33.7s] above the warn threshold of 30s

Hello @Christian_Dahlqvist, @DavidTurner

I deleted the index (imiconnect_apnp_trans_log-2020-02-26) because it was in RED state.

Then this index (imiconnect_chat_messages_logs-2020-01-30) went to RED state; earlier it was GREEN. Can you please tell us what is happening?

This is a very urgent production issue; please help me fix it.

Kindly check these logs.

[2020-03-20T08:55:38,463][INFO ][o.e.c.m.MetaDataDeleteIndexService] [es-master-1] [imiconnect_apnp_trans_log-2020-02-26/CVNriLFrRt6jEe7V7YPOVQ] deleting index

[2020-03-20T08:55:44,211][INFO ][o.e.g.LocalAllocateDangledIndices] [es-master-1] auto importing dangled indices [[imiconnect_chat_messages_logs-2020-01-30/N4vcNm-mQy6ZWaBjxPfGvg]/OPEN] from [{es-coord-1}{ShbcqaRsSLKbJTNPybfRXA}{Qou2A3JbT-KcuxAHo8_v6Q}{10.0.123.136}{10.0.123.136:9300}]

As I stated earlier this forum is manned by volunteers. Do not ping people not already involved in the thread. This also means there are no SLAs and not even a guarantee of response.

I would recommend the following:

  • Make sure your VMs are not overprovisioned so Elasticsearch has access to the cores allocated
  • Make sure your VMs do not use memory ballooning so Elasticsearch actually has access to the RAM it thinks it has
  • Your cluster state updates seem to be slow, which could be caused by the factors above. You also have more shards than recommended, so I would recommend reducing this
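To put the shard-count point in rough numbers, here is a quick sketch based on the health output earlier in the thread. The ~20-shards-per-GB-of-heap figure is the commonly cited Elastic guideline, and the 16 GB heap is only an assumed example, not something stated in this thread:

```python
# Rough shard-per-node math for the cluster described above (2 data nodes).
active_shards = 2698
unassigned = 120
data_nodes = 2

shards_per_node = (active_shards + unassigned) / data_nodes
print(shards_per_node)  # 1409.0 shards per data node once everything is assigned

# Commonly cited guideline: at most ~20 shards per GB of JVM heap.
# Assuming e.g. a 16 GB heap, that would suggest staying below ~320 shards per node.
max_recommended = 20 * 16
print(shards_per_node > max_recommended)  # True -> far above the guideline
```

At roughly 1400 shards per data node, every cluster state update has to account for a very large shard table, which is consistent with the slow `put-mapping` and `delete-index` tasks in the logs.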
1 Like

When active_shards_percent_as_number reaches 100%, it means the cluster has come back to GREEN state.

{
"cluster_name": "IMIConnectProductionCluster",
"status": "red",
"timed_out": false,
"number_of_nodes": 7,
"number_of_data_nodes": 2,
"active_primary_shards": 1349,
"active_shards": 2698,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 120,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 95.7416607523066
}

Should I change any configuration other than the VM settings?

I do not know what the cause is yet, so I cannot tell.

Earlier we had two data nodes in zone-A.
Then we added another node in zone-B.
After that we removed the second node from zone-A.

At that time there was no issue with the cluster; health was green.
This issue started about one week ago.

Can you guess anything from this?

And why, if I delete one index, does another index immediately come to RED state?
I am very shocked by this situation.

You are also running an old version, so I would recommend upgrading as newer versions have improved resiliency significantly.

What you are seeing is not normal, but I am not sure what is wrong with your cluster setup.

Hi,
I observed that the file permissions in the indices directories differ between the two data nodes.
Is this okay, or did we make a mistake somewhere?

Datanode-1

Datanode-2
[screenshot of the indices directory listing]

Can you please help us on this.

If I delete this index (index-2020-03-17), then this index (index-2020-03-01) is immediately created automatically in RED state.

A few indices that were not present in the cluster earlier are being created automatically in RED state.

Logs:

[2020-03-20T13:38:49,613][INFO ][o.e.c.m.MetaDataDeleteIndexService] [es-master-1] [imiconnect_chat_messages_logs-2020-01-29/B7vyDb0zSiWbnB9ZrXwQGg] deleting index

[2020-03-20T13:38:50,662][INFO ][o.e.g.LocalAllocateDangledIndices] [es-master-1] auto importing dangled indices [[imiconnect_chat_messages_logs-2020-02-02/2WTBuIS6RNWY3tTujzoTtA]/OPEN] from [{es-coord-1}{ShbcqaRsSLKbJTNPybfRXA}{Qou2A3JbT-KcuxAHo8_v6Q}{10.0.123.136}{10.0.123.136:9300}]

Any guess on this?

That does look suspect.
