At least one primary shard for the index [.security-7] is unavailable issue

Hi,
I have a two-node cluster with the IPs "0.0.0.1" and "0.0.0.2". One of my VMs, "0.0.0.2", suddenly stopped, and after I started it again the cluster health was RED. I then restarted both VMs, found the message below in their logs, and could no longer log in at https://0.0.0.1:9200 or https://0.0.0.2:9200.

[2022-10-08T08:40:07,003][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [node-1] collector [cluster_stats] failed to collect data
org.elasticsearch.action.UnavailableShardsException: at least one primary shard for the index [.security-7] is unavailable
	at org.elasticsearch.xpack.security.support.SecurityIndexManager.getUnavailableReason(SecurityIndexManager.java:147) ~[?:?]
	at org.elasticsearch.xpack.security.authc.esnative.NativeUsersStore.getUserCount(NativeUsersStore.java:167) ~[?:?]
	at org.elasticsearch.xpack.security.authc.esnative.NativeRealm.lambda$usageStats$1(NativeRealm.java:56) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) ~[elasticsearch-7.16.1.jar:7.16.1]
	at org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.lambda$usageStats$5(CachingUsernamePasswordRealm.java:249) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136) ~[elasticsearch-7.16.1.jar:7.16.1]
	at org.elasticsearch.xpack.core.security.authc.Realm.usageStats(Realm.java:140) ~[?:?]
	at org.elasticsearch.xpack.security.authc.support.CachingUsernamePasswordRealm.usageStats(CachingUsernamePasswordRealm.java:247) ~[?:?]
	at org.elasticsearch.xpack.security.authc.esnative.NativeRealm.usageStats(NativeRealm.java:56) ~[?:?]
	at org.elasticsearch.xpack.security.authc.Realms.usageStats(Realms.java:388) ~[?:?]
	at org.elasticsearch.xpack.security.SecurityFeatureSet.usage(SecurityFeatureSet.java:165) ~[?:?]
	at org.elasticsearch.xpack.core.action.TransportXPackUsageAction.lambda$masterOperation$2(TransportXPackUsageAction.java:86) ~[?:?]
	at org.elasticsearch.xpack.core.common.IteratingActionListener.onResponse(IteratingActionListener.java:135) ~[?:?]
	at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-7.16.1.jar:7.16.1]
	at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-7.16.1.jar:7.16.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.16.1.jar:7.16.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.16.1.jar:7.16.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

Then, based on this link, I took the steps below to resolve the issue:

1- Define a new user:

elasticsearch-users useradd restore_user -p xxxxxxx -r superuser

2- Delete the corrupt index:

curl -u restore_user -k -X DELETE "https://localhost:9200/.security-*"

3- Restart all nodes.

After these steps I was able to log in with the new user on the Elasticsearch node where I had defined it, but all previous roles and users had vanished and I had to define them manually again.
How can I handle this issue without having to define the users and roles again?
Also, the cluster health is RED and Kibana monitoring shows two unassigned shards, yet in the indices view the status of every index is green.

Regards

When you delete the index you delete all the existing users and roles.

We would need to figure out why your nodes "suddenly stopped" to try to understand what caused the index to be unrecoverable. Sharing some more logs would help.
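
Alongside the logs, the output of the cluster health, cat shards, and allocation explain APIs usually shows which shards are unassigned and why. The following is only a sketch: `ES_URL` and `ES_USER` are placeholders (the defaults reuse the localhost URL and `restore_user` from the steps above), the function names are made up for illustration, and `-k` skips certificate verification as in the original command.

```shell
# Placeholders; adjust to your cluster. curl prompts for the password.
ES_URL="${ES_URL:-https://localhost:9200}"
ES_USER="${ES_USER:-restore_user}"

# Overall cluster status (green / yellow / red) and unassigned shard count.
cluster_health() {
  curl -sS -k -u "$ES_USER" "$ES_URL/_cluster/health?pretty"
}

# One line per shard with its state and, if unassigned, the reason code.
list_shards() {
  curl -sS -k -u "$ES_USER" \
    "$ES_URL/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
}

# Explanation for an unassigned shard chosen by the cluster.
# Returns an error response if no shard is currently unassigned.
explain_unassigned() {
  curl -sS -k -u "$ES_USER" "$ES_URL/_cluster/allocation/explain?pretty"
}
```

Running `cluster_health`, then `list_shards`, then `explain_unassigned`, and posting the output here would give a much clearer picture than the stack trace alone.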

Hi, thanks for your answer. The VM was stopped, so the Elasticsearch node stopped too.
How can we prevent such a disaster?

How were the hosts stopped exactly? Did Elasticsearch have the chance to shut down gracefully?

Actually, the VM the software group gave us was only available for a limited time, and when that time ran out the VM stopped automatically. Elasticsearch is installed as a Windows service, so I expected it to stop correctly.
What is the best way to stop Elasticsearch to prevent this problem? And when this issue happens, is there any way to resolve the above error other than deleting the security index?

Also, is there a way to make a daily backup of the roles and users, so that when this issue happens we can simply load the backup instead of defining the users and roles manually?
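
On the question of stopping cleanly: one common pattern before a planned shutdown, borrowed from the documented restart procedure, is to limit shard allocation to primaries, flush, and only then stop the service. The sketch below assumes the cluster is still reachable and that `ES_URL` and `ES_USER` are placeholders; the function names are hypothetical. It may reduce recovery work after the restart, but it is not a guarantee against a corrupted index if the VM is powered off abruptly.

```shell
# Placeholders; adjust to your cluster. -k skips certificate verification.
ES_URL="${ES_URL:-https://localhost:9200}"
ES_USER="${ES_USER:-restore_user}"

# Run before stopping a node: restrict allocation to primaries, then flush
# so that indexed data is committed to disk.
prepare_shutdown() {
  curl -sS -k -u "$ES_USER" -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"persistent":{"cluster.routing.allocation.enable":"primaries"}}' \
    || return 1
  curl -sS -k -u "$ES_USER" -X POST "$ES_URL/_flush"
}

# Run after the node has rejoined: reset allocation to its default.
resume_allocation() {
  curl -sS -k -u "$ES_USER" -X PUT "$ES_URL/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d '{"persistent":{"cluster.routing.allocation.enable":null}}'
}
```

After `prepare_shutdown`, the Windows service can be stopped through the service manager or with `bin\elasticsearch-service.bat stop`; once the node is back, `resume_allocation` lets replicas be assigned again.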

Hi,
I would really appreciate it if you could respond to my questions above. Thanks.

You can take snapshots of your indices and then restore them, yes.
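
As a rough illustration for 7.x, a snapshot that covers the security configuration can target the `.security` alias together with the global state. This is a sketch under assumptions: the repository name `my_backup` and its path are hypothetical, the path must already be listed under `path.repo` in `elasticsearch.yml` on every node, the function names are made up, and `ES_URL` and `ES_USER` are placeholders.

```shell
# Placeholders; adjust to your cluster. -k skips certificate verification.
ES_URL="${ES_URL:-https://localhost:9200}"
ES_USER="${ES_USER:-restore_user}"
REPO="${REPO:-my_backup}"                           # hypothetical repository name
REPO_PATH="${REPO_PATH:-/mount/backups/my_backup}"  # hypothetical path under path.repo

# Register a shared-filesystem snapshot repository (needed once).
register_repo() {
  curl -sS -k -u "$ES_USER" -X PUT "$ES_URL/_snapshot/$REPO" \
    -H 'Content-Type: application/json' \
    -d "{\"type\":\"fs\",\"settings\":{\"location\":\"$REPO_PATH\"}}"
}

# Snapshot the security index. $1 is the snapshot name (must be lowercase);
# it defaults to a date-stamped name suitable for a daily scheduled task.
snapshot_security() {
  name="${1:-security-$(date +%Y%m%d)}"
  curl -sS -k -u "$ES_USER" -X PUT \
    "$ES_URL/_snapshot/$REPO/$name?wait_for_completion=true" \
    -H 'Content-Type: application/json' \
    -d '{"indices":".security","include_global_state":true}'
}
```

Calling `register_repo` once and `snapshot_security` from a daily scheduled task would give a restorable copy of users and roles. A restore would go through the `_restore` endpoint of the same snapshot, authenticated as a file-realm user like the one created with `elasticsearch-users` above; the exact restore options depend on the version, so check the snapshot documentation for your release.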

I mean backing up the defined roles and users, not the indices. Can we back up just the roles and users?
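
For reference, the definitions themselves can be read through the security APIs and saved as JSON. A minimal sketch, assuming `ES_URL`, `ES_USER`, and `BACKUP_DIR` are placeholders and the function name is hypothetical. Note that the get users API does not return password hashes, so users re-created from such an export need new passwords; a snapshot of the security index remains the only way to keep everything.

```shell
# Placeholders; adjust to your cluster. -k skips certificate verification.
ES_URL="${ES_URL:-https://localhost:9200}"
ES_USER="${ES_USER:-restore_user}"
BACKUP_DIR="${BACKUP_DIR:-./security-export}"   # hypothetical output directory

# Save role, user, and role-mapping definitions to date-stamped JSON files.
# The output also includes built-in (reserved) roles and users.
export_security_definitions() {
  stamp="$(date +%Y%m%d)"
  mkdir -p "$BACKUP_DIR" || return 1
  for kind in role user role_mapping; do
    curl -sS -k -u "$ES_USER" "$ES_URL/_security/$kind" \
      > "$BACKUP_DIR/$kind-$stamp.json" || return 1
  done
}
```

Since curl asks for the password on each request, an unattended daily run would need the credentials supplied some other way, for example a `.netrc` file.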
