29th July index is in UNASSIGNED status and no index was created for 30th and 31st July

myapp-2025.07.29 is in UNASSIGNED status.
The daily index has also not been created for the last two days. Is there a way to fix this without deleting the messed-up index?
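
I can also share the allocation explanation for the unassigned replica if that is useful, roughly like this (elkhost5:9200 and the missing auth are just placeholders for our cluster endpoint):

# elkhost5:9200 is a placeholder; adjust to the real endpoint and auth
curl -XGET "http://elkhost5:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "myapp-2025.07.29",
  "shard": 0,
  "primary": false
}'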

[root@elkhost5 ~]# grep -i translog /var/log/elk/es.log
[2025-07-31T07:01:08,855][WARN ][o.e.i.e.Engine           ] [lkhost5] [myapp-2025.07.29][0] failed engine [failed to recover from translog]
org.elasticsearch.index.engine.EngineException: failed to recover from translog
	at org.elasticsearch.index.engine.InternalEngine.lambda$recoverFromTranslogInternal$6(InternalEngine.java:614) ~[elasticsearch-8.17.5.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslogInternal(InternalEngine.java:607) ~[elasticsearch-8.17.5.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.lambda$recoverFromTranslog$3(InternalEngine.java:584) ~[elasticsearch-8.17.5.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:580) ~[elasticsearch-8.17.5.jar:?]
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2100) ~[elasticsearch-8.17.5.jar:?]
Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[CLOSED] operation only allowed when recovering, origin [LOCAL_TRANSLOG_RECOVERY]
	at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1990) ~[elasticsearch-8.17.5.jar:?]
	at org.elasticsearch.index.shard.IndexShard.runTranslogRecovery(IndexShard.java:2038) ~[elasticsearch-8.17.5.jar:?]
	at org.elasticsearch.index.shard.IndexShard.lambda$openEngineAndRecoverFromTranslog$24(IndexShard.java:2091) ~[elasticsearch-8.17.5.jar:?]
	at org.elasticsearch.index.engine.InternalEngine.lambda$recoverFromTranslogInternal$6(InternalEngine.java:612) ~[elasticsearch-8.17.5.jar:?]

Index Status

[root@elkhost5 ~]# elk _cat/shards/*myapp-2025.07.27*
awsmyapp-2025.07.27 0 p STARTED 187541028 114.4gb 114.4gb 192.168.56.108 elkhost3
awsmyapp-2025.07.27 0 r STARTED 187541028 115.3gb 115.3gb 192.168.56.107 elkhost2
myapp-2025.07.27    0 r STARTED 576127403 757.5gb 757.5gb 192.168.56.107 elkhost2
myapp-2025.07.27    0 p STARTED 576127403 756.1gb 756.1gb 192.168.56.109 elkhost4
[root@elkhost5 ~]# elk _cat/shards/*myapp-2025.07.28*
awsmyapp-2025.07.28 0 r STARTED 277446638 190.1gb 190.1gb 192.168.56.107 elkhost2
awsmyapp-2025.07.28 0 p STARTED 277446638 190.1gb 190.1gb 192.168.56.109 elkhost4
myapp-2025.07.28    0 p STARTED 824961117 870.6gb 870.6gb 192.168.56.105 elkhost0
myapp-2025.07.28    0 r STARTED 824961117 870.6gb 870.6gb 192.168.56.107 elkhost2
[root@elkhost5 ~]# elk _cat/shards/*myapp-2025.07.29*
awsmyapp-2025.07.29 0 r RELOCATING   270916495 196.1gb 196.1gb 192.168.56.110 elkhost5 -> 192.168.56.105 4QVIH2DgSxuAh7ChBazYoQ elkhost0
awsmyapp-2025.07.29 0 p STARTED      270916495 196.2gb 196.2gb 192.168.56.108 elkhost3
myapp-2025.07.29    0 p INITIALIZING                           192.168.56.110 elkhost5
myapp-2025.07.29    0 r UNASSIGNED
[root@elkhost5 ~]# elk _cat/shards/*myapp-2025.07.30*
awsmyapp-2025.07.30 0 r STARTED 279517905 184.1gb 184.1gb 192.168.56.110 elkhost5
awsmyapp-2025.07.30 0 p STARTED 279517905 184.2gb 184.2gb 192.168.56.107 elkhost2
[root@elkhost5 ~]# elk _cat/shards/*myapp-2025.07.31*
awsmyapp-2025.07.31 0 r STARTED 2659744 2.1gb 2.1gb 192.168.56.107 elkhost2
awsmyapp-2025.07.31 0 p STARTED 2810538   2gb   2gb 192.168.56.109 elkhost4

Current space utilization

shards shards.undesired write_load.forecast disk.indices.forecast disk.indices disk.used disk.avail disk.total disk.percent host                     ip           node                     node.role
   184               43                 0.0                 7.1tb          8tb     8.6tb      1.7tb     10.3tb           83 elkhost5 192.168.56.110 elkhost5 cdfhilmrstw
   299               31                 0.0                 6.8tb        7.2tb     7.8tb      2.5tb     10.3tb           75 elkhost3 192.168.56.108 elkhost3 cdfhilmrstw
   324               73                 0.0                 7.9tb        7.9tb       8tb      2.3tb     10.3tb           77 elkhost4 192.168.56.109 elkhost4 cdfhilmrstw
   296              106                 0.0                 7.1tb        7.1tb     7.2tb      3.1tb     10.3tb           69 elkhost1 192.168.56.106 elkhost1 cdfhilmrstw
   326               47                 0.0                   6tb        6.2tb     6.2tb      4.1tb     10.3tb           60 elkhost0 192.168.56.105 elkhost0 cdfhilmrstw
   315               80                 0.0                 7.2tb        6.4tb     7.5tb      2.8tb     10.3tb           72 elkhost2 192.168.56.107 elkhost2 cdfhilmrstw


Health

1753885269 14:21:09 M0 red 6 6 1740 871 5 4 3 1 0 - 99.6%

Do you know why you are getting this message? Like, any idea at all?

I am not sure what is causing the issue, but I have some questions and comments.

If you look in the logs, do you find any evidence that one or more nodes at some point have dropped out of the cluster around the time this problem started? Do you have any ongoing shard recoveries?
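
If it helps, in-flight recoveries can be listed with the cat recovery API; a minimal sketch (the host and port are assumptions, adjust to however you normally reach the cluster):

# elkhost5:9200 is an assumed endpoint; add whatever auth you normally use
curl -XGET "http://elkhost5:9200/_cat/recovery?v&active_only=true"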

What type of hardware is this cluster deployed on, specifically what type of storage are you using?

It is generally recommended to aim for a shard size of up to 50GB or so, as that gives shards that are reasonably efficient to query and to relocate if the cluster experiences issues. You have some shards that are way above the recommended threshold, which could potentially be causing problems. Note that these belong to the same type of index as the one you are having problems with. I would recommend increasing the number of primary shards for the indices you listed in order to bring the average shard size down to around the recommended level, as sketched below.
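
A rough sketch of what that could look like for future daily indices, assuming they are created through an index template (the template name, the endpoint and the shard count of 16 below are assumptions, not your actual configuration; size the shard count from your real daily volume):

# 'myapp-template', elkhost5:9200 and 16 primaries are illustrative values only
curl -XPUT "http://elkhost5:9200/_index_template/myapp-template" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["myapp-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 16,
      "index.number_of_replicas": 1
    }
  }
}'

Note that this only affects indices created after the change (and you may need to fold it into whichever template already creates these indices rather than add a new one); existing oversized indices keep their shard count unless you reindex or split them.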

It's there in the stack trace:

CurrentState[CLOSED] operation only allowed when recovering, origin [LOCAL_TRANSLOG_RECOVERY]

In other words this shard got some distance through recovery and then stopped. There will be other log messages explaining why.
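
If it helps, something along these lines should surface them (the log path is taken from your earlier grep; the patterns are only suggestions):

# widen the search beyond 'translog' around the time of the failure
grep -iE "warn|error" /var/log/elk/es.log | grep "myapp-2025.07.29"
grep -iE "node-left|master node changed|disconnected" /var/log/elk/es.log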

That’s not really what I meant by the question, which was admittedly poorly worded, and @Christian_Dahlqvist asked a better-worded variant in the meantime.

My experience on these forums is that there is often context that is not shared with the initial query. Not usually due to any intention to mislead or hide information; it’s just that people aren’t sure what’s relevant or would be helpful.

In this case, was there maybe a sysadmin error, a maintenance activity, a power outage, … that either preceded or was ongoing when the trouble hit?


Right, yeah, exactly. But all of this should become clear once we see some more logs.

Some troubleshooters just want the raw logs, nothing but the logs. They stare into a swirling void of timestamps and stack traces and somehow divine the truth. These are the elite. The Jedi. The Sherlocks. The chosen few who speak fluent Logs-ish, and Java, and quite possibly Esperanto.

I am, as should be obvious, not one of them.

Then there are the other troubleshooters, more narrative types. We like eyewitness reports in plain English. Ideally with no words over two syllables. A diagram helps. That’s me. We can still help too.

Both types can hopefully coexist, maybe uneasily!

@Sangram_Dash — please, release the logs. Or tell your story.
