ILM makes the cluster unstable: heap overflow

When I try to execute:

curl -X POST "xx.xx.xxx.xx:9200/_ilm/stop?pretty"

The result is:
"type" : "process_cluster_event_timeout_exception"

I think the scheduled ILM operations are causing a heap overflow error:

[2019-10-25T00:53:10,313][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [CGSS-CLUSTER01-MASTER-1] uncaught exception in thread [elasticsearch[CGSS-CLUSTER01-MASTER-1][management][T#1]]
java.lang.IllegalStateException: expected index [quorum01-20180202] with policy [time_close] to have current step consistent with provided step key (null) but it was {"phase":"new","action":"init","name":"init"}
at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleRunner.maybeRunAsyncAction(IndexLifecycleRunner.java:167) ~[?:?]
at org.elasticsearch.xpack.indexlifecycle.IndexLifecycleService.onMaster(IndexLifecycleService.java:123) ~[?:?]
at org.elasticsearch.cluster.service.ClusterApplierService$OnMasterRunnable.run(ClusterApplierService.java:616) ~[elasticsearch-6.6.0.jar:6.6.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) ~[elasticsearch-6.6.0.jar:6.6.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_31]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_31]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_31]
[2019-10-25T00:53:10,754][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [CGSS-CLUSTER01-MASTER-1] timed out while retrying [indices:admin/create] after failure (timeout [30s])
[2019-10-25T00:53:10,755][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [CGSS-CLUSTER01-MASTER-1] timed out while retrying [indices:admin/create] after failure (timeout [30s])

Any recommendations? Before I configured ILM, I used to run my own Python program to close old indices using a time mask, and I never had any problems with the cluster.
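
For anyone comparing the two approaches, the time-mask idea can be sketched roughly as below. This is not the original program, only a minimal illustration. It assumes index names end in a -YYYYMMDD date (as in quorum01-20180202 from the log above); the helper names indices_older_than and close_url are hypothetical, and the host is a placeholder.

```python
from datetime import date, datetime, timedelta


def indices_older_than(index_names, max_age_days, today):
    """Return the names whose trailing -YYYYMMDD date is older than max_age_days.

    Assumption: the date is the part of the name after the last hyphen.
    """
    cutoff = today - timedelta(days=max_age_days)
    old = []
    for name in index_names:
        suffix = name.rsplit("-", 1)[-1]
        try:
            index_day = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            # No parsable date suffix: leave this index alone.
            continue
        if index_day < cutoff:
            old.append(name)
    return old


def close_url(host, index_name):
    """Build the URL for the close index API (sent with POST)."""
    return f"http://{host}/{index_name}/_close"


if __name__ == "__main__":
    names = ["quorum01-20180202", "quorum01-20191020", "other"]
    for name in indices_older_than(names, 30, date.today()):
        # A real script would send a POST request to this URL.
        print("POST", close_url("xx.xx.xxx.xx:9200", name))
```

Keeping the selection logic separate from the HTTP call makes it possible to check which indices would be closed before anything is sent to the cluster.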

Finally, I modified the index settings using curl to remove the lifecycle policy:

curl -X PUT "**.**.***.***:9200/****-20191025/_settings?pretty" -H 'Content-Type: application/json' -d'{"index":{"lifecycle":{"name":""}}}'

Then I enabled allocation for primaries and then for all shards.
The index lifecycle management policy, configured with only a cold phase that closes indices older than 30 days, generates a heap overflow. On the master nodes I saw that heap usage is not balanced; the ILM process causes the overflow and kills master nodes until the limit set in the .yml config file is reached.
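
The two allocation steps can be sketched as request bodies for PUT _cluster/settings. This is only an illustration of the setting involved (cluster.routing.allocation.enable); using a transient rather than persistent setting is an assumption, and the helper name allocation_settings_body is hypothetical.

```python
import json

# Values accepted by cluster.routing.allocation.enable.
ALLOWED_MODES = {"all", "primaries", "new_primaries", "none"}


def allocation_settings_body(mode):
    """Build the JSON body for PUT _cluster/settings that sets shard allocation."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported allocation mode: {mode!r}")
    # Assumption: a transient setting; use "persistent" to survive a full restart.
    return json.dumps({"transient": {"cluster.routing.allocation.enable": mode}})


# Enable allocation for primaries first, then for all shards.
steps = [allocation_settings_body("primaries"), allocation_settings_body("all")]

if __name__ == "__main__":
    for body in steps:
        print("PUT _cluster/settings", body)
```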

It is probably not a very "clean" solution, but it works: the cluster is now fully operational.
