Hi,
I have a 3-node ELK cluster, and Curator is scheduled via a cron job to run every hour and roll over the indices.
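For context, the rollover action in my Curator action file is configured roughly along these lines. The conditions shown here are illustrative placeholders, not my exact values; the alias name packetbeat_logs, the action ID 7, and wait_for_active_shards=0 come from the log output further down:

```yaml
actions:
  7:
    action: rollover
    description: >-
      Roll over the index associated with the alias packetbeat_logs
    options:
      name: packetbeat_logs
      wait_for_active_shards: 0
      conditions:
        max_age: 1h      # placeholder condition
        max_size: 1gb    # placeholder condition
```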
I have run into an issue where Curator created a new index, packetbeat-6.2.2-2018.08.06-000210, but logs continue to be written to the previous index, packetbeat-6.2.2-2018.08.06-000209, and the Curator rollover action now fails because the index already exists. As you can see below, packetbeat-6.2.2-2018.08.06-000209 has grown huge, and Elasticsearch is consuming a lot of RAM and CPU because of it.
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open packetbeat-6.2.2-2018.08.06-000207 Kg0sk22zQ3SmdwCi2naKTA 3 1 2593120 0 869.2mb 434.6mb
green open packetbeat-6.2.2-2018.08.06-000205 x2BGrh9KR-i3mD9CKA6RZw 3 1 2479330 0 833.1mb 416.5mb
green open packetbeat-6.2.2-2018.08.06-000203 VtDdQH1mQEG12HrkoMVHLQ 3 1 809963 0 281.1mb 140.5mb
green open packetbeat-6.2.2-2018.08.06-000208 uqxkpEf3T36IdvUZ6H9h9w 3 1 282274 0 99.7mb 49.8mb
green open packetbeat-6.2.2-2018.08.06-000202 Vrc0eBvORTqMfbVXWDqi7g 3 1 1541105 0 549mb 274.5mb
green open packetbeat-6.2.2-2018.08.06-000209 Iiaj9ZeHRC6lW0t5Hav1vQ 3 1 382013427 0 146.3gb 71.8gb
green open packetbeat-6.2.2-2018.08.06-000210 3czhBrAITG6LWcpF2BvCYg 3 1 0 0 1.5kb 785b
green open packetbeat-6.2.2-2018.08.06-000206 UgpKLWZyRv2PWjsKe8ZYuw 3 1 2416637 0 809.4mb 404.7mb
green open packetbeat-6.2.2-2018.08.06-000204 BWUu21OfQSCw5NusYgQtYQ 3 1 2289579 0 765.1mb 382.5mb
There were some connectivity issues between the nodes around the time index packetbeat-6.2.2-2018.08.06-000210 was created. I can see the errors below from when index 000210 was created:
[2018-08-15T16:01:00,060][INFO ][o.e.c.m.MetaDataCreateIndexService] [elite10dot78] [packetbeat-6.2.2-2018.08.06-000210] creating index, cause [rollover_index],
templates [packetbeat-6.2.2], shards [3]/[1], mappings [doc]
[2018-08-15T16:01:38,806][WARN ][r.suppressed ] path: /packetbeat_logs/_rollover, params: {wait_for_active_shards=0, index=packetbeat_logs, dry_run=false}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (index-aliases) within 30s
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:125) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_171]
at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:124) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
[2018-08-15T16:01:38,828][WARN ][o.e.d.z.PublishClusterStateAction] [elite10dot78] timed out waiting for all nodes to process published state [3315]
(timeout [30s], pending nodes: [{elite10dot76}{CGAWYwb1T0ey7NWL_88jsg}{UWhfKMiLQpiPWiBaPA6mJQ}{elite10dot76.elite35.lab}{192.168.10.76:9400}])
[2018-08-15T16:01:40,781][WARN ][o.e.c.s.MasterService ] [elite10dot78] cluster state update task [shard-started shard id
[[packetbeat-6.2.2-2018.08.06-000210][2]], allocation id [Cl0lPccmRJW1w26k2F-XDQ], primary term [0], message [after new shard recovery]
[shard id [[packetbeat-6.2.2-2018.08.06-000210][2]], allocation id [Cl0lPccmRJW1w26k2F-XDQ], primary term [0], message [after new shard recovery]]] took
[31.9s] above the warn threshold of 30s
A few minutes later, there were more issues:
[2018-08-15T16:08:30,064][WARN ][o.e.c.s.MasterService ] [elite10dot78] cluster state update task [shard-started shard id
[[packetbeat-6.2.2-2018.08.06-000210][2]], allocation id [yXL1puj1QpWjujWhoCFNWg], primary term [0], message [master {elite10dot78}{AR-RBd2HRkWxWUtmSZr7CQ}
{8FyiBT55SlK1IXac4c_INw}{elite10dot78.elite35.lab}{192.168.10.78:9400} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]
[shard id [[packetbeat-6.2.2-2018.08.06-000210][2]], allocation id [yXL1puj1QpWjujWhoCFNWg], primary term [0], message
[master {elite10dot78}{AR-RBd2HRkWxWUtmSZr7CQ}{8FyiBT55SlK1IXac4c_INw}{elite10dot78.elite35.lab}{192.168.10.78:9400} marked shard as initializing,
but shard state is [POST_RECOVERY], mark shard as started]], shard-started shard id [[packetbeat-6.2.2-2018.08.06-000210][0]],
allocation id [d7CD5AeuQcSNDoMOmKG0-Q], primary term [0], message [after peer recovery][shard id [[packetbeat-6.2.2-2018.08.06-000210][0]],
allocation id [d7CD5AeuQcSNDoMOmKG0-Q], primary term [0], message [after peer recovery]], shard-started shard id [[packetbeat-6.2.2-2018.08.06-000210][2]],
allocation id [yXL1puj1QpWjujWhoCFNWg], primary term [0], message [after peer recovery][shard id [[packetbeat-6.2.2-2018.08.06-000210][2]],
allocation id [yXL1puj1QpWjujWhoCFNWg], primary term [0], message [after peer recovery]]] took [30.1s] above the warn threshold of 30s
Since then, Curator has logged the following error every hour whenever it tries to run the rollover action:
2018-08-15 18:00:08,002 INFO Trying Action ID: 7, "rollover": Rollover the index associated with index 'name', which should be in the form of prefix-YYYY.MM.DD-1.
2018-08-15 18:00:08,013 INFO Performing index rollover
2018-08-15 18:00:08,038 ERROR Failed to complete action: rollover. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(400, 'resource_already_exists_exception', 'index [packetbeat-6.2.2-2018.08.06-000210/3czhBrAITG6LWcpF2BvCYg] already exists')
I believe Curator is trying to create the index packetbeat-6.2.2-2018.08.06-000210 again because the alias is still pointing to packetbeat-6.2.2-2018.08.06-000209, but I don't understand why the alias was never updated to the new index.
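In case it helps, this is roughly how I'd check the alias and, as a workaround, move it to the new index by hand. This assumes packetbeat_logs is the write alias (that is the name the rollover request in the logs uses) and that localhost:9200 is a placeholder for one of the cluster nodes:

```shell
# Show which index the packetbeat_logs alias currently points to
curl -s 'localhost:9200/_cat/aliases/packetbeat_logs?v'

# If it still points at ...-000209, move it to ...-000210 in one atomic call
curl -s -X POST 'localhost:9200/_aliases' \
  -H 'Content-Type: application/json' -d '
{
  "actions": [
    { "remove": { "index": "packetbeat-6.2.2-2018.08.06-000209", "alias": "packetbeat_logs" } },
    { "add":    { "index": "packetbeat-6.2.2-2018.08.06-000210", "alias": "packetbeat_logs" } }
  ]
}'
```

Because both actions run in a single _aliases request, the switch is atomic and no writes should be lost in between. That would only be a workaround, though, not an explanation of why the rollover left the alias behind in the first place.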
Does anyone have an idea of what is going wrong here?