Index rollover broken after connectivity issues between nodes in a 3-node ELK cluster

Hi,

I have a 3-node ELK cluster, and I have Curator scheduled via a cron job to run every hour and roll over the indices.

I have run into an issue where Curator created a new index, packetbeat-6.2.2-2018.08.06-000210, but the logs continue to go to the previous index, packetbeat-6.2.2-2018.08.06-000209, and the Curator rollover action now fails saying the index already exists. The index packetbeat-6.2.2-2018.08.06-000209 has grown huge, as you can see below, and Elasticsearch is using a lot of RAM and CPU because of it.

health status index                              uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   packetbeat-6.2.2-2018.08.06-000207 Kg0sk22zQ3SmdwCi2naKTA   3   1    2593120            0    869.2mb        434.6mb
green  open   packetbeat-6.2.2-2018.08.06-000205 x2BGrh9KR-i3mD9CKA6RZw   3   1    2479330            0    833.1mb        416.5mb
green  open   packetbeat-6.2.2-2018.08.06-000203 VtDdQH1mQEG12HrkoMVHLQ   3   1     809963            0    281.1mb        140.5mb
green  open   packetbeat-6.2.2-2018.08.06-000208 uqxkpEf3T36IdvUZ6H9h9w   3   1     282274            0     99.7mb         49.8mb
green  open   packetbeat-6.2.2-2018.08.06-000202 Vrc0eBvORTqMfbVXWDqi7g   3   1    1541105            0      549mb        274.5mb
green  open   packetbeat-6.2.2-2018.08.06-000209 Iiaj9ZeHRC6lW0t5Hav1vQ   3   1  382013427            0    146.3gb         71.8gb
green  open   packetbeat-6.2.2-2018.08.06-000210 3czhBrAITG6LWcpF2BvCYg   3   1          0            0      1.5kb           785b
green  open   packetbeat-6.2.2-2018.08.06-000206 UgpKLWZyRv2PWjsKe8ZYuw   3   1    2416637            0    809.4mb        404.7mb
green  open   packetbeat-6.2.2-2018.08.06-000204 BWUu21OfQSCw5NusYgQtYQ   3   1    2289579            0    765.1mb        382.5mb
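
For reference, the hourly Curator action is essentially a _rollover request against the write alias packetbeat_logs (the alias name is visible in the log further down). Here is a minimal sketch of the equivalent call using the elasticsearch-py client; the host and the rollover conditions are placeholders, not my exact action file:

# Sketch of what the hourly rollover amounts to (elasticsearch-py 6.x).
# Host and conditions are placeholders, not the real action file.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# _rollover looks at the index the alias currently points to, derives the next
# name (000210 from 000209), creates it, and then moves the alias in a second
# cluster-state update.
result = es.indices.rollover(
    alias="packetbeat_logs",
    body={"conditions": {"max_age": "1h", "max_size": "1gb"}},
)
print(result["rolled_over"], result["old_index"], result["new_index"])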

There were some connectivity issues between the nodes around the time index packetbeat-6.2.2-2018.08.06-000210 was being created. I can see the errors below in the Elasticsearch logs from when index 000210 was created.

[2018-08-15T16:01:00,060][INFO ][o.e.c.m.MetaDataCreateIndexService] [elite10dot78] [packetbeat-6.2.2-2018.08.06-000210] creating index, cause [rollover_index], 
templates [packetbeat-6.2.2], shards [3]/[1], mappings [doc]
[2018-08-15T16:01:38,806][WARN ][r.suppressed             ] path: /packetbeat_logs/_rollover, params: {wait_for_active_shards=0, index=packetbeat_logs, dry_run=false}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (index-aliases) within 30s
        at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:125) ~[elasticsearch-6.2.2.jar:6.2.2]
        at java.util.ArrayList.forEach(ArrayList.java:1257) ~[?:1.8.0_171]
        at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:124) ~[elasticsearch-6.2.2.jar:6.2.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.2.jar:6.2.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
[2018-08-15T16:01:38,828][WARN ][o.e.d.z.PublishClusterStateAction] [elite10dot78] timed out waiting for all nodes to process published state [3315] 
(timeout [30s], pending nodes: [{elite10dot76}{CGAWYwb1T0ey7NWL_88jsg}{UWhfKMiLQpiPWiBaPA6mJQ}{elite10dot76.elite35.lab}{192.168.10.76:9400}])
[2018-08-15T16:01:40,781][WARN ][o.e.c.s.MasterService    ] [elite10dot78] cluster state update task [shard-started shard id 
[[packetbeat-6.2.2-2018.08.06-000210][2]], allocation id [Cl0lPccmRJW1w26k2F-XDQ], primary term [0], message [after new shard recovery]
[shard id [[packetbeat-6.2.2-2018.08.06-000210][2]], allocation id [Cl0lPccmRJW1w26k2F-XDQ], primary term [0], message [after new shard recovery]]] took 
[31.9s] above the warn threshold of 30s

Some more issues appeared a few minutes later.

[2018-08-15T16:08:30,064][WARN ][o.e.c.s.MasterService    ] [elite10dot78] cluster state update task [shard-started shard id 
[[packetbeat-6.2.2-2018.08.06-000210][2]], allocation id [yXL1puj1QpWjujWhoCFNWg], primary term [0], message [master {elite10dot78}{AR-RBd2HRkWxWUtmSZr7CQ}
{8FyiBT55SlK1IXac4c_INw}{elite10dot78.elite35.lab}{192.168.10.78:9400} marked shard as initializing, but shard state is [POST_RECOVERY], mark shard as started]
[shard id [[packetbeat-6.2.2-2018.08.06-000210][2]], allocation id [yXL1puj1QpWjujWhoCFNWg], primary term [0], message 
[master {elite10dot78}{AR-RBd2HRkWxWUtmSZr7CQ}{8FyiBT55SlK1IXac4c_INw}{elite10dot78.elite35.lab}{192.168.10.78:9400} marked shard as initializing, 
but shard state is [POST_RECOVERY], mark shard as started]], shard-started shard id [[packetbeat-6.2.2-2018.08.06-000210][0]], 
allocation id [d7CD5AeuQcSNDoMOmKG0-Q], primary term [0], message [after peer recovery][shard id [[packetbeat-6.2.2-2018.08.06-000210][0]], 
allocation id [d7CD5AeuQcSNDoMOmKG0-Q], primary term [0], message [after peer recovery]], shard-started shard id [[packetbeat-6.2.2-2018.08.06-000210][2]], 
allocation id [yXL1puj1QpWjujWhoCFNWg], primary term [0], message [after peer recovery][shard id [[packetbeat-6.2.2-2018.08.06-000210][2]], 
allocation id [yXL1puj1QpWjujWhoCFNWg], primary term [0], message [after peer recovery]]] took [30.1s] above the warn threshold of 30s

Since then, I'm seeing the following error from Curator every hour, whenever it tries to run the rollover action.

2018-08-15 18:00:08,002 INFO      Trying Action ID: 7, "rollover": Rollover the index associated with index 'name', which should be in the form of prefix-YYYY.MM.DD-1.
2018-08-15 18:00:08,013 INFO      Performing index rollover
2018-08-15 18:00:08,038 ERROR     Failed to complete action: rollover.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(400, 'resource_already_exists_exception', 'index [packetbeat-6.2.2-2018.08.06-000210/3czhBrAITG6LWcpF2BvCYg] already exists')

I believe it is trying to create index packetbeat-6.2.2-2018.08.06-000210 again because the alias still points to packetbeat-6.2.2-2018.08.06-000209. From the first log excerpt it looks like the new index was created but the subsequent alias update (the index-aliases cluster event) timed out after 30s, yet I don't understand why the alias was never moved over to the new index afterwards.
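
In case it is useful, this is roughly how I have been checking the alias, and what I assume a manual repair would look like (moving the alias by hand from 000209 to the already-created 000210) — please tell me if that is the wrong way to recover:

# Check which index the write alias points to, then (hypothetically) move it.
# Host is a placeholder; alias and index names are the ones from this thread.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Shows the concrete indices that currently carry the alias.
print(es.indices.get_alias(name="packetbeat_logs"))

# Assumed manual fix: atomically swap the alias onto the index that the failed
# rollover already created, so the next hourly run can proceed normally.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "packetbeat-6.2.2-2018.08.06-000209", "alias": "packetbeat_logs"}},
        {"add":    {"index": "packetbeat-6.2.2-2018.08.06-000210", "alias": "packetbeat_logs"}},
    ]
})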

Do any of you have any idea what is going wrong here?

That sounds odd. How many of your nodes are master-eligible? What do you have minimum_master_nodes set to in your configuration?

All three nodes are master-eligible, and minimum_master_nodes is set to 2.
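
For what it's worth, this is roughly how I checked those values on a live node (assuming the HTTP port is 9200; the logs above only show the transport port 9400):

# Quick check of master eligibility and the discovery setting (elasticsearch-py 6.x).
# Host and port are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elite10dot78:9200"])

# node.role contains "m" for master-eligible nodes; the master column marks the elected master.
print(es.cat.nodes(h="name,node.role,master", v=True))

# The effective value shows up under persistent, transient, or defaults,
# depending on where it was set.
settings = es.cluster.get_settings(include_defaults=True, flat_settings=True)
for scope in ("persistent", "transient", "defaults"):
    value = settings.get(scope, {}).get("discovery.zen.minimum_master_nodes")
    if value is not None:
        print(scope, "discovery.zen.minimum_master_nodes =", value)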

Hi,

Any ideas on how I can prevent my cluster from getting into this situation again?

Hi,

I'm still waiting to hear from you experts on what could have caused this issue and how I can prevent my cluster from getting into this state again.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.