Access Denied Exception with Shrink and Delete After using Curator


(Micah Hunsberger) #1

I'm using Curator with Elasticsearch 6.3.0 to run a shrink action with delete_after on indices once they've been rolled over.

Curator appears to run cleanly, since its log shows no errors. The cluster logs, however, contain a warning that the index can't be deleted, and the index folder is still on disk, so the space hasn't been freed either.

[2018-11-21T12:09:28,605][WARN ][o.e.i.IndicesService     ] [ELK-ES-A] [logstash-syslog-2018.09.07-001/gv8gx3KhSKWTG6c6OmpH5g] failed to delete index
java.io.IOException: could not remove the following files (in the order of attempts):
   D:\ES\data\nodes\0\indices\gv8gx3KhSKWTG6c6OmpH5g\0\index\_1qvm.dim: java.nio.file.AccessDeniedException: D:\ES\data\nodes\0\indices\gv8gx3KhSKWTG6c6OmpH5g\0\index\_1qvm.dim
   D:\ES\data\nodes\0\indices\gv8gx3KhSKWTG6c6OmpH5g\0\index\_1qvm.fdt: java.nio.file.AccessDeniedException: D:\ES\data\nodes\0\indices\gv8gx3KhSKWTG6c6OmpH5g\0\index\_1qvm.fdt
   D:\ES\data\nodes\0\indices\gv8gx3KhSKWTG6c6OmpH5g\0\index\_1qvm_Lucene50_0.doc: java.nio.file.AccessDeniedException: D:\ES\data\nodes\0\indices\gv8gx3KhSKWTG6c6OmpH5g\0\index\_1qvm_Lucene50_0.doc
   D:\ES\data\nodes\0\indices\gv8gx3KhSKWTG6c6OmpH5g\0\index\_1qvm_Lucene50_0.pos: java.nio.file.AccessDeniedException: D:\ES\data\nodes\0\indices\gv8gx3KhSKWTG6c6OmpH5g\0\index\_1qvm_Lucene50_0.pos
...

Later on, it logs the following every 10 seconds for about 30 minutes:

[2018-11-21T12:09:28,727][WARN ][o.e.i.IndicesService     ] [ELK-ES-A] [logstash-syslog-2018.09.07-001/gv8gx3KhSKWTG6c6OmpH5g] still pending deletes present for shards [[[logstash-syslog-2018.09.07-001/gv8gx3KhSKWTG6c6OmpH5g]], [[logstash-syslog-2018.09.07-001/gv8gx3KhSKWTG6c6OmpH5g]][0], [[logstash-syslog-2018.09.07-001/gv8gx3KhSKWTG6c6OmpH5g]][1]] - retrying

My Curator action file looks like this:

  1:
    action: rollover
    description: >-
      Rollover logstash-syslog_write alias if it is bigger than 4gb or older than 28d
    options:
      name: logstash-syslog_write
      disable_action: False
      ignore_empty_list: True
      conditions:
        max_size: 4gb
        max_age: 28d
  2:
    action: index_settings
    description: >-
      Set number of replicas to 0 on rolled over logstash-* indices to prepare for shrinking
    options:
      disable_action: False
      ignore_empty_list: True
      index_settings:
        index:
          number_of_replicas: 0
    filters:
    - filtertype: alias
      aliases:
        - logstash-fw-syslog_write
        - logstash-syslog_write
        - logstash-vpn-syslog_write
      exclude: True
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: pattern
      kind: suffix
      value: -archive
      exclude: True
  3:
    action: shrink
    description: >-
      Shrink rolled over logstash-* indices on ELK-ES-A.
      Delete each source index after successful shrink,
      then reroute the shrunk index with the provided parameters.
    options:
      disable_action: False
      ignore_empty_list: True
      shrink_node: ELK-ES-A
      node_filters:
        permit_masters: True
      number_of_shards: 1
      number_of_replicas: 0
      shrink_prefix: ''
      shrink_suffix: '-archive'
      delete_after: True
      post_allocation:
        allocation_type: require
        key: 'node_type'
        value: 'cold'
      wait_for_active_shards: all
      extra_settings:
        settings:
          index.codec: best_compression
          index.refresh_interval: 1m
      wait_for_completion: True
      wait_for_rebalance: True
      wait_interval: 9
      max_wait: -1
      timeout_override: 21600
    filters:
    - filtertype: alias
      aliases:
        - logstash-fw-syslog_write
        - logstash-syslog_write
        - logstash-vpn-syslog_write
      exclude: True
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: pattern
      kind: suffix
      value: -archive
      exclude: True

Elasticsearch is running as a System service on Windows, so the Access Denied exception should not be caused by security permissions. I also don't get this error if I simply delete an index; I only see these logs after shrinking.

Any help is greatly appreciated. If this is solved by upgrading Elasticsearch, I can do that as well.


(David Turner) #2

This is unfortunately a known issue with shrinking indices on Windows (and my word it was a pain to diagnose): https://github.com/elastic/elasticsearch/issues/33857#issuecomment-425438171

The shrunken copies of these indices are preventing the unshrunken ones from being deleted. The only workaround I can think of right now is to reallocate the shrunken shards elsewhere, which will allow the deletions to go ahead.
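
As a rough sketch of that workaround in Curator terms, an allocation action along these lines could move the already-shrunken indices off the affected node so the pending deletes can complete. The attribute value warm is an assumption for illustration only (it is not part of the setup described in this thread), and the sketch presumes another node carrying that node_type attribute exists and has room for the shards.

```yaml
actions:
  1:
    action: allocation
    description: >-
      Move shrunken *-archive indices to a different node so that the
      pending deletes of the unshrunken source indices can proceed.
    options:
      disable_action: False
      ignore_empty_list: True
      allocation_type: require
      key: node_type
      # Assumption: 'warm' is a hypothetical attribute value that belongs to
      # a node other than the one holding the undeletable index folders.
      value: warm
      wait_for_completion: True
      wait_interval: 9
      max_wait: -1
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: pattern
      kind: suffix
      value: '-archive'
```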


(Micah Hunsberger) #3

Wow, thanks for pointing that issue out. I had noticed this behavior for a while, but I thought it was just taking a while to delete the indices. It was only when the disk started filling up a lot faster that I realized it was actually failing to delete the unshrunken index.

So, knowing this, would you say the Curator file should set the shrink node to a different node from the one the post allocation targets?


(David Turner) #4

I am no Curator expert, but I am guessing that shrink_node doesn't make a lot of sense outside of a shrink action. However, I've looked harder at your Curator action file and I do note that you set this:

      post_allocation:
        allocation_type: require
        key: 'node_type'
        value: 'cold'

Assuming that ELK-ES-A is not a cold node, does this not have the effect of moving the shrunken shards away? If it does, and there are no copies of this data (either shrunken or otherwise) left on the node then we might have to dig deeper.


(Micah Hunsberger) #5

"Assuming that ELK-ES-A is not a cold node, does this not have the effect of moving the shrunken shards away?"

ELK-ES-A happens to be the only cold node in this setup. Curator performs the shrink on ELK-ES-A and then sets index.routing.allocation.require.node_type to cold on the shrunken index, which keeps it on ELK-ES-A.
I probably should have edited the action file for this thread to make it clearer that the post allocation routes the shards to the same node as the shrink node.


(David Turner) #6

Ah right, this is all becoming clearer. Yes, shrinking on another node and then allowing post_allocation to move the shards to their final resting spot should do the job.
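
A minimal sketch of what that change might look like in the shrink action from the first post. The node name ELK-ES-B is hypothetical (the thread does not name a second node) and is assumed to be a data node with enough free space to hold the index during the shrink. Only shrink_node differs from the original; the options not shown here would stay as they were.

```yaml
actions:
  3:
    action: shrink
    description: >-
      Shrink rolled over logstash-* indices on a node other than the cold
      node, then let post_allocation move the result to the cold node.
    options:
      disable_action: False
      ignore_empty_list: True
      # Assumption: ELK-ES-B is a hypothetical second node, not the cold node.
      shrink_node: ELK-ES-B
      node_filters:
        permit_masters: True
      number_of_shards: 1
      number_of_replicas: 0
      shrink_prefix: ''
      shrink_suffix: '-archive'
      delete_after: True
      # Unchanged: the shrunken index is still required to live on the
      # cold node (ELK-ES-A in this setup).
      post_allocation:
        allocation_type: require
        key: 'node_type'
        value: 'cold'
      wait_for_completion: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: logstash-
    - filtertype: pattern
      kind: suffix
      value: '-archive'
      exclude: True
```

The alias filter from the original action file is left out here for brevity and would still be needed to skip the current write indices.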