Curator error when shrinking

I am getting the following error message when shrinking 4 indexes via Curator. I really can't tell from the DEBUG output whether this is a Python issue or an Elasticsearch load issue:

2019-05-08 02:41:35,249 ERROR                curator.cli                    run:191  Failed to complete action: shrink.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'127.0.0.1', port=9200): Read timed out. (read timeout=30))

This says that Curator lost its connection to the Elasticsearch host it was connected to. The default timeout is 30 seconds. You can increase this in the Curator client YAML configuration file with the timeout setting. However, it should not be keeping a client connection open for that long, as the shrink process should result in a task ID, which Curator then polls at an interval until complete.
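For reference, the timeout lives in the client section of the Curator configuration file, roughly like this (the value shown is only an example):

client:
  hosts:
    - 127.0.0.1
  port: 9200
  timeout: 300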

Thanks Aaron. We did notice manually that our disks were running out of space, so I wonder if that was the culprit with Curator. When we get our disks in order I will try again and see what the output is. It would be nice if Curator gave back the output of the allocation API decisions.

It might show that if loglevel: DEBUG is configured. Otherwise, feel free to add a feature request at https://github.com/elastic/curator
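In the same configuration file, the logging block would look something like this (a sketch, adjust to taste):

logging:
  loglevel: DEBUG
  logfile:
  logformat: default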

It seems like, even after our disk fixes, Curator does not do well with shrinking many indexes at once. In our case we are trying to do this and we get the below:

INFO      curator.actions.shrink              do_action:2132 Shrinking 355 selected indices:
2019-05-08 20:40:24,348 INFO      curator.actions.shrink              do_action:2138 Source index: application-2018.10.05 -- Target index: application-2018.10.05-shrink
2019-05-08 20:40:24,348 DEBUG     curator.actions.shrink       pre_shrink_check:2074 BEGIN PRE_SHRINK_CHECK
2019-05-08 20:40:24,348 DEBUG     curator.actions.shrink       pre_shrink_check:2075 Check that target exists
2019-05-08 20:40:24,357 DEBUG     curator.actions.shrink       pre_shrink_check:2077 Check doc count constraints
2019-05-08 20:40:24,397 DEBUG     curator.actions.shrink       pre_shrink_check:2079 Check shard count
2019-05-08 20:40:24,420 DEBUG     curator.actions.shrink       pre_shrink_check:2082 Check shard factor
2019-05-08 20:40:24,421 DEBUG     curator.actions.shrink       pre_shrink_check:2084 Check node availability
2019-05-08 20:41:47,157 ERROR                curator.cli                    run:191  Failed to complete action: shrink.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'127.0.0.1', port=9200): Read timed out. (read timeout=60))

I am now doing fewer indexes to see if the quantity was the problem, and I get the same issue. Is there a way to see what's happening under the hood? I have it in DEBUG mode already.

INFO      curator.actions.shrink              do_action:2132 Shrinking 7 selected indices: [u'application-2018.05.14', u'application-2018.05.16', u'application-2018.05.15', u'application-2018.05.17', u'application-2018.05.12', u'application-2018.05.13', u'application-2018.05.18']
2019-05-08 20:49:47,125 ERROR                curator.cli                    run:191  Failed to complete action: shrink.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host=u'127.0.0.1', port=9200): Read timed out. (read timeout=60))

This is probably why things are timing out:

You appear to be trying to find a place to put 355 indices in order to shrink them. As you are already experiencing disk space issues, this approach will be completely ineffective. If you have disk space issues, shrink is not going to correct that problem, as shrink only reduces the shard count, and not the amount of disk space consumed. The only things which will reduce disk space are:

  1. Delete old data (most immediately effective)
  2. Add more cluster nodes and let the shards spread out (also effective, but more costly, as it means adding hardware and the associated resource costs)
  3. Turn on best_compression and do a force merge (this will cost a ton of CPU and disk I/O to accomplish, with potentially minimal return on investment; a rough sketch of the API calls is just below this list).
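Roughly, the best_compression route would look like this for one of your indices (index name taken from your logs; index.codec can only be changed while the index is closed, and the force merge itself is the expensive part):

POST /application-2018.10.05/_close

PUT /application-2018.10.05/_settings
{
  "index.codec": "best_compression"
}

POST /application-2018.10.05/_open

POST /application-2018.10.05/_forcemerge?max_num_segments=1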

I understand these are not pleasant to hear, but they are the unvarnished truth.

If you notice, Aaron, the first time I tried doing around 300 indices, but the second time I only did about 7 or so, and the same result occurred. We have a 500GB ZFS volume mounted across ten nodes, so I don't think disk is the problem anymore; there is something under the hood that I am not able to pinpoint.

thanks

Okay, so it started working after a restart of the cluster, but now I see this:

2019-05-09 16:41:19,398 ERROR                curator.cli                    run:191  Failed to complete action: shrink.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: Unable to shrink index "application-2018.10.05" as not all shards were found on the designated shrink node (I8bplcd): [{'shard': u'2', 'primary': True}]

When I look at it with

GET /_cluster/allocation/explain?include_yes_decisions

I see this error:

reached the limit of outgoing shard recoveries [2] on the node [HN_wtcaDRY2GTVyStD1BRw] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"

Should I increase that setting?
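If so, I assume it would be a transient cluster settings update along these lines (the value is only an example; I have not applied anything yet):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}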

ZFS is not recommended for use with Elasticsearch because of how much memory the ARC tries to claim.

That depends™

  • How fast is your network backend between your data nodes?
  • How many queries per second is your cluster handling currently?
  • How much are you ingesting from moment to moment? In events or docs per second?

You still have not explained why you are trying to shrink 7 indices in a single pass, let alone 355. Shrinking is a highly network and disk I/O intensive process, in that a copy (primary or replica) of each shard must be migrated to the same node for the shrink process to work.
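For context, the rough manual equivalent of what Curator drives for each index looks something like this (the node and index names are only illustrative, and the shard relocation must finish before the _shrink call):

PUT /application-2018.10.05/_settings
{
  "index.routing.allocation.require._name": "chosen-shrink-node",
  "index.blocks.write": true
}

POST /application-2018.10.05/_shrink/application-2018.10.05-shrink
{
  "settings": {
    "index.number_of_shards": 1,
    "index.routing.allocation.require._name": null,
    "index.blocks.write": null
  }
}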

What is your use case for shrinking this many indices?

About a year ago we started off creating indexes with 5 primary shards and 1 replica, and after a year of indexing we noticed that our dataset does not require 5 primary shards and can do with 1.

The network is fast between the data nodes; they are VMs in the same zone.
Queries per second are minimal since this is a staging stack that is not touched.
Ingestion is also minimal, 15877/s.

We need to shrink a year's worth of indexes because our current shard settings are not efficient.

Do not attempt to shrink more than one index at a time until you have successfully shrunk a single index and all of the moving pieces are fully understood. Do one at a time for perhaps the first 3 to 10 indices, even, until you get a feel for the process: how long it takes, where it pauses, and why.
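For example, a shrink action file narrowed to a single index might look roughly like this (the option values and the index name are illustrative and should be adapted; shrink_node: DETERMINISTIC lets Curator pick the data node with the most free space):

actions:
  1:
    action: shrink
    description: "Shrink one index at a time while learning the process"
    options:
      shrink_node: DETERMINISTIC
      number_of_shards: 1
      number_of_replicas: 1
      shrink_suffix: '-shrink'
      delete_after: True
      wait_for_completion: True
      wait_interval: 9
      max_wait: -1
    filters:
      - filtertype: pattern
        kind: prefix
        value: application-2018.05.14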
