Can't shrink, how to debug curator?


(D Pattee) #1

Working on a test stack, elasticsearch and curator running on a single node, kibana running on a separate node, indexes set to zero repls, beat indexs set to the default 'daily index'... Experimenting with curator shrink and forcemerge features. Last time I was using curator I was just automating index drops...

Filebeat indexes ranging from 2-15 megs, metricbeat indexes ranging from 250-900 megs.

A shrink on a filebeat index returns almost immediately. Like shrink and forcemerge on a dozen indexes totalling around 70 megs using curator with an actionfile took just a couple seconds.

But using the same actionfile (filter prefix modified...) and 4 hours later the first 250 meg index was still not shrunk.

I aborted that and tried again with curator_cli thinking I might get more information, but not really... Once it starts up it just loops waiting for progress over and over and over...

CPU usage is <2% as if it isn't even trying to do anything...
(Load Avg: 1.31, 1.71, 1.83 CPU usage: 1.19% user, 1.19% sys, 97.61% idle)


(D Pattee) #2

curator_cli --config /Applications/elastic-curator/curator.yml shrink --filter_list '{"filtertype":"pattern","kind":"prefix","value":"metricbeat-6.5.0-2018.11.25"}' --node_filters '{"permit_masters":"True"}'

2018-12-03 21:55:25,640 DEBUG          curator.indexlist           __actionable:35   Index metricbeat-6.5.0-2018.11.25 is actionable and remains in the list.
2018-12-03 21:55:25,640 DEBUG          curator.indexlist       empty_list_check:226  Checking for empty list
2018-12-03 21:55:25,640 INFO      curator.actions.shrink              do_action:2121 Source index: metricbeat-6.5.0-2018.11.25 -- Target index: metricbeat-6.5.0-2018.11.25-shrink
2018-12-03 21:55:25,640 DEBUG     curator.actions.shrink       pre_shrink_check:2058 BEGIN PRE_SHRINK_CHECK
2018-12-03 21:55:25,640 DEBUG     curator.actions.shrink       pre_shrink_check:2059 Check that target exists
2018-12-03 21:55:25,641 DEBUG     curator.actions.shrink       pre_shrink_check:2061 Check doc count constraints
2018-12-03 21:55:25,647 DEBUG     curator.actions.shrink       pre_shrink_check:2063 Check shard count
2018-12-03 21:55:25,703 DEBUG     curator.actions.shrink       pre_shrink_check:2066 Check shard factor
2018-12-03 21:55:25,703 DEBUG     curator.actions.shrink       pre_shrink_check:2068 Check node availability
2018-12-03 21:55:25,881 DEBUG              curator.utils        node_id_to_name:1845 Name associated with node_id "mBJqx51QRq2IL-pdkjU5Hg": mBJqx51
2018-12-03 21:55:25,882 WARNING   curator.actions.shrink             _data_node:1895 Not skipping node "mBJqx51" which is a master node (not recommended), but permit_masters is True
2018-12-03 21:55:25,960 DEBUG     curator.actions.shrink       pre_shrink_check:2070 Check available disk space
2018-12-03 21:55:25,967 DEBUG     curator.actions.shrink           _check_space:1998 Sufficient space available for 2x the size of index "metricbeat-6.5.0-2018.11.25".  Required: 470023636, available: 11918053801984
2018-12-03 21:55:25,967 DEBUG     curator.actions.shrink       pre_shrink_check:2072 FINISH PRE_SHRINK_CHECK
2018-12-03 21:55:25,967 INFO      curator.actions.shrink              do_action:2125 Moving shards to shrink node: "mBJqx51"
2018-12-03 21:55:26,003 DEBUG              curator.utils            wait_for_it:1760 Elapsed time: 0 seconds
2018-12-03 21:55:26,004 DEBUG              curator.utils           health_check:1497 KWARGS= "{'relocating_shards': 0}"
2018-12-03 21:55:26,006 DEBUG              curator.utils           health_check:1517 MATCH: Value for key "0", health check data: 0
2018-12-03 21:55:26,006 INFO               curator.utils           health_check:1520 Health Check for all provided keys passed.
2018-12-03 21:55:26,006 DEBUG              curator.utils            wait_for_it:1763 Response: True
2018-12-03 21:55:26,006 DEBUG              curator.utils            wait_for_it:1768 Action "allocation" finished executing (may or may not have been successful)
2018-12-03 21:55:26,006 DEBUG              curator.utils            wait_for_it:1786 Result: True
2018-12-03 21:55:26,112 DEBUG              curator.utils           health_check:1497 KWARGS= "{'status': 'green'}"
2018-12-03 21:55:26,114 DEBUG              curator.utils           health_check:1517 MATCH: Value for key "green", health check data: green
2018-12-03 21:55:26,114 INFO               curator.utils           health_check:1520 Health Check for all provided keys passed.
2018-12-03 21:55:26,114 INFO      curator.actions.shrink              do_action:2135 Shrinking index "metricbeat-6.5.0-2018.11.25" to "metricbeat-6.5.0-2018.11.25-shrink" with settings: {'settings': {'index.number_of_shards': 1, 'index.number_of_replicas': 1}}, wait_for_active_shards=1
2018-12-03 21:55:56,242 DEBUG     curator.actions.shrink              do_action:2140 Wait for shards to complete allocation for index: metricbeat-6.5.0-2018.11.25-shrink
2018-12-03 21:55:56,242 DEBUG              curator.utils            wait_for_it:1760 Elapsed time: 0 seconds
2018-12-03 21:55:56,242 DEBUG              curator.utils           health_check:1497 KWARGS= "{'status': 'green'}"
2018-12-03 21:55:56,245 DEBUG              curator.utils           health_check:1511 NO MATCH: Value for key "green", health check data: red
2018-12-03 21:55:56,245 DEBUG              curator.utils            wait_for_it:1763 Response: False
2018-12-03 21:55:56,245 DEBUG              curator.utils            wait_for_it:1783 Action "shrink" not yet complete, 0 total seconds elapsed. Waiting 9 seconds before checking again.
2018-12-03 21:56:05,249 DEBUG              curator.utils            wait_for_it:1760 Elapsed time: 9 seconds
2018-12-03 21:56:05,250 DEBUG              curator.utils           health_check:1497 KWARGS= "{'status': 'green'}"
2018-12-03 21:56:05,253 DEBUG              curator.utils           health_check:1511 NO MATCH: Value for key "green", health check data: red
2018-12-03 21:56:05,253 DEBUG              curator.utils            wait_for_it:1763 Response: False
2018-12-03 21:56:05,253 DEBUG              curator.utils            wait_for_it:1783 Action "shrink" not yet complete, 9 total seconds elapsed. Waiting 9 seconds before checking again.
2018-12-03 21:56:14,256 DEBUG              curator.utils            wait_for_it:1760 Elapsed time: 18 seconds

...

2018-12-03 22:20:24,776 DEBUG              curator.utils            wait_for_it:1760 Elapsed time: 1468 seconds
2018-12-03 22:20:24,776 DEBUG              curator.utils           health_check:1497 KWARGS= "{'status': 'green'}"
2018-12-03 22:20:24,780 DEBUG              curator.utils           health_check:1511 NO MATCH: Value for key "green", health check data: red
2018-12-03 22:20:24,780 DEBUG              curator.utils            wait_for_it:1763 Response: False
2018-12-03 22:20:24,780 DEBUG              curator.utils            wait_for_it:1783 Action "shrink" not yet complete, 1468 total seconds elapsed. Waiting 9 seconds before checking again.
2018-12-03 22:20:33,790 DEBUG              curator.utils            wait_for_it:1760 Elapsed time: 1477 seconds
2018-12-03 22:20:33,791 DEBUG              curator.utils           health_check:1497 KWARGS= "{'status': 'green'}"
2018-12-03 22:20:33,794 DEBUG              curator.utils           health_check:1511 NO MATCH: Value for key "green", health check data: red
2018-12-03 22:20:33,794 DEBUG              curator.utils            wait_for_it:1763 Response: False
2018-12-03 22:20:33,794 DEBUG              curator.utils            wait_for_it:1783 Action "shrink" not yet complete, 1477 total seconds elapsed. Waiting 9 seconds before checking again.

(Aaron Mildenstein) #3

How many data nodes are in your test stack?


(D Pattee) #4

Just one. I already have the source indexes set to zero replication, and they are at 5 shards (which was the default for those metricbeat indexes). Cluster health was green when I started the operation (and of course it is red throughout the duration until I cancel the shrink and delete the target index).

I let the command line version run overnight, it is at ~13 hours now with no obvious progress. (no change in disk utilization, like a temporary doubling of that source index, no noticeable change in cpu or memory utilization for any of the java processes, etc)


(Aaron Mildenstein) #5

Yeah, something's wrong. A shrink should basically be instantaneous on a one node cluster, as there is nowhere else for the shards to be. Something is wrong here. Not sure what to make of it based on the logs you provided, but I can see that health check data: red when it was definitely green before. The shrink action in Curator is waiting for this status to be green. Some shard did not get properly handled, as red on a one node cluster means a primary shard is missing.


(D Pattee) #6

I aborted that single metricbeat index shrink and ran a shrink that hit 8 filebeat indices, it started at 12:55 and completed at 12:57. That set of source indexes only added up to 50 megs and 200k documents, compared to the single metricbeat index that was 225 megs with 540k documents, but no math can get me from <5 minutes to over 12 hours lol