Curator: Error with forcemerge

Hi,
I'm seeing the following error when running forcemerge from Curator:

{
	"@timestamp": "2018-06-19T07:14:14.643Z",
	"function": "run",
	"linenum": 184,
	"loglevel": "ERROR",
	"message": "Failed to complete action: forcemerge.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(504, u\"<html><body><h1>504 Gateway Time-out</h1>\\nThe server didn't respond in time.\\n</body></html>\\n\")",
	"name": "curator.cli"
}

Action config looks like this:

actions:
  1:
    action: forcemerge
    description: "Index forcemerge test"
    options:
      max_num_segments: 1
      timeout_override: 43200
      ignore_empty_list: true
      delay: 300
      continue_if_exception: false
      disable_action: false
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 1
    - filtertype: forcemerged
      max_num_segments: 1
      exclude: true

I don't see anything suspicious in the Elasticsearch logs, but I do see tasks queueing up, so it looks like it's working. However, I haven't yet seen an index get completely merged down to a single segment per shard.

Note: Curator is running from a Docker container in Kubernetes.

5XX errors are server side, while Curator is a client-side process. A 4XX error would indicate Curator made a bad call. A 504 error indicates that there is a proxy, load balancer, or other gateway between Curator and your Elasticsearch node. Whatever you set your timeout_override to, the timeout the gateway (whatever type it may be) allows is evidently shorter, so the gateway gives up first. More complete debug logging would show how long the client was connected, so you would be able to see this. It will usually be a nearly round number of seconds, like 60, 120, or 300.
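To make the point about connection duration concrete, here is a minimal Python sketch that flags a 504 arriving close to a typical gateway timeout. The names `COMMON_GATEWAY_TIMEOUTS` and `looks_like_gateway_timeout`, the list of timeout values, and the one-second tolerance are all assumptions for illustration; none of this is part of Curator.

```python
import time

# Typical proxy / load balancer timeouts in seconds.
# This list is an assumption; adjust it for your own environment.
COMMON_GATEWAY_TIMEOUTS = (30, 60, 120, 300, 600)


def looks_like_gateway_timeout(status, elapsed_seconds, tolerance=1.0):
    """Return True if a 504 arrived close to one of the common gateway timeouts."""
    if status != 504:
        return False
    return any(abs(elapsed_seconds - t) <= tolerance for t in COMMON_GATEWAY_TIMEOUTS)


def timed_call(func):
    """Run func() and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()
    result = func()
    return result, time.monotonic() - start
```

If the elapsed time reported in the DEBUG log sits within a second or so of one of those values, the gateway timeout is the likely culprit rather than Elasticsearch itself.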

This isn't something Curator can compensate for, unfortunately. Forcemerge doesn't record a _task in the Tasks API, or set a visible lock in the cluster state, or anything like that. What a forcemerge does set is an invisible block in the cluster state that prevents any other forcemerge from running while one is in progress. Because this is an opaque process, Curator simply cannot see enough of what's going on to reconnect and resume after a 504 disconnect.
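Since Curator cannot follow the merge after a disconnect, one way to check progress by hand is to look at segment counts per shard. The sketch below is only an illustration: it assumes a response shaped like the output of the Elasticsearch `GET /<index>/_segments` API (an `indices` map, each with a `shards` map of shard copies carrying `num_search_segments`), and the function name `shards_over_segment_limit` and the sample data are made up.

```python
def shards_over_segment_limit(segments_response, max_num_segments=1):
    """Return {(index, shard_id): highest segment count} for shards above the limit.

    segments_response is assumed to look like the JSON body of
    GET /<index>/_segments, already parsed into a dict.
    """
    unmerged = {}
    for index_name, index_data in segments_response.get("indices", {}).items():
        for shard_id, copies in index_data.get("shards", {}).items():
            # Each shard id maps to a list of copies (primary and replicas).
            worst = max((c.get("num_search_segments", 0) for c in copies), default=0)
            if worst > max_num_segments:
                unmerged[(index_name, shard_id)] = worst
    return unmerged


# Hand-written sample data, not real cluster output.
sample = {
    "indices": {
        "logs-2018.06.17": {
            "shards": {
                "0": [{"routing": {"primary": True}, "num_search_segments": 1}],
                "1": [{"routing": {"primary": True}, "num_search_segments": 5}],
            }
        }
    }
}

still_unmerged = shards_over_segment_limit(sample, max_num_segments=1)
```

An empty result would mean every shard copy is at or below `max_num_segments`, so the merge finished even though the client connection was dropped.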

Thanks, I'll have a look into how our load balancers are set up.
