Absolute values in cluster.routing.allocation.disk.watermark are not honored?

We run ES 2.3.3 on 6 data nodes with 1TB of disk space each.
The cluster.routing.allocation.disk.watermark settings are "low": "20gb", "high": "10gb" (the full cluster settings are here)
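For reference, the watermarks were applied through the cluster settings API, roughly like this (the request body below is a reconstruction of our settings, not a verbatim copy):

# low = minimum free space required before new shards are allocated to a node
# high = free space below which shards are relocated away from a node
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "20gb",
    "cluster.routing.allocation.disk.watermark.high": "10gb"
  }
}'

Per-node disk usage can be checked with:

curl -XGET 'localhost:9200/_cat/allocation?v'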

We also have Marvel plugin installed with free license.

Half of the nodes in our cluster reached 85% disk usage. Around that time, creation of new indices and updates of aliases started to fail. Here are the relevant log messages:

[2016-10-05 12:17:30,617][DEBUG][action.bulk              ] [esdata-6-2228493798-jio71] [.marvel-es-1-2016.10.05][0] failed to execute bulk item (index) index {[.marvel-es-1-2016.10.05][node_stats][AVeUxoMw4grHVilFfn-q], source[{"cluster_uuid":"AA2qu8pMTN
-X3Z04H6cDeQ","timestamp":"2016-10-05T12:17:00.208Z","source_node":{"uuid":"9iJdaocQSJO7oXo2Eux7XQ","host":"x.x.0.3","transport_address":"10.200.0.3:9300","ip":"10.200.0.3","name":"esdata-6-2228493798-jio71","attributes":{"master":"false"}},"node_stats
":{"node_id":"9iJdaocQSJO7oXo2Eux7XQ","node_master":false,"mlockall":true,"disk_threshold_enabled":true,"disk_threshold_watermark_high":100.0,"indices":{"docs":{"count":xxxxx},"store":{"size_in_bytes":xxxxx,"throttle_time_in_millis":0},"indexi
ng":{"index_total":xxxxx,"index_time_in_millis":66836428,"throttle_time_in_millis":0},"search":{"query_total":364632,"query_time_in_millis":20824188},"segments":{"count":8696}},"os":{"load_average":2.6533203125},"process":{"open_file_descriptors":1271
2,"max_file_descriptors":1000000,"cpu":{"percent":7}},"jvm":{"mem":{"heap_used_in_bytes":xxxxx,"heap_used_percent":65},"gc":{"collectors":{"young":{"collection_count":119572,"collection_time_in_millis":5672431},"old":{"collection_count":7,"collectio
n_time_in_millis":30127}}}},"thread_pool":{"bulk":{"rejected":0},"index":{"rejected":0},"search":{"rejected":0}},"fs":{"total":{"total_in_bytes":1056759873536,"free_in_bytes":162053849088,"available_in_bytes":113697746944}}}}]}
ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [node_stats]) within 30s]
        at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2016-10-05 12:17:30,619][ERROR][marvel.agent             ] [esdata-6-2228493798-jio71] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
        at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
        at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
        at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
        at java.lang.Thread.run(Thread.java:745)
        Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-1-2016.10.05], type [node_stats], id [AVeUxoMw4grHVilFfn-q], message [ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [node_stats]) within 30s]]];
                at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
                ... 3 more
        Caused by: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-1-2016.10.05], type [node_stats], id [AVeUxoMw4grHVilFfn-q], message [ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [node_stats]) within 30s]]]
                at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:118)
                at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
                ... 3 more

(BTW, I wonder what Marvel means by the "disk_threshold_watermark_high":100.0 value.)

Trying to create a test index timed out as well:

[2016-10-05 13:48:08,537][DEBUG][action.index             ] [esdata-2-246160862-vk0zb] failed to execute [index {[test][foo][1], source[{
  "a":"r"
}
]}] on [[test][3]]
ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [foo]) within 30s]
        at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
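For completeness, the test document we tried to index (reconstructed from the log above) was as simple as:

# indexing into a non-existent index auto-creates it and dynamically maps the [foo] type,
# which is the put-mapping cluster event that times out above
curl -XPUT 'localhost:9200/test/foo/1' -d '{
  "a": "r"
}'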

We added another node to the cluster. It joined happily, but relocation didn't start and the cluster remained red due to two unassigned shards belonging to a Marvel index from Sep 28. Only after removing that index did the cluster go green and relocation start.
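For anyone hitting the same state, the cluster health and the unassigned shards can be inspected with something like this (a sketch, not the exact commands we ran):

curl -XGET 'localhost:9200/_cluster/health?pretty'
curl -XGET 'localhost:9200/_cat/shards?v' | grep -i unassigned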

Any ideas what happened here?

Marvel only shows the correct value for diskThresholdWatermarkHigh if it's given as a percentage.

The logs show that Marvel tries to update the mapping (maybe because it got confused with the diskThresholdWatermarkHigh?). The logs don't show that the watermarks are not honoured. Are there any other lines in the logs that might be of interest?

As the disks on all your nodes are the same size, the simplest fix to re-enable Marvel might be to set the watermarks as percentages again.
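For example (85% / 90% are the defaults; adjust as needed):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'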

Hi Yannick,

We did not change the watermarks to absolute values just yesterday - they have been like that for months (and we prefer working with absolute values).

I think what Marvel was trying to do is create a new index and put a mapping, which failed. Our manual attempt to create a test index resulted in a similar error. Other than that, we got many errors when trying to update index aliases:

[2016-10-05 02:00:01,065][DEBUG][action.admin.indices.alias] [esmaster-3-xxxxx] failed to perform aliases
ProcessClusterEventTimeoutException[failed to process cluster event (index-aliases) within 30s]
        at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
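For context, these alias updates are plain _aliases calls along these lines (the index and alias names below are placeholders, not our real ones):

curl -XPOST 'localhost:9200/_aliases' -d '{
  "actions": [
    { "remove": { "index": "old-index", "alias": "current" } },
    { "add":    { "index": "new-index", "alias": "current" } }
  ]
}'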

The errors you see on index creation / alias updates show that the master node is too busy with existing tasks to process these requests within the 30-second timeout.

Can you have a look at the pending_tasks and post the output here?

https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-pending.html
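E.g.:

curl -XGET 'localhost:9200/_cluster/pending_tasks?pretty'

or, for a more compact view:

curl -XGET 'localhost:9200/_cat/pending_tasks?v'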

I did check the pending tasks during the problematic time frame. There was only a single task - the one that tried to update the alias. After it timed out, the task queue was empty until the next alias update attempt.

That's odd. It means that the task was submitted to the queue but did not get to execute. As you see no other tasks blocking it from execution, there is no good reason why this should fail. Can you restart the master node and check if the issue persists?

We managed to get out of this issue yesterday:

  • Added one more data node, but it did not help much - rebalancing didn't even start. However, creating new indices succeeded and all of their shards landed on the new node.
  • Removed one old .marvel- index which had unassigned shards - this did the trick and things returned to normal (roughly as sketched below).
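The removal itself was an ordinary index delete; assuming the usual .marvel-es-1-YYYY.MM.DD naming, it would have looked roughly like:

# index name reconstructed from the Sep 28 date, not copied from our console
curl -XDELETE 'localhost:9200/.marvel-es-1-2016.09.28'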