We run ES 2.3.3 on 6 data nodes with 1TB of disk space each.
The cluster.routing.allocation.disk.watermark settings are "low": "20gb" and "high": "10gb" (the full cluster settings are here).
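For reference, in case the linked settings go stale: watermarks like these can be inspected and set through the cluster settings REST API; the snippet below only mirrors the values quoted above, and the choice of "persistent" is illustrative. As we understand it, absolute values are minimum-free-space thresholds, so on 1TB disks the low watermark should only matter once a node drops below 20gb free.

# inspect current cluster-level settings
curl -XGET 'http://localhost:9200/_cluster/settings?pretty'

# set absolute free-space watermarks (values mirror the ones above)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "20gb",
    "cluster.routing.allocation.disk.watermark.high": "10gb"
  }
}'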
We also have the Marvel plugin installed with a free license.
Half of the nodes in our cluster reached 85% disk usage. Around that time, creation of new indices and updates of aliases started to fail. Here are the relevant log messages:
[2016-10-05 12:17:30,617][DEBUG][action.bulk ] [esdata-6-2228493798-jio71] [.marvel-es-1-2016.10.05][0] failed to execute bulk item (index) index {[.marvel-es-1-2016.10.05][node_stats][AVeUxoMw4grHVilFfn-q], source[{"cluster_uuid":"AA2qu8pMTN
-X3Z04H6cDeQ","timestamp":"2016-10-05T12:17:00.208Z","source_node":{"uuid":"9iJdaocQSJO7oXo2Eux7XQ","host":"x.x.0.3","transport_address":"10.200.0.3:9300","ip":"10.200.0.3","name":"esdata-6-2228493798-jio71","attributes":{"master":"false"}},"node_stats
":{"node_id":"9iJdaocQSJO7oXo2Eux7XQ","node_master":false,"mlockall":true,"disk_threshold_enabled":true,"disk_threshold_watermark_high":100.0,"indices":{"docs":{"count":xxxxx},"store":{"size_in_bytes":xxxxx,"throttle_time_in_millis":0},"indexi
ng":{"index_total":xxxxx,"index_time_in_millis":66836428,"throttle_time_in_millis":0},"search":{"query_total":364632,"query_time_in_millis":20824188},"segments":{"count":8696}},"os":{"load_average":2.6533203125},"process":{"open_file_descriptors":1271
2,"max_file_descriptors":1000000,"cpu":{"percent":7}},"jvm":{"mem":{"heap_used_in_bytes":xxxxx,"heap_used_percent":65},"gc":{"collectors":{"young":{"collection_count":119572,"collection_time_in_millis":5672431},"old":{"collection_count":7,"collectio
n_time_in_millis":30127}}}},"thread_pool":{"bulk":{"rejected":0},"index":{"rejected":0},"search":{"rejected":0}},"fs":{"total":{"total_in_bytes":1056759873536,"free_in_bytes":162053849088,"available_in_bytes":113697746944}}}}]}
ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [node_stats]) within 30s]
at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-10-05 12:17:30,619][ERROR][marvel.agent ] [esdata-6-2228493798-jio71] background thread had an uncaught exception
ElasticsearchException[failed to flush exporter bulks]
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:104)
at org.elasticsearch.marvel.agent.exporter.ExportBulk.close(ExportBulk.java:53)
at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:201)
at java.lang.Thread.run(Thread.java:745)
Suppressed: ElasticsearchException[failed to flush [default_local] exporter bulk]; nested: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-1-2016.10.05], type [node_stats], id [AVeUxoMw4grHVilFfn-q], message [ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [node_stats]) within 30s]]];
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:106)
... 3 more
Caused by: ElasticsearchException[failure in bulk execution:
[0]: index [.marvel-es-1-2016.10.05], type [node_stats], id [AVeUxoMw4grHVilFfn-q], message [ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [node_stats]) within 30s]]]
at org.elasticsearch.marvel.agent.exporter.local.LocalBulk.flush(LocalBulk.java:118)
at org.elasticsearch.marvel.agent.exporter.ExportBulk$Compound.flush(ExportBulk.java:101)
... 3 more
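In case it helps anyone reading this: the failure above is a put-mapping cluster-state task timing out after 30s, and such tasks queue up on the elected master. These are the standard endpoints for checking per-node disk usage as the allocation decider sees it and the backlog of pending cluster-state tasks (commands only, not output from our cluster):

# per-node disk usage and shard counts
curl -XGET 'http://localhost:9200/_cat/allocation?v'

# queued cluster-state update tasks (put-mapping waits in this queue)
curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty'

# elected master and overall health
curl -XGET 'http://localhost:9200/_cat/master?v'
curl -XGET 'http://localhost:9200/_cluster/health?pretty'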
(BTW, I wonder what Marvel means by the "disk_threshold_watermark_high":100.0 value.)
Trying to create a test index timed out as well:
[2016-10-05 13:48:08,537][DEBUG][action.index ] [esdata-2-246160862-vk0zb] failed to execute [index {[test][foo][1], source[{
"a":"r"
}
]}] on [[test][3]]
ProcessClusterEventTimeoutException[failed to process cluster event (put-mapping [foo]) within 30s]
at org.elasticsearch.cluster.service.InternalClusterService$2$1.run(InternalClusterService.java:349)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
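For completeness, the test was roughly the following (the host is a placeholder); the index name, type, and source match the log entry above:

curl -XPUT 'http://localhost:9200/test/foo/1' -d '{
  "a": "r"
}'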
We added another node to the cluster. It joined happily, but relocation didn't start and the cluster remained red because of two unassigned shards belonging to a Marvel index from Sep 28. Only after we removed that index did the cluster turn green and relocation start.
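For anyone wanting to do the same: unassigned shards can be listed with _cat/shards, and the stale index removed with a plain DELETE. The index name below is inferred from Marvel's daily naming pattern visible in the logs above (.marvel-es-1-YYYY.MM.DD), so treat it as illustrative:

# find unassigned shards
curl -XGET 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED

# delete the old Marvel daily index (name inferred from the pattern above)
curl -XDELETE 'http://localhost:9200/.marvel-es-1-2016.09.28'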
Any ideas what happened here?