Failed to process cluster event (create-index-template) within 30s

First post.

History

  • This is our production instance and it has been down now for 5 days
  • I have inherited a system that has been running for years without issue.
  • The system has not been updated, so I find it very surprising this problem has arisen.
  • I have also rebuilt the system completely from scratch (bosh setup scripts) and the problem remains.
  • I have never set up ES before; I am just learning by looking at the logs

Versions

Running inside of Amazon
# ./elasticsearch --version
Version: 5.5.2, Build: b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_141

VMs

  1. ingestor/0
  2. kibana/0
  3. maintenance/0
  4. cluster_monitor/0
  5. elasticsearch_data/0
  6. elasticsearch_master/0
  7. ls-router/0
  8. elasticsearch_data/1
  9. elasticsearch_data/2

Details

I have tried starting the ES services on one VM at a time to try to understand the sequencing of events.

  1. Started Kibana process

elasticsearch.stdout.log

[2019-09-13T23:33:27,583][INFO ][o.e.n.Node               ] [kibana/0] initialized
[2019-09-13T23:33:27,584][INFO ][o.e.n.Node               ] [kibana/0] starting ...
[2019-09-13T23:33:27,856][INFO ][o.e.t.TransportService   ] [kibana/0] publish_address {10.249.1.127:9300}, bound_addresses {0.0.0.0:9300}
[2019-09-13T23:33:27,882][INFO ][o.e.b.BootstrapChecks    ] [kibana/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-13T23:33:30,973][WARN ][o.e.d.z.ZenDiscovery     ] [kibana/0] not enough master nodes discovered during pinging (found [[]], but needed [1]), pinging again

# tail -f kibana.stdout.log

{"type":"log","@timestamp":"2019-09-13T23:33:54Z","tags":["warning","elasticsearch","admin"],"
pid":15331,"message":"No living connections"}
{"type":"log","@timestamp":"2019-09-13T23:33:57Z","tags":["warning","elasticsearch","admin"],"
pid":15331,"message":"Unable to revive connection: http://127.0.0.1:9200/"}
{"type":"log","@timestamp":"2019-09-13T23:33:57Z","tags":["warning","elasticsearch","admin"],"
pid":15331,"message":"No living connections"}
{"type":"log","@timestamp":"2019-09-13T23:33:59Z","tags":["status","plugin:elasticsearch@5.5.2
","error"],"pid":15331,"state":"red","message":"Status changed from red to red - Service Unava
ilable","prevState":"red","prevMsg":"Unable to connect to Elasticsearch at http://127.0.0.1:92
00."}

OK, this makes sense, as I believe the master is responsible for bringing up port 9200.
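To sanity-check that, I would expect a curl from the kibana VM against the address Kibana is using (127.0.0.1:9200, from the log above) to fail to connect until a master is up:

$ curl -s http://127.0.0.1:9200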

Master

Start up ES on the master.

Everything looks good until:

[2019-09-13T23:36:28,654][INFO ][o.e.p.PluginsService     ] [elasticsearch_master/0] no plugins loaded
[2019-09-13T23:36:28,748][WARN ][o.e.d.c.s.Setting        ] [path.scripts] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
[2019-09-13T23:36:32,345][WARN ][o.e.d.c.s.Setting        ] [path.scripts] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
[2019-09-13T23:36:36,475][INFO ][o.e.d.DiscoveryModule    ] [elasticsearch_master/0] using discovery type [zen]
[2019-09-13T23:36:41,243][INFO ][o.e.n.Node               ] [elasticsearch_master/0] initialized
[2019-09-13T23:36:41,244][INFO ][o.e.n.Node               ] [elasticsearch_master/0] starting ...
[2019-09-13T23:36:41,928][INFO ][o.e.t.TransportService   ] [elasticsearch_master/0] publish_address {10.249.1.121:9300}, bound_addresses {0.0.0.0:9300}
[2019-09-13T23:36:42,018][INFO ][o.e.b.BootstrapChecks    ] [elasticsearch_master/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-13T23:36:45,293][INFO ][o.e.c.s.ClusterService   ] [elasticsearch_master/0] new_master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300}, added {{kibana/0}{iSU1a7nRQXaJQVlMSsLbqA}{TPcWJyC-T6mI3lOVLrRRQg}{10.249.1.127}{10.249.1.127:9300},}, reason: zen-disco-elected-as-master ([1] nodes joined)[{kibana/0}{iSU1a7nRQXaJQVlMSsLbqA}{TPcWJyC-T6mI3lOVLrRRQg}{10.249.1.127}{10.249.1.127:9300}]
[2019-09-13T23:36:45,484][INFO ][o.e.h.n.Netty4HttpServerTransport] [elasticsearch_master/0] publish_address {10.249.1.121:9200}, bound_addresses {0.0.0.0:9200}
[2019-09-13T23:36:45,484][INFO ][o.e.n.Node               ] [elasticsearch_master/0] started
[2019-09-13T23:36:49,939][INFO ][o.e.g.GatewayService     ] [elasticsearch_master/0] recovered [9] indices into cluster_state
[2019-09-13T23:36:52,050][WARN ][r.suppressed             ] path: /_bulk, params: {}
java.lang.IllegalStateException: There are no ingest nodes in this cluster, unable to forward request to an ingest node.
        at org.elasticsearch.action.ingest.IngestActionForwarder.randomIngestNode(IngestActionForwarder.java:58) ~[elasticsearch-5.5.2.jar:5.5.2]

Continuing in a reply to this post due to size limitations.

Ingestor

Switch over to the ingestor VM and start it up.

All good, it sees the master:

[2019-09-13T23:38:14,908][INFO ][o.e.n.Node               ] [ingestor/0] initialized
[2019-09-13T23:38:14,910][INFO ][o.e.n.Node               ] [ingestor/0] starting ...
[2019-09-13T23:38:15,392][INFO ][o.e.t.TransportService   ] [ingestor/0] publish_address {10.249.1.128:9300}, bound_addresses {0.0.0.0:9300}
[2019-09-13T23:38:15,411][INFO ][o.e.b.BootstrapChecks    ] [ingestor/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-13T23:38:45,458][WARN ][o.e.n.Node               ] [ingestor/0] timed out while waiting for initial discovery state - timeout: 30s
[2019-09-13T23:38:45,472][INFO ][o.e.h.n.Netty4HttpServerTransport] [ingestor/0] publish_address {10.249.1.128:9200}, bound_addresses {0.0.0.0:9200}
[2019-09-13T23:38:45,472][INFO ][o.e.n.Node               ] [ingestor/0] started
[2019-09-13T23:38:49,837][INFO ][o.e.c.s.ClusterService   ] [ingestor/0] detected_master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300}, added {{elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300},{kibana/0}{iSU1a7nRQXaJQVlMSsLbqA}{TPcWJyC-T6mI3lOVLrRRQg}{10.249.1.127}{10.249.1.127:9300},}, reason: zen-disco-receive(from master [master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300} committed version [183]])

Timeout

So right after starting the ingestor, I now get the timeout error in the MASTER log:

[2019-09-13T23:39:48,814][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.5.2.jar:5.5.2]

Maintenance

I think the maintenance server is responsible for creating those indices, so I have started it up.

tail -f job.control.log 
+ URL=http://10.249.1.121:9200
++ awk '{print $1}'
++ curl -s http://10.249.1.121:9200/_cat/templates
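
For reference, the same check can be run by hand against the master (this is the same address and _cat/templates endpoint the job uses above; ?v just adds column headers):

$ curl -s 'http://10.249.1.121:9200/_cat/templates?v'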

No change in the master's log. It is still reporting:

[2019-09-13T23:45:35,852][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s

The Kibana web page is showing a status of RED.

Status Breakdown

|ID|Status|
| --- | --- |
|ui settings|Elasticsearch plugin is red|
|plugin:kibana@5.5.2|Ready|
|plugin:elasticsearch@5.5.2|Elasticsearch is still initializing the kibana index.|
|plugin:metrics@5.5.2|Ready|
|plugin:timelion@5.5.2|Ready|

Data0

Next I start 1 (of the 3) data servers.

Everything looks good in the cluster:

[2019-09-13T23:48:20,184][INFO ][o.e.t.TransportService   ] [elasticsearch_data/0] publish_address {10.249.1.124:9300}, bound_addresses {0.0.0.0:9300}
[2019-09-13T23:48:20,200][INFO ][o.e.b.BootstrapChecks    ] [elasticsearch_data/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-13T23:48:50,234][WARN ][o.e.n.Node               ] [elasticsearch_data/0] timed out while waiting for initial discovery state - timeout: 30s
[2019-09-13T23:48:50,245][INFO ][o.e.h.n.Netty4HttpServerTransport] [elasticsearch_data/0] publish_address {10.249.1.124:9200}, bound_addresses {0.0.0.0:9200}
[2019-09-13T23:48:50,245][INFO ][o.e.n.Node               ] [elasticsearch_data/0] started
[2019-09-13T23:48:54,621][INFO ][o.e.c.s.ClusterService   ] [elasticsearch_data/0] detected_master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300}, added {{ingestor/0}{aXP_QjwMT2qND61qqpBN_A}{MMWR6-tgRN-5dliZUGs72Q}{10.249.1.128}{10.249.1.128:9300},{elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300},{kibana/0}{iSU1a7nRQXaJQVlMSsLbqA}{TPcWJyC-T6mI3lOVLrRRQg}{10.249.1.127}{10.249.1.127:9300},}, reason: zen-disco-receive(from master [master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300} committed version [893]])

Continuing in a reply due to size limitations.

Master's log

[2019-09-13T23:49:36,998][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.5.2.jar:5.5.2]
        at java.util.ArrayList.forEach(ArrayList.java:1249) ~[?:1.8.0_141]
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.2.jar:5.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
[2019-09-13T23:49:37,000][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.5.2.jar:5.5.2]
        at java.util.ArrayList.forEach(ArrayList.java:1249) ~[?:1.8.0_141]
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.2.jar:5.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
[2019-09-13T23:49:37,245][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_master/0] [gc][754] overhead, spent [305ms] collecting in the last [1.1s]

I am hoping someone can tell me the following:

  1. Who is supposed to be creating the index? I have 9 VMs and am not sure which one is responsible for it.
  2. If it is a timeout, is the master trying to reach a different VM that is not up? If so, which one?
  3. Could it be a permissions issue on some drive somewhere that is leading to the failure to create the index (the system has run for years unchanged)?
  4. Could it be a port issue on some drive somewhere that is leading to the failure to create the index (the system has run for years unchanged and is scripted)?

Basically, at this point I am hoping someone can point me at something else to look at to try to track this down.

If you need any additional information (or full logs) please just ask.

Thank you for any responses, much appreciated.
David

This seems to indicate that updating and propagating the cluster state is taking far too long and is therefore timing out. This can happen for a number of reasons. As the cluster has been operational for a long time without issues and you have not done anything to it, it is possible that the amount of indices and data has grown over time and that either the cluster state has gotten too large or some nodes are struggling with heap pressure.

It would help if you could post the full output of the cluster stats API.
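
For example, something like this against the master address shown in your logs (any node in the cluster will do):

$ curl -s 'http://10.249.1.121:9200/_cluster/stats?human&pretty'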

The warning is about creating an index template, not an index, and I think it's being created by an external filebeat and not by Elasticsearch itself. I would expect that this template already exists, but filebeat issues another create request anyway to make sure. I also imagine that this is timing out while in the queue because the master is busy doing something else, since creating a template should be a pretty quick operation.

No, the master will only be communicating with nodes that are running (or at least, were running very recently).

I doubt that would result in a timeout.

I don't know what you mean by "port issue on a drive" but it's possible that slow IO on one or more nodes could cause this timeout if that's what you mean.

In addition to @Christian_Dahlqvist's suggestion to share the cluster stats output, I would suggest looking at GET _cluster/pending_tasks to see what is in the master's task queue. It is possible that it is overloaded by tasks resulting from the restart, particularly if this cluster is not often restarted.
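
For example, again using the master address from your logs:

$ curl -s 'http://10.249.1.121:9200/_cluster/pending_tasks?pretty'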

I appreciate the responses.

Since this has been running for a long time and has now suddenly failed, based on your responses I thought it might be related to underpowered VMs.

We were using an m3.medium (3.75 GiB RAM, 1 vCPU), so I have recreated it as an m3.large (7.5 GiB RAM, 2 vCPUs).

The 3 data nodes had 388 GB of data before I re-created them.
We have around 700 different machines feeding in via filebeat, and each machine probably has at least 7 files.

Looks like the server is up and running again now. There is still a massive number of pending tasks, but I am pretty sure it will catch up.

Thank you so much for the direction.

Here is the requested information.

$ curl http://10.249.1.121:9200/_cluster/health

{
  "cluster_name": "logsearch",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 3,
  "active_primary_shards": 37,
  "active_shards": 74,
  "relocating_shards": 2,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 1144,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 317937,
  "active_shards_percent_as_number": 100.0
}

$ curl http://10.249.1.121:9200/_cluster/stats

{
  "_nodes": {
    "total": 6,
    "successful": 6,
    "failed": 0
  },
  "cluster_name": "logsearch",
  "timestamp": 1568507572097,
  "status": "green",
  "indices": {
    "count": 11,
    "shards": {
      "total": 74,
      "primaries": 37,
      "replication": 1.0,
      "index": {
        "shards": { "min": 2, "max": 10, "avg": 6.7272727272727275 },
        "primaries": { "min": 1, "max": 5, "avg": 3.3636363636363638 },
        "replication": { "min": 1.0, "max": 1.0, "avg": 1.0 }
      }
    },
    "docs": {
      "count": 2814782,
      "deleted": 0
    },
    "store": {
      "size_in_bytes": 3683627954,
      "throttle_time_in_millis": 0
    },
    "fielddata": {
      "memory_size_in_bytes": 0,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 992,
      "total_count": 339,
      "hit_count": 146,
      "miss_count": 193,
      "cache_size": 35,
      "cache_count": 35,
      "evictions": 0
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 300,
      "memory_in_bytes": 12099179,
      "terms_memory_in_bytes": 9347448,
      "stored_fields_memory_in_bytes": 1466280,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 55296,
      "points_memory_in_bytes": 151987,
      "doc_values_memory_in_bytes": 1078168,
      "index_writer_memory_in_bytes": 0,
      "version_map_memory_in_bytes": 0,
      "fixed_bit_set_memory_in_bytes": 0,
      "max_unsafe_auto_id_timestamp": 1568507358121,
      "file_sizes": {}
    }
  },
  "nodes": {
    "count": {
      "total": 6,
      "data": 3,
      "coordinating_only": 2,
      "master": 1,
      "ingest": 3
    },
    "versions": ["5.5.2"],
    "os": {
      "available_processors": 12,
      "allocated_processors": 12,
      "names": [{ "name": "Linux", "count": 6 }],
      "mem": {
        "total_in_bytes": 39212023808,
        "free_in_bytes": 8356122624,
        "used_in_bytes": 30855901184,
        "free_percent": 21,
        "used_percent": 79
      }
    },
    "process": {
      "cpu": { "percent": 116 },
      "open_file_descriptors": { "min": 238, "max": 1511, "avg": 485 }
    },
    "jvm": {
      "max_uptime_in_millis": 1862445,
      "versions": [{
        "version": "1.8.0_141",
        "vm_name": "OpenJDK 64-Bit Server VM",
        "vm_version": "25.141-b15",
        "vm_vendor": "Oracle Corporation",
        "count": 6
      }],
      "mem": {
        "heap_used_in_bytes": 4304717672,
        "heap_max_in_bytes": 17939300352
      },
      "threads": 275
    },
    "fs": {
      "total_in_bytes": 1565499260928,
      "free_in_bytes": 1560262987776,
      "available_in_bytes": 1480598691840
    },
    "plugins": [],
    "network_types": {
      "transport_types": { "netty4": 6 },
      "http_types": { "netty4": 6 }
    }
  }
}

$ curl http://10.249.1.121:9200/_cluster/pending_tasks

I stopped the output; there are at least 5372 tasks pending.
Here is one of them:

{
	"insert_order": 16603,
	"priority": "URGENT",
	"source": "create-index-template [filebeat-6.1.2], cause [api]",
	"executing": true,
	"time_in_queue_millis": 29857,
	"time_in_queue": "29.8s"
}
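
A rough way to count them without dumping the whole response (each pending task has a single "insert_order" field, so grepping for it and counting the matches gives the queue length):

$ curl -s http://10.249.1.121:9200/_cluster/pending_tasks | grep -o '"insert_order"' | wc -l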

David