Failed to process cluster event (create-index-template) within 30s

First post.

History

  • This is our production instance and it has been down now for 5 days
  • I have inherited a system that has been running for years without issue.
  • The system has not been updated, so I find it very surprising this problem has arisen.
  • I have also rebuilt the system completely from scratch (bosh setup scripts) and the problem remains.
  • I have never set up ES before; I am just learning by looking at the logs.

Versions

Running inside Amazon
# ./elasticsearch --version
Version: 5.5.2, Build: b2f0c09/2017-08-14T12:33:14.154Z, JVM: 1.8.0_141

VMs

  1. ingestor/0
  2. kibana/0
  3. maintenance/0
  4. cluster_monitor/0
  5. elasticsearch_data/0
  6. elasticsearch_master/0
  7. ls-router/0
  8. elasticsearch_data/1
  9. elasticsearch_data/2

Details

I have tried starting the ES services on one VM at a time to understand the sequencing of events.

  1. Started Kibana process

elasticsearch.stdout.log

[2019-09-13T23:33:27,583][INFO ][o.e.n.Node               ] [kibana/0] initialized
[2019-09-13T23:33:27,584][INFO ][o.e.n.Node               ] [kibana/0] starting ...
[2019-09-13T23:33:27,856][INFO ][o.e.t.TransportService   ] [kibana/0] publish_address {10.249.1.127:9300}, bound_addresses {0.0.0.0:9300}
[2019-09-13T23:33:27,882][INFO ][o.e.b.BootstrapChecks    ] [kibana/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-13T23:33:30,973][WARN ][o.e.d.z.ZenDiscovery     ] [kibana/0] not enough master nodes discovered during pinging (found [[]], but needed [1]), pinging again

# tail -f kibana.stdout.log

{"type":"log","@timestamp":"2019-09-13T23:33:54Z","tags":["warning","elasticsearch","admin"],"
pid":15331,"message":"No living connections"}
{"type":"log","@timestamp":"2019-09-13T23:33:57Z","tags":["warning","elasticsearch","admin"],"
pid":15331,"message":"Unable to revive connection: http://127.0.0.1:9200/"}
{"type":"log","@timestamp":"2019-09-13T23:33:57Z","tags":["warning","elasticsearch","admin"],"
pid":15331,"message":"No living connections"}
{"type":"log","@timestamp":"2019-09-13T23:33:59Z","tags":["status","plugin:elasticsearch@5.5.2
","error"],"pid":15331,"state":"red","message":"Status changed from red to red - Service Unava
ilable","prevState":"red","prevMsg":"Unable to connect to Elasticsearch at http://127.0.0.1:92
00."}

OK, this makes sense, as I believe the master is responsible for bringing up port 9200.

Master

Start up ES on master.

Everything looks good until

[2019-09-13T23:36:28,654][INFO ][o.e.p.PluginsService     ] [elasticsearch_master/0] no plugins loaded
[2019-09-13T23:36:28,748][WARN ][o.e.d.c.s.Setting        ] [path.scripts] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
[2019-09-13T23:36:32,345][WARN ][o.e.d.c.s.Setting        ] [path.scripts] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
[2019-09-13T23:36:36,475][INFO ][o.e.d.DiscoveryModule    ] [elasticsearch_master/0] using discovery type [zen]
[2019-09-13T23:36:41,243][INFO ][o.e.n.Node               ] [elasticsearch_master/0] initialized
[2019-09-13T23:36:41,244][INFO ][o.e.n.Node               ] [elasticsearch_master/0] starting ...
[2019-09-13T23:36:41,928][INFO ][o.e.t.TransportService   ] [elasticsearch_master/0] publish_address {10.249.1.121:9300}, bound_addresses {0.0.0.0:9300}
[2019-09-13T23:36:42,018][INFO ][o.e.b.BootstrapChecks    ] [elasticsearch_master/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-13T23:36:45,293][INFO ][o.e.c.s.ClusterService   ] [elasticsearch_master/0] new_master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300}, added {{kibana/0}{iSU1a7nRQXaJQVlMSsLbqA}{TPcWJyC-T6mI3lOVLrRRQg}{10.249.1.127}{10.249.1.127:9300},}, reason: zen-disco-elected-as-master ([1] nodes joined)[{kibana/0}{iSU1a7nRQXaJQVlMSsLbqA}{TPcWJyC-T6mI3lOVLrRRQg}{10.249.1.127}{10.249.1.127:9300}]
[2019-09-13T23:36:45,484][INFO ][o.e.h.n.Netty4HttpServerTransport] [elasticsearch_master/0] publish_address {10.249.1.121:9200}, bound_addresses {0.0.0.0:9200}
[2019-09-13T23:36:45,484][INFO ][o.e.n.Node               ] [elasticsearch_master/0] started
[2019-09-13T23:36:49,939][INFO ][o.e.g.GatewayService     ] [elasticsearch_master/0] recovered [9] indices into cluster_state
[2019-09-13T23:36:52,050][WARN ][r.suppressed             ] path: /_bulk, params: {}
java.lang.IllegalStateException: There are no ingest nodes in this cluster, unable to forward request to an ingest node.
        at org.elasticsearch.action.ingest.IngestActionForwarder.randomIngestNode(IngestActionForwarder.java:58) ~[elasticsearch-5.5.2.jar:5.5.2]

Continuing in a reply to this post due to size limitations.

Ingestor

Switching over to the ingestor VM and starting it up.

All good; it sees the master:

[2019-09-13T23:38:14,908][INFO ][o.e.n.Node               ] [ingestor/0] initialized
[2019-09-13T23:38:14,910][INFO ][o.e.n.Node               ] [ingestor/0] starting ...
[2019-09-13T23:38:15,392][INFO ][o.e.t.TransportService   ] [ingestor/0] publish_address {10.249.1.128:9300}, bound_addresses {0.0.0.0:9300}
[2019-09-13T23:38:15,411][INFO ][o.e.b.BootstrapChecks    ] [ingestor/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-13T23:38:45,458][WARN ][o.e.n.Node               ] [ingestor/0] timed out while waiting for initial discovery state - timeout: 30s
[2019-09-13T23:38:45,472][INFO ][o.e.h.n.Netty4HttpServerTransport] [ingestor/0] publish_address {10.249.1.128:9200}, bound_addresses {0.0.0.0:9200}
[2019-09-13T23:38:45,472][INFO ][o.e.n.Node               ] [ingestor/0] started
[2019-09-13T23:38:49,837][INFO ][o.e.c.s.ClusterService   ] [ingestor/0] detected_master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300}, added {{elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300},{kibana/0}{iSU1a7nRQXaJQVlMSsLbqA}{TPcWJyC-T6mI3lOVLrRRQg}{10.249.1.127}{10.249.1.127:9300},}, reason: zen-disco-receive(from master [master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300} committed version [183]])

Timeout

So right after starting the ingestor, I now get the timeout error in the MASTER log

[2019-09-13T23:39:48,814][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.5.2.jar:5.5.2]

maintenance

I think the maintenance server is responsible for creating those indices, so I have started it up.

tail -f job.control.log 
+ URL=http://10.249.1.121:9200
++ awk '{print $1}'
++ curl -s http://10.249.1.121:9200/_cat/templates

No change in the master's log. It is still reporting:

[2019-09-13T23:45:35,852][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s

The Kibana web page is showing a status of RED.

Status Breakdown

|ID|Status|
| --- | --- |
|ui settings|Elasticsearch plugin is red|
|plugin:kibana@5.5.2|Ready|
|plugin:elasticsearch@5.5.2|Elasticsearch is still initializing the kibana index.|
|plugin:metrics@5.5.2|Ready|
|plugin:timelion@5.5.2|Ready|

Data0

If I start 1 (of the 3) data servers, everything looks good in the cluster:

[2019-09-13T23:48:20,184][INFO ][o.e.t.TransportService   ] [elasticsearch_data/0] publish_address {10.249.1.124:9300}, bound_addresses {0.0.0.0:9300}
[2019-09-13T23:48:20,200][INFO ][o.e.b.BootstrapChecks    ] [elasticsearch_data/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-13T23:48:50,234][WARN ][o.e.n.Node               ] [elasticsearch_data/0] timed out while waiting for initial discovery state - timeout: 30s
[2019-09-13T23:48:50,245][INFO ][o.e.h.n.Netty4HttpServerTransport] [elasticsearch_data/0] publish_address {10.249.1.124:9200}, bound_addresses {0.0.0.0:9200}
[2019-09-13T23:48:50,245][INFO ][o.e.n.Node               ] [elasticsearch_data/0] started
[2019-09-13T23:48:54,621][INFO ][o.e.c.s.ClusterService   ] [elasticsearch_data/0] detected_master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300}, added {{ingestor/0}{aXP_QjwMT2qND61qqpBN_A}{MMWR6-tgRN-5dliZUGs72Q}{10.249.1.128}{10.249.1.128:9300},{elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300},{kibana/0}{iSU1a7nRQXaJQVlMSsLbqA}{TPcWJyC-T6mI3lOVLrRRQg}{10.249.1.127}{10.249.1.127:9300},}, reason: zen-disco-receive(from master [master {elasticsearch_master/0}{7teHHoZbS7C-p45hBlnuKg}{v9veDuo7SMqKf6ZWmi9esg}{10.249.1.121}{10.249.1.121:9300} committed version [893]])

Continuing in Reply due to size limitations

Master's log:

[2019-09-13T23:49:36,998][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.5.2.jar:5.5.2]
        at java.util.ArrayList.forEach(ArrayList.java:1249) ~[?:1.8.0_141]
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.2.jar:5.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
[2019-09-13T23:49:37,000][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.5.2.jar:5.5.2]
        at java.util.ArrayList.forEach(ArrayList.java:1249) ~[?:1.8.0_141]
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$onTimeout$1(ClusterService.java:254) ~[elasticsearch-5.5.2.jar:5.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.5.2.jar:5.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
[2019-09-13T23:49:37,245][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch_master/0] [gc][754] overhead, spent [305ms] collecting in the last [1.1s]

I am hoping someone can tell me the following:

  1. Who is supposed to be creating the index? I have 9 VMs and am not sure which one is responsible for it.
  2. If it is a timeout, is the master trying to reach a different VM that is not up? If so, which one?
  3. Could it be a permissions issue on some drive somewhere that is leading to the failure creating the index? (The system has run for years unchanged.)
  4. Could it be a port issue on some drive somewhere that is leading to the failure creating the index? (The system has run for years unchanged and is scripted.)

Basically, at this point I am hoping someone can point me at something else to look at so I can track this down.

If you need any additional information (or full logs) please just ask.

Thank you for any responses, much appreciated.
David

This seems to indicate that updating and propagating the cluster state is taking far too long and is therefore timing out. This can happen for a number of reasons. As the cluster has been operational for a long time without issues and you have not done anything to it, it is possible that the amount of indices and data has grown over time and that either the cluster state has gotten too large or some nodes are struggling with heap pressure.

It would help if you can post the full output of the cluster stats API.

The warning is about creating an index template, not an index, and I think it's being created by an external filebeat and not by Elasticsearch itself. I would expect that this template already exists, but filebeat is issuing another create request to make sure anyway. I also imagine that this is timing out while in the queue because the master is busy doing something else, because creating a template should be a pretty quick operation.

No, the master will only be communicating with nodes that are running (or at least, were running very recently).

I doubt that would result in a timeout.

I don't know what you mean by "port issue on a drive" but it's possible that slow IO on one or more nodes could cause this timeout if that's what you mean.

In addition to @Christian_Dahlqvist's suggestion to share the cluster stats output, I would suggest looking at GET _cluster/pending_tasks to see what is in the master's task queue. It is possible that it is overloaded by tasks resulting from the restart, particularly if this cluster is not often restarted.
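
For example, against the master's HTTP endpoint from your logs (adjust the host to wherever you normally query the cluster):

# lists the cluster-state update tasks currently queued on the master
$ curl 'http://10.249.1.121:9200/_cluster/pending_tasks?pretty'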

I appreciate the responses.

Since this has been running for a long time and has now suddenly failed, based on your responses I thought it might be related to underpowered VMs.

We were using a m3.medium (3.75 GiB RAM, 1 vCPU), so I have recreated it as m3.large (7.5 GiB RAM, 2 vCPUs).

The 3 data nodes had 388 GB of data before I re-created them.
We have around 700 different machines feeding in via filebeat; each machine probably ships at least 7 files.

Looks like the server is up and running again now; there is still a massive number of pending tasks, but I am pretty sure it will catch up.

Thank you so much for the direction.

Here is the requested information.

$ curl http://10.249.1.121:9200/_cluster/health

{
	"cluster_name": "logsearch",
	"status": "green",
	"timed_out": false,
	"number_of_nodes": 6,
	"number_of_data_nodes": 3,
	"active_primary_shards": 37,
	"active_shards": 74,
	"
	relocating_shards ":2,"
	initializing_shards ":0,"
	unassigned_shards ":0,"
	delayed_unassigned_shards ":0,"
	number_of_pending_tasks ":1144,"
	number_of_in_flight_fetch ": 0,
	"task_max_waiting_in_queue_millis": 317937,
	"active_shards_percent_as_number": 100.0
}

$ curl http://10.249.1.121:9200/_cluster/stats

{
	"_nodes": {
		"total": 6,
		"successful": 6,
		"failed": 0
	},
	"cluster_name": "logsearch",
	"timestamp": 1568507572097,
	"status": "green",
	"indices": {
		"count": 11,
		"shards": {
			"total": 74,
			"primaries": 37,
			"replication": 1.0,
			"index": {
				"shards": { "min": 2, "max": 10, "avg": 6.7272727272727275 },
				"primaries": { "min": 1, "max": 5, "avg": 3.3636363636363638 },
				"replication": { "min": 1.0, "max": 1.0, "avg": 1.0 }
			}
		},
		"docs": { "count": 2814782, "deleted": 0 },
		"store": { "size_in_bytes": 3683627954, "throttle_time_in_millis": 0 },
		"fielddata": { "memory_size_in_bytes": 0, "evictions": 0 },
		"query_cache": {
			"memory_size_in_bytes": 992,
			"total_count": 339,
			"hit_count": 146,
			"miss_count": 193,
			"cache_size": 35,
			"cache_count": 35,
			"evictions": 0
		},
		"completion": { "size_in_bytes": 0 },
		"segments": {
			"count": 300,
			"memory_in_bytes": 12099179,
			"terms_memory_in_bytes": 9347448,
			"stored_fields_memory_in_bytes": 1466280,
			"term_vectors_memory_in_bytes": 0,
			"norms_memory_in_bytes": 55296,
			"points_memory_in_bytes": 151987,
			"doc_values_memory_in_bytes": 1078168,
			"index_writer_memory_in_bytes": 0,
			"version_map_memory_in_bytes": 0,
			"fixed_bit_set_memory_in_bytes": 0,
			"max_unsafe_auto_id_timestamp": 1568507358121,
			"file_sizes": {}
		}
	},
	"nodes": {
		"count": {
			"total": 6,
			"data": 3,
			"coordinating_only": 2,
			"master": 1,
			"ingest": 3
		},
		"versions": ["5.5.2"],
		"os": {
			"available_processors": 12,
			"allocated_processors": 12,
			"names": [{ "name": "Linux", "count": 6 }],
			"mem": {
				"total_in_bytes": 39212023808,
				"free_in_bytes": 8356122624,
				"used_in_bytes": 30855901184,
				"free_percent": 21,
				"used_percent": 79
			}
		},
		"process": {
			"cpu": { "percent": 116 },
			"open_file_descriptors": { "min": 238, "max": 1511, "avg": 485 }
		},
		"jvm": {
			"max_uptime_in_millis": 1862445,
			"versions": [{
				"version": "1.8.0_141",
				"vm_name": "OpenJDK 64-Bit Server VM",
				"vm_version": "25.141-b15",
				"vm_vendor": "Oracle Corporation",
				"count": 6
			}],
			"mem": { "heap_used_in_bytes": 4304717672, "heap_max_in_bytes": 17939300352 },
			"threads": 275
		},
		"fs": {
			"total_in_bytes": 1565499260928,
			"free_in_bytes": 1560262987776,
			"available_in_bytes": 1480598691840
		},
		"plugins": [],
		"network_types": {
			"transport_types": { "netty4": 6 },
			"http_types": { "netty4": 6 }
		}
	}
}

$ curl http://10.249.1.121:9200/_cluster/pending_tasks

I stopped the output; there are at least 5,372 tasks pending.
Here is one of them:

{
	"insert_order": 16603,
	"priority": "URGENT",
	"source": "create-index-template [filebeat-6.1.2], cause [api]",
	"executing": true,
	"time_in_queue_millis": 29857,
	"time_in_queue": "29.8s"
}

David

After restarting the master with a larger VM, the cluster is staying RED

$ curl http://10.249.1.121:9200/_cluster/health?pretty=true

{
  "cluster_name" : "logsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 80,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 604,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 3754049,
  "active_shards_percent_as_number" : 0.0
}

Before upping the master to a 4 CPU version, the 2 CPUs were maxed out and the pending tasks had climbed to 32,000.

Now at 4 CPUs, it appears to be holding its own, hovering somewhere around the 600 mark. The master is still reporting the 30-second timeout though:

[2019-09-15T21:52:04,744][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.5.2.jar:5.5.2]
        at java.util.ArrayList.forEach(ArrayList.java:1249) ~[?:1.8.0_141]

What does this mean:

  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 80,

Why would there be unassigned shards, but no active shards?
Which VM is responsible for the shards?

1. elasticsearch_data/2 (46178f6d-d405-42e1-8e97-b67dcccf5d85)
2. ingestor/0 (cdac6d90-2a1f-40a1-9dbd-e3d604d414c5)
3. cluster_monitor/0 (213d90a9-28dd-40c0-8ff1-3f53e327cb07)
4. elasticsearch_data/1 (6f0ae4fd-d450-45ef-9fad-d9ba2f90afa3)
5. ls-router/0 (60a7d5a9-b050-47fd-8849-198ca0c1606c)
6. elasticsearch_data/0 (d4264c8e-5fc6-42ea-9ce0-6371eb006767)
7. maintenance/0 (b4af6d64-1318-42a9-a1fd-1e38fddaf84f)
8. kibana/0 (b9eece03-d7b1-4d5a-a2fe-39ad1c137d66)
9. elasticsearch_master/0 (8f040e57-b775-4cf8-83a5-06dda399b6ab)

Thanks,
David

Some additional output; I had to remove some of it to get under the 7,000-byte limit for posting.

$ curl http://10.249.1.121:9200/_cat/shards

filebeat-6.1.2-2019.09.11             2 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.11             2 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.11             1 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.11             1 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.11             0 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.11             0 r UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 2 p UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 2 r UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 4 p UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 4 r UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 3 p UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 3 r UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 1 p UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 1 r UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 0 p UNASSIGNED                                                                                                       
logs-%{[@metadata][index]}-2019.09.13 0 r UNASSIGNED                                                                                                       
.kibana                               0 p UNASSIGNED                                                                                                       
.kibana                               0 r UNASSIGNED                   
filebeat-6.1.2-2019.09.09             2 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.09             2 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.09             1 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.09             1 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.09             0 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.09             0 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.15             2 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.15             2 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.15             1 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.15             1 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.15             0 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.15             0 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.13             2 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.13             2 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.13             1 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.13             1 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.13             0 p UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.13             0 r UNASSIGNED                                                                                                       
filebeat-6.1.2-2019.09.12             2 p UNASSIGNED
filebeat-6.1.2-2019.09.12             2 r UNASSIGNED
filebeat-6.1.2-2019.09.12             1 p UNASSIGNED
filebeat-6.1.2-2019.09.12             1 r UNASSIGNED
filebeat-6.1.2-2019.09.12             0 p UNASSIGNED
filebeat-6.1.2-2019.09.12             0 r UNASSIGNED
filebeat-6.1.2-2019.09.10             2 p UNASSIGNED
filebeat-6.1.2-2019.09.10             2 r UNASSIGNED
filebeat-6.1.2-2019.09.10             1 p UNASSIGNED
filebeat-6.1.2-2019.09.10             1 r UNASSIGNED
filebeat-6.1.2-2019.09.10             0 p UNASSIGNED
filebeat-6.1.2-2019.09.10             0 r UNASSIGNED
filebeat-6.1.2-2019.09.14             2 p UNASSIGNED
filebeat-6.1.2-2019.09.14             2 r UNASSIGNED
filebeat-6.1.2-2019.09.14             1 p UNASSIGNED
filebeat-6.1.2-2019.09.14             1 r UNASSIGNED
filebeat-6.1.2-2019.09.14             0 p UNASSIGNED
filebeat-6.1.2-2019.09.14             0 r UNASSIGNED

Looking into this more, since the master keeps reporting:

[2019-09-16T01:47:28,870][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s

I had a good look:

$ curl http://10.249.1.121:9200/_template/filebeat-6.1.2

After I run that json through JSONLint, it is 65,000 lines long.

So I looked into changing the timeout (v5.5.2), but I am having a hard time finding documentation that is explicit. Everyone says to set the "request_timeout", but no one says where.

I edited this file:
elasticsearch/config/elasticsearch.yml

All settings in that file are commented out (I assume we are using all defaults).

I have tried adding the following to the bottom of that file, and restarted elasticsearch (master):
vim /var/vcap/packages/elasticsearch/config/elasticsearch.yml

elasticsearch.requestTimeout: 60000                                                                                                                                     
elasticsearch.shardTimeout: 60000     

request_timeout: 60000                                                                                                                                     
shard_timeout: 60000     

The error message still says 30s.

We deploy via bosh, so if modifying this file works temporarily, and someone knows what to change in bosh (which is far more complicated), I would also like to know that.

TIA,
David

It's a request parameter, not something you can set server-wide.
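
If you issue the put-template request yourself, the timeout in that error should correspond to the master_timeout query parameter on the request. A hedged sketch (template.json is just a placeholder for the template body, and this won't change anything for the requests filebeat itself sends):

# re-submit the template with a longer master timeout (placeholder body file)
$ curl -XPUT 'http://10.249.1.121:9200/_template/filebeat-6.1.2?master_timeout=60s' -H 'Content-Type: application/json' -d @template.json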

Can you share the whole list of pending tasks? Use https://gist.github.com/ since it'll be too long for a post here.

For some reason GitHub Gist kept timing out on me.
Here is a previous pending-tasks listing from yesterday; this one was handy.

Thanks, that works too.

$ cat pending.json | grep source | sort | uniq -c
   3       "source": "_add_listener_",
   1       "source": "cluster_reroute(async_shard_fetch)",
 533       "source": "create-index-template [filebeat-6.1.2], cause [api]",
   6       "source": "put-mapping",
   1       "source": "put-pipeline-filebeat-6.1.1-iot-log4j-pipeline",
316087       "source": "shard-failed",

Oh dear, yes, this is a bit of a problem. Can you stop filebeat from trying to create this template so very persistently? All 500+ create-index-template tasks were added in the last 30 seconds, and although each task shouldn't take too long, the master cannot keep up with 20 of them every second. These are swamping the master and so it cannot get around to the 300,000+-and-counting (lower-priority) shard-failed tasks.

I should add that Elasticsearch deals with shard-failed tasks more efficiently in versions after 6.4.0 thanks to #31313.


I didn't use the correct Reply button, but have posted the pending tasks via Firefox Send.

Of the roughly 940 in my last list, there are about 440 each of these 2 pending tasks:

{                                                                                                                                                      
  "insert_order" : 14791,                                                                                                                              
  "priority" : "URGENT",                                                                                                                               
  "source" : "create-index-template [filebeat-6.1.2], cause [api]",                                                                                    
  "executing" : true,                                                                                                                                  
  "time_in_queue_millis" : 29857,                                                                                                                      
  "time_in_queue" : "29.8s"                                                                                                                            
},                         

And

{                                                                                                                                                      
  "insert_order" : 351,                                                                                                                                
  "priority" : "HIGH",                                                                                                                                 
  "source" : "shard-failed",                                                                                                                           
  "executing" : false,                                                                                                                                 
  "time_in_queue_millis" : 635754,                                                                                                                     
  "time_in_queue" : "10.5m"                                                                                                                            
},                  

Is there any way I can increase this timeout somewhere?

As soon as I start up the master I get a lot of these:

[2019-09-16T12:37:37,124][INFO ][o.e.b.BootstrapChecks    ] [elasticsearch_master/0] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2019-09-16T12:37:40,241][WARN ][o.e.c.a.s.ShardStateAction] [elasticsearch_master/0] [filebeat-6.1.2-2019.09.12][0] received shard failed for shard id [[filebeat-6.1.2-2019.09.12][0]], allocation id [bkR8Ld0HSWS8fmUT-splyg], primary term [13], message [mark copy as stale]
[2019-09-16T12:37:40,241][WARN ][o.e.c.a.s.ShardStateAction] [elasticsearch_master/0] [filebeat-6.1.2-2019.09.08][2] received shard failed for shard id [[filebeat-6.1.2-2019.09.08][2]], allocation id [fA7z17-jQMK79VQAieFT3Q], primary term [17], message [mark copy as stale]
[2019-09-16T12:37:40,241][WARN ][o.e.c.a.s.ShardStateAction] [elasticsearch_master/0] [filebeat-6.1.2-2019.09.08][1] received shard failed for shard id [[filebeat-6.1.2-2019.09.08][1]], allocation id [y0P95RMmQM-dTeGT9EGtww], primary term [19], message [mark copy as stale]
[2019-09-16T12:37:40,241][WARN ][o.e.c.a.s.ShardStateAction] [elasticsearch_master/0] [filebeat-6.1.2-2019.09.12][1] received shard failed for shard id [[filebeat-6.1.2-2019.09.12][1]], allocation id [C8V-1TT4SgmLNGgjQai8Gw], primary term [19], message [mark copy as stale]

Followed shortly by:

[2019-09-16T12:38:20,265][WARN ][o.e.c.a.s.ShardStateAction] [elasticsearch_master/0] [filebeat-6.1.2-2019.09.08][1] received shard failed for shard id [[filebeat-6.1.2-2019.09.08][1]], allocation id [y0P95RMmQM-dTeGT9EGtww], primary term [20], message [mark copy as stale]
[2019-09-16T12:38:20,273][WARN ][o.e.c.a.s.ShardStateAction] [elasticsearch_master/0] [filebeat-6.1.2-2019.09.13][1] received shard failed for shard id [[filebeat-6.1.2-2019.09.13][1]], allocation id [rTHrdboqQxmnEX7M3U3aLw], primary term [8], message [mark copy as stale]
[2019-09-16T12:38:20,642][WARN ][r.suppressed             ] path: /_template/filebeat-6.1.2, params: {name=filebeat-6.1.2}
org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (create-index-template [filebeat-6.1.2], cause [api]) within 30s
        at org.elasticsearch.cluster.service.ClusterService$ClusterServiceTaskBatcher.lambda$null$0(ClusterService.java:255) ~[elasticsearch-5.5.2.jar:5.5.2]

I have also upped my JVM options to:

-Xms8g                                                                                                                                                     
-Xmx8g   

No apparent effect.

David

I am not sure how to do that.

I have inherited this setup and have about 1,400 VMs running filebeat feeding Elasticsearch, so I don't know how to tell it to stop. Would that have to be something I run on each of the 1,400 VMs? Or is it something I can do server-side?

Is there anything I can do on the ES VM side of things (increasing RAM, CPUs and so on) to get past this, get the production server up, and then possibly look into upgrading to a newer version of ES/Kibana?

Your responses are greatly appreciated.
David

You could try a firewall rule on all your Elasticsearch nodes blocking access to port 9200 (by the looks of your earlier logs) from most of the VMs. It doesn't have to be watertight as long as it gets enough of them.
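
A minimal sketch of such a rule with iptables, assuming the filebeat hosts sit outside the ES nodes' own 10.249.1.0/24 subnet (adjust the range to your network, and remove the rule once the master has caught up):

# drop HTTP traffic to Elasticsearch from everything except the local ES subnet
$ iptables -A INPUT -p tcp --dport 9200 ! -s 10.249.1.0/24 -j DROP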

I'm not too familiar with filebeat but it looks like you perhaps should set setup.template.enabled: false on your monitored VMs. Auto-creation of the template seems helpful when you've only got a few VMs but I don't think it works so well with 1400 nodes all competing to set the template up.

https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-template.html
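
A minimal filebeat.yml fragment for that (a sketch; check it against your filebeat 6.1.2 configuration before rolling it out to the monitored VMs):

# filebeat.yml on each monitored VM: stop filebeat from trying to (re)load the index template
setup.template.enabled: false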

Good suggestion, will give it a try.

I used the firewall rule and cut off the 1400 filebeat instances.
This allowed the master to catch up.
Essentially it was running with 20 pending tasks and the "task_max_waiting_in_queue_millis" was not climbing.

I then enabled 700 of the instances and it was hovering at:

 "number_of_pending_tasks" : 49,                                                                                                                                                                                                                         
  "task_max_waiting_in_queue_millis" : 29941,  

But that has now climbed to:

  "number_of_pending_tasks" : 158,                                                                                                                         
  "task_max_waiting_in_queue_millis" : 25214,  

And I still haven't enabled the additional 700 feeds.

Suggestions?

If I were to upgrade to Elasticsearch 7.3 (from 5.5.2), I assume it is not backwards compatible and I will have to make a lot of configuration changes. Can you confirm?

Thanks,
David

Ok I guess those nodes are still hammering the master with create-template requests. Try reconfiguring them with setup.template.enabled: false so they stop.

Yes, upgrading from here will be a bit of a mission. You should first upgrade to 5.6.16 and use its upgrade assistant and deprecation logging to identify all the things that need adjusting before upgrading to 6.x. Then upgrade to the latest 6.x (6.8.3 at time of writing) and do the same to prepare for 7.x. At the very least you will need to delete or re-index any indices created in 5.x before you can upgrade to 7.x, but there will be other things too.