Elasticsearch stops after running for a week

I am running the ELK stack (Elasticsearch and Kibana) in Docker on a dedicated machine to monitor one of our applications, which also runs in Docker. Filebeat, also running in Docker, ships the logs to Elasticsearch. On every new deployment the Filebeat container is taken down and started again (running filebeat setup, then filebeat run). Everything works for a week or so, and then the Kibana interface goes down. This has happened four times now.

Here are the logs from Elasticsearch. Once these exceptions appear, there are no further logs from the Kibana container.

elasticsearch    | {"type": "server", "timestamp": "2020-09-01T09:11:31,191Z", "level": "DEBUG", "component": "o.e.a.s.TransportSearchAction", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "[.kibana_task_manager_1][0], node[jVcP3kfJSdG7RezbDK7-yg], [P], s[STARTED], a[id=WkViw7IdQsC2EVFcS1SeWQ]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={}}] lastShard [true]", "cluster.uuid": "tDA17sj0SxOuNyBSrmphBg", "node.id": "jVcP3kfJSdG7RezbDK7-yg" , 
elasticsearch    | "stacktrace": ["org.elasticsearch.transport.TransportException: failure to send",
elasticsearch    | "at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:660) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:704) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.transport.TransportService.sendChildRequest(TransportService.java:696) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.SearchTransportService.sendExecuteQuery(SearchTransportService.java:138) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:79) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:231) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:668) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.AbstractSearchAsyncAction$PendingExecutions.finishAndRunNext(AbstractSearchAsyncAction.java:662) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNext(AbstractSearchAsyncAction.java:640) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNext(AbstractSearchAsyncAction.java:632) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:68) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:238) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:45) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:29) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:68) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:36) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:394) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.transport.TransportService$6.handleResponse(TransportService.java:633) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1163) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1241) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1221) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:47) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:30) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:416) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:695) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
elasticsearch    | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
elasticsearch    | "at java.lang.Thread.run(Thread.java:832) [?:?]",
elasticsearch    | "Caused by: org.elasticsearch.tasks.TaskCancelledException: The parent task was cancelled, shouldn't start any child tasks",
elasticsearch    | "at org.elasticsearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:521) ~[elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.tasks.TaskManager.registerChildNode(TaskManager.java:201) ~[elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:627) ~[elasticsearch-7.8.1.jar:7.8.1]",
elasticsearch    | "... 31 more"] }
elasticsearch    | {"type": "server", "timestamp": "2020-09-01T09:11:31,194Z", "level": "DEBUG", "component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", "cluster.name": "docker-cluster", "node.name": "elasticsearch", "message": "Removing ban for the parent [jVcP3kfJSdG7RezbDK7-yg:405287] on the node [jVcP3kfJSdG7RezbDK7-yg]", "cluster.uuid": "tDA17sj0SxOuNyBSrmphBg", "node.id": "jVcP3kfJSdG7RezbDK7-yg"  }
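For anyone reading these entries, note that both carry "level": "DEBUG". A minimal sketch of how the docker-compose output could be split into service name and JSON payload so the level and component can be filtered; the function name `parse_compose_log_line` is hypothetical, and the sample is a trimmed copy of the last entry above.

```python
import json


def parse_compose_log_line(line):
    """Split 'service | payload' and try to parse the payload as JSON.

    Returns (service, dict) for complete one-line entries and
    (service, None) for fragments such as stack-trace continuation lines.
    """
    service, sep, payload = line.partition("|")
    if not sep:
        return None, None  # line has no docker-compose prefix
    try:
        return service.strip(), json.loads(payload.strip())
    except json.JSONDecodeError:
        # Multi-line entries (stack traces) are not valid JSON line by line.
        return service.strip(), None


# Trimmed copy of the TransportCancelTasksAction entry from the log above.
sample = (
    'elasticsearch    | {"type": "server", '
    '"timestamp": "2020-09-01T09:11:31,194Z", "level": "DEBUG", '
    '"component": "o.e.a.a.c.n.t.c.TransportCancelTasksAction", '
    '"node.name": "elasticsearch"}'
)

service, entry = parse_compose_log_line(sample)
print(service, entry["level"], entry["component"])
```

Filtering on `entry["level"]` this way would make it easier to see whether anything at WARN or ERROR level was logged around the time Kibana went quiet.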

Here are the logs from the Filebeat container as it tries to start up.

2020-09-09T06:45:36.642Z INFO instance/beat.go:647 Home path: [/usr/share/filebeat] Config path: [/usr/share/filebeat] Data path: [/usr/share/filebeat/data] Logs path: [/usr/share/filebeat/logs]
2020-09-09T06:45:36.655Z INFO instance/beat.go:655 Beat ID: 49e4b497-d5eb-49ab-bdb4-40bdb42d6ca8
2020-09-09T06:45:36.656Z INFO [beat] instance/beat.go:983 Beat info {"system_info": {"beat": {"path": {"config": "/usr/share/filebeat", "data": "/usr/share/filebeat/data", "home": "/usr/share/filebeat", "logs": "/usr/share/filebeat/logs"}, "type": "filebeat", "uuid": "49e4b497-d5eb-49ab-bdb4-40bdb42d6ca8"}}}
2020-09-09T06:45:36.656Z INFO [beat] instance/beat.go:992 Build info {"system_info": {"build": {"commit": "94f7632be5d56a7928595da79f4b829ffe123744", "libbeat": "7.8.1", "time": "2020-07-21T15:12:45.000Z", "version": "7.8.1"}}}
2020-09-09T06:45:36.656Z INFO [beat] instance/beat.go:995 Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":4,"version":"go1.13.10"}}}
2020-09-09T06:45:36.662Z INFO [beat] instance/beat.go:999 Host info {"system_info": {"host": {"architecture":"x86_64","boot_time":"2020-08-26T11:33:11Z","containerized":true,"name":"cc58258203f1","ip":["127.0.0.1/8","172.18.0.2/16"],"kernel_version":"4.15.0-112-generic","mac":["02:42:ac:12:00:02"],"os":{"family":"redhat","platform":"centos","name":"CentOS Linux","version":"7 (Core)","major":7,"minor":8,"patch":2003,"codename":"Core"},"timezone":"UTC","timezone_offset_sec":0,"id":"1a018e03a49f4bfc904c69b0d6c08959"}}}
2020-09-09T06:45:36.663Z INFO [beat] instance/beat.go:1028 Process info {"system_info": {"process": {"capabilities": {"inheritable":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"permitted":null,"effective":null,"bounding":["chown","dac_override","fowner","fsetid","kill","setgid","setuid","setpcap","net_bind_service","net_raw","sys_chroot","mknod","audit_write","setfcap"],"ambient":null}, "cwd": "/usr/share/filebeat", "exe": "/usr/share/filebeat/filebeat", "name": "filebeat", "pid": 1, "ppid": 0, "seccomp": {"mode":"filter","no_new_privs":false}, "start_time": "2020-09-09T06:45:35.930Z"}}}
2020-09-09T06:45:36.663Z INFO instance/beat.go:310 Setup Beat: filebeat; Version: 7.8.1
2020-09-09T06:45:36.664Z INFO [index-management] idxmgmt/std.go:184 Set output.elasticsearch.index to 'filebeat-7.8.1' as ILM is enabled.
2020-09-09T06:45:36.664Z INFO eslegclient/connection.go:99 elasticsearch url: http://10.200.160.5:9200
2020-09-09T06:45:36.665Z INFO [publisher] pipeline/module.go:113 Beat name: cc58258203f1
2020-09-09T06:45:36.670Z INFO eslegclient/connection.go:99 elasticsearch url: http://10.200.160.5:9200
2020-09-09T06:45:36.680Z INFO [esclientleg] eslegclient/connection.go:314 Attempting to connect to Elasticsearch version 7.8.1
Overwriting ILM policy is disabled. Set setup.ilm.overwrite:true for enabling.
2020-09-09T06:45:36.696Z INFO [index-management] idxmgmt/std.go:261 Auto ILM enable success.
2020-09-09T06:45:36.698Z INFO [index-management.ilm] ilm/std.go:139 do not generate ilm policy: exists=true, overwrite=false
2020-09-09T06:45:36.698Z INFO [index-management] idxmgmt/std.go:274 ILM policy successfully loaded.
2020-09-09T06:45:36.698Z INFO [index-management] idxmgmt/std.go:407 Set setup.template.name to '{filebeat-7.8.1 {now/d}-000001}' as ILM is enabled.
2020-09-09T06:45:36.698Z INFO [index-management] idxmgmt/std.go:412 Set setup.template.pattern to 'filebeat-7.8.1-*' as ILM is enabled.
2020-09-09T06:45:36.698Z INFO [index-management] idxmgmt/std.go:446 Set settings.index.lifecycle.rollover_alias in template to {filebeat-7.8.1 {now/d}-000001} as ILM is enabled.
2020-09-09T06:45:36.699Z INFO [index-management] idxmgmt/std.go:450 Set settings.index.lifecycle.name in template to {filebeat {"policy":{"phases":{"hot":{"actions":{"rollover":{"max_age":"30d","max_size":"50gb"}}}}}}} as ILM is enabled.
2020-09-09T06:45:36.701Z INFO template/load.go:169 Existing template will be overwritten, as overwrite is enabled.
2020-09-09T06:45:37.081Z INFO [add_cloud_metadata] add_cloud_metadata/add_cloud_metadata.go:93 add_cloud_metadata: hosting provider type detected as openstack, metadata={"availability_zone":"zone-2","instance":{"id":"i-0001d713","name":"xxxxxxxxxxxxxxxxx.novalocal"},"machine":{"type":"R1-Generic-4"},"provider":"openstack"}
2020-09-09T06:45:37.116Z INFO template/load.go:109 Try loading template filebeat-7.8.1 to Elasticsearch
2020-09-09T06:45:37.313Z INFO template/load.go:101 template with name 'filebeat-7.8.1' loaded.
2020-09-09T06:45:37.313Z INFO [index-management] idxmgmt/std.go:298 Loaded index template.
2020-09-09T06:45:37.316Z ERROR instance/beat.go:958 Exiting: resource 'filebeat-7.8.1' exists, but it is not an alias
Exiting: resource 'filebeat-7.8.1' exists, but it is not an alias
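The final error says the name filebeat-7.8.1 exists but is not an alias. One way to check what that name currently resolves to is the response of `GET /filebeat-7.8.1`: Elasticsearch keys that response by concrete index name, so a top-level key equal to the requested name means it is a plain index. The sketch below only interprets such a response; the function name `classify_resource` is hypothetical, and both sample responses are made up for illustration, not taken from this cluster.

```python
def classify_resource(name, get_index_response):
    """Decide whether `name` is a concrete index or an alias.

    `get_index_response` is the parsed JSON body of GET /<name>.
    """
    if name in get_index_response:
        return "index"
    for info in get_index_response.values():
        if name in info.get("aliases", {}):
            return "alias"
    return "unknown"


# Illustrative response shapes (assumed, not real data from this setup).
as_index = {"filebeat-7.8.1": {"aliases": {}, "mappings": {}, "settings": {}}}
as_alias = {
    "filebeat-7.8.1-2020.09.09-000001": {
        "aliases": {"filebeat-7.8.1": {"is_write_index": True}},
        "mappings": {},
        "settings": {},
    }
}

print(classify_resource("filebeat-7.8.1", as_index))
print(classify_resource("filebeat-7.8.1", as_alias))
```

If the first case applies here, that would match the wording of the error, though it does not by itself explain how the index came to exist.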

Can someone tell me what is going on?

What is the output from the _cluster/stats endpoint on Elasticsearch?
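In case it helps, a minimal sketch of fetching that endpoint with only the Python standard library. It assumes the cluster is reachable over plain HTTP without authentication; the helper names `cluster_stats_url` and `fetch_cluster_stats` are made up for this example.

```python
import json
from urllib.request import urlopen


def cluster_stats_url(base_url):
    """Build the _cluster/stats URL from the Elasticsearch base URL."""
    return base_url.rstrip("/") + "/_cluster/stats"


def fetch_cluster_stats(base_url, timeout=10):
    """GET _cluster/stats and return the parsed JSON body.

    Assumes no authentication or TLS is required (an assumption).
    """
    with urlopen(cluster_stats_url(base_url), timeout=timeout) as resp:
        return json.load(resp)


print(cluster_stats_url("http://10.200.160.5:9200"))
```

Calling `fetch_cluster_stats("http://10.200.160.5:9200")` from a machine that can reach the node would return the stats as a dictionary.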

I had already restarted the whole stack, so I guess I can only get that the next time it happens.

I restarted the ELK stack, but the Filebeat container cannot connect to it. Here is the log. (I will have to clean up the container with docker rm ... to get it working again.)
Update: Removing the container did not help.

2020-09-10T09:17:15.045Z	ERROR	[publisher_pipeline_output]	pipeline/output.go:155	Failed to connect to backoff(elasticsearch(http://10.200.160.5:9200)): Connection marked as failed because the onConnect callback failed: resource 'filebeat-7.8.1' exists, but it is not an alias
2020-09-10T09:17:15.045Z	INFO	[publisher_pipeline_output]	pipeline/output.go:146	Attempting to reconnect to backoff(elasticsearch(http://10.200.160.5:9200)) with 24 reconnect attempt(s)
2020-09-10T09:17:15.046Z	INFO	[publisher]	pipeline/retry.go:221	retryer: send unwait signal to consumer
2020-09-10T09:17:15.046Z	INFO	[publisher]	pipeline/retry.go:225	  done
2020-09-10T09:17:15.048Z	INFO	[esclientleg]	eslegclient/connection.go:314	Attempting to connect to Elasticsearch version 7.8.1
2020-09-10T09:17:15.064Z	INFO	[license]	licenser/es_callback.go:51	Elasticsearch license: Basic
2020-09-10T09:17:15.064Z	INFO	[index-management]	idxmgmt/std.go:261	Auto ILM enable success.
2020-09-10T09:17:15.066Z	INFO	[index-management.ilm]	ilm/std.go:139	do not generate ilm policy: exists=true, overwrite=false
2020-09-10T09:17:15.066Z	INFO	[index-management]	idxmgmt/std.go:274	ILM policy successfully loaded.
2020-09-10T09:17:15.067Z	INFO	[index-management]	idxmgmt/std.go:407	Set setup.template.name to '{filebeat-7.8.1 {now/d}-000001}' as ILM is enabled.
2020-09-10T09:17:15.067Z	INFO	[index-management]	idxmgmt/std.go:412	Set setup.template.pattern to 'filebeat-7.8.1-*' as ILM is enabled.
2020-09-10T09:17:15.067Z	INFO	[index-management]	idxmgmt/std.go:446	Set settings.index.lifecycle.rollover_alias in template to {filebeat-7.8.1 {now/d}-000001} as ILM is enabled.
2020-09-10T09:17:15.067Z	INFO	[index-management]	idxmgmt/std.go:450	Set settings.index.lifecycle.name in template to {filebeat {"policy":{"phases":{"hot":{"actions":{"rollover":{"max_age":"30d","max_size":"50gb"}}}}}}} as ILM is enabled.
2020-09-10T09:17:15.069Z	INFO	template/load.go:89	Template filebeat-7.8.1 already exists and will not be overwritten.
2020-09-10T09:17:15.069Z	INFO	[index-management]	idxmgmt/std.go:298	Loaded index template.
2020-09-10T09:17:28.559Z	INFO	[monitoring]	log/log.go:145	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":273940,"time":{"ms":63}},"total":{"ticks":585770,"time":{"ms":168},"value":585770},"user":{"ticks":311830,"time":{"ms":105}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":14},"info":{"ephemeral_id":"ba58c703-ce7e-4051-aa80-1c53e3852003","uptime":{"ms":161340067}},"memstats":{"gc_next":46513360,"memory_alloc":23958984,"memory_total":14923099080},"runtime":{"goroutines":3723}},"filebeat":{"events":{"active":16,"added":16},"harvester":{"files":{"3b9b24b8-b868-4d91-bed6-389e6e9f1227":{"last_event_published_time":"2020-09-10T09:17:21.516Z","last_event_timestamp":"2020-09-10T09:17:15.069Z","read_offset":4754,"size":1304}},"open_files":2,"running":2}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":5,"events":{"active":873,"published":16,"retry":1,"total":16}}},"registrar":{"states":{"current":154}},"system":{"load":{"1":0.05,"15":0.42,"5":0.26,"norm":{"1":0.0125,"15":0.105,"5":0.065}}}}}}

Here is the output of _cluster/stats:

{"_nodes":{"total":1,"successful":1,"failed":0},"cluster_name":"docker-cluster","cluster_uuid":"tDA17sj0SxOuNyBSrmphBg","timestamp":1599729696458,"status":"yellow","indices":{"count":42,"shards":{"total":42,"primaries":42,"replication":0.0,"index":{"shards":{"min":1,"max":1,"avg":1.0},"primaries":{"min":1,"max":1,"avg":1.0},"replication":{"min":0.0,"max":0.0,"avg":0.0}}},"docs":{"count":285617,"deleted":10},"store":{"size_in_bytes":92261459},"fielddata":{"memory_size_in_bytes":0,"evictions":0},"query_cache":{"memory_size_in_bytes":36824,"total_count":11,"hit_count":2,"miss_count":9,"cache_size":1,"cache_count":1,"evictions":0},"completion":{"size_in_bytes":0},"segments":{"count":16,"memory_in_bytes":113656,"terms_memory_in_bytes":91024,"stored_fields_memory_in_bytes":9872,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":9984,"points_memory_in_bytes":0,"doc_values_memory_in_bytes":2776,"index_writer_memory_in_bytes":0,"version_map_memory_in_bytes":0,"fixed_bit_set_memory_in_bytes":240,"max_unsafe_auto_id_timestamp":-1,"file_sizes":{}},"mappings":{"field_types":[{"name":"binary","count":3,"index_count":1},{"name":"boolean","count":28,"index_count":4},{"name":"date","count":53,"index_count":8},{"name":"flattened","count":1,"index_count":1},{"name":"float","count":5,"index_count":1},{"name":"geo_shape","count":1,"index_count":1},{"name":"integer","count":26,"index_count":2},{"name":"keyword","count":369,"index_count":9},{"name":"long","count":80,"index_count":6},{"name":"nested","count":9,"index_count":3},{"name":"object","count":245,"index_count":9},{"name":"text","count":174,"index_count":8}]},"analysis":{"char_filter_types":[],"tokenizer_types":[],"filter_types":[],"analyzer_types":[],"built_in_char_filters":[],"built_in_tokenizers":[],"built_in_filters":[],"built_in_analyzers":[]}},"nodes":{"count":{"total":1,"coordinating_only":0,"data":1,"ingest":1,"master":1,"ml":1,"remote_cluster_client":1,"transform":1,"voting_only":0},"versions":["7.8.1"],"os":{"available_processors":4,"allocated_processors":4,"names":[{"name":"Linux","count":1}],"pretty_names":[{"pretty_name":"CentOS Linux 7 (Core)","count":1}],"mem":{"total_in_bytes":8363581440,"free_in_bytes":2704494592,"used_in_bytes":5659086848,"free_percent":32,"used_percent":68}},"process":{"cpu":{"percent":0},"open_file_descriptors":{"min":395,"max":395,"avg":395}},"jvm":{"max_uptime_in_millis":1166465,"versions":[{"version":"14.0.1","vm_name":"OpenJDK 64-Bit Server VM","vm_version":"14.0.1+7","vm_vendor":"AdoptOpenJDK","bundled_jdk":true,"using_bundled_jdk":true,"count":1}],"mem":{"heap_used_in_bytes":367445504,"heap_max_in_bytes":1073741824},"threads":62},"fs":{"total_in_bytes":41442029568,"free_in_bytes":32589520896,"available_in_bytes":32572743680},"plugins":[],"network_types":{"transport_types":{"security4":1},"http_types":{"security4":1}},"discovery_types":{"single-node":1},"packaging_types":[{"flavor":"default","type":"docker","count":1}],"ingest":{"number_of_pipelines":1,"processor_stats":{"gsub":{"count":0,"failed":0,"current":0,"time_in_millis":0},"script":{"count":0,"failed":0,"current":0,"time_in_millis":0}}}}}
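Since the raw output is hard to read, here is a rough sketch that pulls the headline numbers out of a parsed _cluster/stats response. The function name `summarize_cluster_stats` is hypothetical, and the `stats` dictionary contains only the fields the function needs, copied from the output above.

```python
def summarize_cluster_stats(stats):
    """Reduce a _cluster/stats response to a few headline figures."""
    heap = stats["nodes"]["jvm"]["mem"]
    return {
        "status": stats["status"],
        "nodes": stats["_nodes"]["total"],
        "indices": stats["indices"]["count"],
        "docs": stats["indices"]["docs"]["count"],
        # Store size converted from bytes to MiB.
        "store_mb": round(stats["indices"]["store"]["size_in_bytes"] / 1024**2, 1),
        "heap_used_pct": round(
            100 * heap["heap_used_in_bytes"] / heap["heap_max_in_bytes"], 1
        ),
    }


# Subset of the response above, values copied verbatim.
stats = {
    "_nodes": {"total": 1},
    "status": "yellow",
    "indices": {
        "count": 42,
        "docs": {"count": 285617},
        "store": {"size_in_bytes": 92261459},
    },
    "nodes": {
        "jvm": {"mem": {"heap_used_in_bytes": 367445504,
                        "heap_max_in_bytes": 1073741824}}
    },
}

print(summarize_cluster_stats(stats))
```

For these figures the summary works out to a single node in yellow status with 42 indices, about 88 MiB on disk, and roughly 34% of the 1 GiB heap in use. Keep in mind the JVM uptime in the output is only about 19 minutes, so these numbers describe the freshly restarted node rather than its state after a week.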