Hi @dkow,
Thanks for your help! Here are the answers to your questions. I have also run the diagnostics tool and sent the output to eck@elastic.co.
> How did you confirm that logs are missing? Is it a single log now and then or all logs during a particular time window?
We confirmed that logs are missing because we are transitioning from AWS managed Elasticsearch to ECK. When we moved our alerts over, we noticed that the alerts on the new system didn't pick up all of the events that the old system did. At first I thought the alerts weren't working, but when we looked further into it, the logs those alerts need in order to trigger were missing. The alerts do work when the logs are there. It doesn't appear to be a particular time window; random chunks of logs are missing.
> Can you identify a single Elastic Agent where you can confirm that some logs were dropped and identify when it occurred? If yes, then do the following:
>
> - exec into the Pod: `kubectl exec -it your_pod_name -- bash`
> - go to the log path for your namespace: `cd state/data/logs/default`
> - in that directory you should find logs for the Filebeat process that is responsible for gathering and sending the logs
>
> Can you see anything suspicious there? Connection failures, errors, etc.? If yes, can you share them here?

Yes — the Filebeat logs on that Agent contain warnings and errors like the two entries below:
{"log.level":"warn","@timestamp":"2021-10-03T21:03:10.474Z","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":405},"message":"Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xc04ea483659f756d, ext:243062640032189, loc:(*time.Location)(0x5575af4ac320)}, Meta:{\"raw_index\":\"logs-generic-default\"}, Fields:{\"agent\":{\"ephemeral_id\":\"eba9f2cc-130f-42bd-af29-666265978c55\",\"hostname\":\"elastic-agent-agent-zvx4d\",\"id\":\"2f577b92-cf36-4fe8-944b-b17e0a473799\",\"name\":\"elastic-agent-agent-zvx4d\",\"type\":\"filebeat\",\"version\":\"7.14.1\"},\"cloud\":{\"account\":{\"id\":\"221837593202\"},\"availability_zone\":\"ap-southeast-2a\",\"image\":{\"id\":\"ami-04b1878ebf78f7370\"},\"instance\":{\"id\":\"i-0bad5d941b2efd7e6\"},\"machine\":{\"type\":\"m5a.large\"},\"provider\":\"aws\",\"region\":\"ap-southeast-2\",\"service\":{\"name\":\"EC2\"}},\"data_stream\":{\"dataset\":\"generic\",\"namespace\":\"default\",\"type\":\"logs\"},\"ecs\":{\"version\":\"1.10.0\"},\"elastic_agent\":{\"id\":\"2f577b92-cf36-4fe8-944b-b17e0a473799\",\"snapshot\":false,\"version\":\"7.14.1\"},\"event\":{\"dataset\":\"generic\"},\"host\":{\"architecture\":\"x86_64\",\"containerized\":true,\"hostname\":\"elastic-agent-agent-zvx4d\",\"id\":\"94ce1221fac96aa24ab7224bba617b9b\",\"ip\":[\"100.100.206.206\",\"fe80::649c:c9ff:febb:52a8\"],\"mac\":[\"66:9c:c9:bb:52:a8\"],\"name\":\"elastic-agent-agent-zvx4d\",\"os\":{\"codename\":\"Core\",\"family\":\"redhat\",\"kernel\":\"5.8.0-1041-aws\",\"name\":\"CentOS Linux\",\"platform\":\"centos\",\"type\":\"linux\",\"version\":\"7 (Core)\"}},\"input\":{\"type\":\"log\"},\"kubernetes\":{\"container\":{\"id\":\"b0c4e8a03d7f6c1b179db2a4f6d2eb45b5956137e45958182cd02a908d7da40d\",\"image\":\"k8s.gcr.io/ingress-nginx/controller:v0.47.0@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b\",\"name\":\"controller\",\"runtime\":\"containerd\"},\"namespace\":\"production\",\"pod\":{\"ip\":\"100.100.206.232\",\"labels\":{\"app\":{\"kubernetes\":{\"io/component\":\"controller\",\"io/instance\":\"ingress-nginx-alb-production\",\"io/name\":\"ingress-nginx\"}},\"pod-template-hash\":\"5cd45d569d\"},\"name\":\"ingress-nginx-alb-production-controller-5cd45d569d-9w9rt\",\"uid\":\"26316f49-966b-4f1f-9307-627b64654a12\"}},\"log\":{\"file\":{\"path\":\"/var/log/containers/ingress-nginx-alb-production-controller-5cd45d569d-9w9rt_production_controller-b0c4e8a03d7f6c1b179db2a4f6d2eb45b5956137e45958182cd02a908d7da40d.log\"},\"offset\":6027608},\"message\":\"2021-10-03T21:03:08.273466192Z stderr F 2021/10/03 21:03:08 [error] 83#83: *3967636 upstream sent too big header while reading response header from upstream, client: 185.191.171.15, server: mp.natlib.govt.nz, request: \\\"GET /headings?il%5Bsubject%5D=Smoking\\u0026il%5Byear%5D=1882 HTTP/1.1\\\", upstream: \\\"http://100.100.206.212:3000/headings?il%5Bsubject%5D=Smoking\\u0026il%5Byear%5D=1882\\\", host: \\\"mp.natlib.govt.nz\\\"\"}, Private:file.State{Id:\"native::514952-66305\", PrevId:\"\", Finished:false, Fileinfo:(*os.fileStat)(0xc0041b4d00), Source:\"/var/log/containers/ingress-nginx-alb-production-controller-5cd45d569d-9w9rt_production_controller-b0c4e8a03d7f6c1b179db2a4f6d2eb45b5956137e45958182cd02a908d7da40d.log\", Offset:6028012, Timestamp:time.Time{wall:0xc04e9ef076086a3b, ext:237354915346089, loc:(*time.Location)(0x5575af4ac320)}, TTL:-1, Type:\"log\", Meta:map[string]string(nil), FileStateOS:file.StateOS{Inode:0x7db88, Device:0x10301}, 
IdentifierName:\"native\"}, TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400): {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [kubernetes.pod.labels.app] of type [keyword] in document with id 'ABD3R3wBvGqm6X9qvHl6'. Preview of field's value: '{kubernetes={io/instance=ingress-nginx-alb-production, io/component=controller, io/name=ingress-nginx}}'\",\"caused_by\":{\"type\":\"illegal_state_exception\",\"reason\":\"Can't get text on a START_OBJECT at 1:526\"}}","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-10-03T21:03:11.633Z","log.logger":"kubernetes","log.origin":{"file.name":"add_kubernetes_metadata/matchers.go","file.line":91},"message":"Error extracting container id - source value does not contain matcher's logs_path '/var/lib/docker/containers/'.","service.name":"filebeat","ecs.version":"1.6.0"}
> Can you confirm that there are no intermittent networking issues in your cluster? A couple of errors that you've posted look like there are some, e.g. `etcdserver: request timed out`, `error: fail to checkin to fleet-server`.
Unfortunately, we do have intermittent networking issues in the cluster that we are still trying to solve; they are mainly DNS-related.
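In case the context helps, this is roughly the kind of in-cluster DNS spot check we've been running while trying to pin it down (a minimal sketch; the busybox image and the names being resolved are illustrative, not our exact setup):

```sh
# throwaway pod with a working nslookup (busybox 1.28 is commonly used because newer busybox nslookup builds are unreliable)
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- sh

# inside the pod: resolve an in-cluster name and an external name a few times in a row
nslookup kubernetes.default.svc.cluster.local
nslookup artifacts.elastic.co
```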
> Kibana 401/500 errors are curious and I'm not sure why they only happen temporarily, but if all your Agents are ultimately healthy in Fleet UI it would seem that it's not the culprit here.
> Can you double check that Fleet Settings in Fleet UI (Fleet Server hosts and Elasticsearch hosts) are set correctly, to addresses routable from your Elastic Agents?
Yes, they are correct.
> Can you run our diagnostics tool and share the results? You can either share them here or send them to eck@elastic.co. It will not contain any Secrets, but it will have your ECK-managed resources and their logs, among other things.

Done — as mentioned above, I've sent the diagnostics output to eck@elastic.co.
Hope that helps!
Richard