Hello there!
Unfortunately, this seems to be a frequently recurring error in Elastic Cloud; I have found many reports from people running into it.
I have an Elastic Cloud account. I onboard agents and everything goes well at first: I can see my APM services, metrics, and logs coming in. But after a day or so the agent becomes unhealthy, even though no changes were made to either the Elastic deployment or the agent itself. It's really frustrating.
Elastic Cloud Deployment version: 8.3.3
Elastic Agent version: 8.3.3
Here are some logs from the agent:
[elastic_agent][error] apm-server stderr: "sync.runtime_Semacquire(0xc0008f84e0?)\n\truntime/sema.go:56 +0x25\nsync.(*WaitGroup).Wait(0xc00021da98?)\n\tsync/waitgroup.go:136 +0x52\ngolang.org/x/sync/errgroup.(*Group).Wait(0xc00012dac0)\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:52 +0x27\ngithub.com/elastic/apm-server/beater.server.run({0xc00098d500, 0xc000614c00, {{0x55a2f81ab478, 0xc000a5e0a8}, {0x55a2f81b1ab8, 0xc000a5e090}, 0x6fc23ac00, 0xc0007d60a0, 0xc00070e900}, 0xc00012d740, ...}, ...)\n\tgithub.com/elastic/apm-server/beater/server.go:248 +0x425\ngithub.com/elastic/apm-server/beater.newBaseRunServer.func1({_, _}, {{{0x55a2f73177b5, 0xa}, {0x55a2f73177b5, 0xa}, {0x55a2f730d2a3, 0x5}, 0x1, {0xc000052ee8, ...}, ...}, ...})\n\tgithub.com/elastic/apm-server/beater/server.go:109 +0x169\ngithub.com/elastic/apm-server/beater.WrapRunServerWithProcessors.func1({_, _}, {{{0x55a2f73177b5, 0xa}, {0x55a2f73177b5, 0xa}, {0x55a2f730d2a3, 0x5}, 0x1, {0xc000052ee8, ...}, ...}, ...})\n\tgithub.com/elastic/apm-server/beater/beater.go:913 +0x16d\nmain.runServerWithProcessors.func3()\n\tgithub.com/elastic/apm-server/x-pack/apm-server/main.go:206 +0x63\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74 +0x64\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:71 +0xa5\n\ngoroutine 235 [select]:\n"
20:56:29.157
elastic_agent
[elastic_agent][error] apm-server stderr: "github.com/elastic/apm-server/agentcfg.Reporter.Run({{0x55a2f81ab478, 0xc000a5e0a8}, {0x55a2f81b1ab8, 0xc000a5e090}, 0x6fc23ac00, 0xc0007d60a0, 0xc00070e900}, {0x55a2f81cb6c8?, 0xc00012da80})\n\tgithub.com/elastic/apm-server/agentcfg/reporter.go:75 +0x232\ngithub.com/elastic/apm-server/beater.server.run.func1()\n\tgithub.com/elastic/apm-server/beater/server.go:237 +0x3d\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74 +0x64\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:71 +0xa5\n\ngoroutine 236 [IO wait, 7 minutes]:\ninternal/poll.runtime_pollWait(0x7f638d577418, 0x72)\n\truntime/netpoll.go:302 +0x89\ninternal/poll.(*pollDesc).wait(0xc0001d6e00?, 0x0?, 0x0)\n\tinternal/poll/fd_poll_runtime.go:83 +0x32\ninternal/poll.(*pollDesc).waitRead(...)\n\tinternal/poll/fd_poll_runtime.go:88\ninternal/poll.(*FD).Accept(0xc0001d6e00)\n\tinternal/poll/fd_unix.go:614 +0x22c\nnet.(*netFD).accept(0xc0001d6e00)\n\tnet/fd_unix.go"
20:56:29.157
elastic_agent
[elastic_agent][error] apm-server stderr: ":172 +0x35\nnet.(*TCPListener).accept(0xc000a5e750)\n\tnet/tcpsock_posix.go:139 +0x28\nnet.(*TCPListener).Accept(0xc000a5e750)\n\tnet/tcpsock.go:288 +0x3d\nnet/http.(*Server).Serve(0xc00056e000, {0x55a2f81c9e98, 0xc000a5e750})\n\tnet/http/server.go:3039 +0x385\ngithub.com/elastic/apm-server/beater.(*httpServer).start(0xc00012d740)\n\tgithub.com/elastic/apm-server/beater/http.go:108 +0x2a5\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74 +0x64\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:71 +0xa5\n\ngoroutine 237 [chan receive, 7 minutes]:\ngithub.com/elastic/gmux.(*chanListener).Accept(0x55a2f81dae00"
20:56:29.157
elastic_agent
[elastic_agent][error] apm-server stderr: "?)\n\tgithub.com/elastic/gmux@v0.2.0/conn.go:100 +0x28\ngoogle.golang.org/grpc.(*Server).Serve(0xc000000f00, {0x55a2f81bcc98?, 0xc000a5e960})\n\tgoogle.golang.org/grpc@v1.48.0/server.go:790 +0x477\ngithub.com/elastic/apm-server/beater.server.run.func2()\n\tgithub.com/elastic/apm-server/beater/server.go:240 +0x29\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74 +0x64\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:71 +0xa5\n\ngoroutine 238 [chan receive, 7 minutes]:\ngithub.com/elastic/apm-server/beater.server.run.func3()\n\tgithub.com/elastic/apm-server/beater/server.go:243 +0x2e\ngolang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:74 +0x64\ncreated by golang.org/x/sync/errgroup.(*Group).Go\n\tgolang.org/x/sync@v0.0.0-20220513210516-0976fa681c29/errgroup/errgroup.go:71 +0xa5\n"
20:56:29.157
elastic_agent
[elastic_agent][error] apm-server stderr: "\ngoroutine 94 [select, 7 minutes]:\ngithub.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run(0xc0006f2410)\n\tgithub.com/elastic/beats/v7@v7.0.0-alpha2.0.20220722214259-1755b5dd3127/libbeat/publisher/pipeline/client_worker.go:123 +0x97\ncreated by github.com/elastic/beats/v7/libbeat/publisher/pipeline.makeClientWorker\n\tgithub.com/elastic/beats/v7@v7.0.0-alpha2.0.20220722214259-1755b5dd3127/libbeat/publisher/pipeline/client_worker.go:76 +0x2a5\n\ngoroutine 524 [IO wait]:\ninternal/poll.runtime_pollWait(0x7f638d577238, 0x72)\n\truntime/netpoll.go:302 +0x89\n"
20:56:29.157
elastic_agent
[elastic_agent][error] apm-server stderr: "internal/poll.(*pollDesc).wait(0xc000be0680?, 0xc000baf271?, 0x0)\n\tinternal/poll/fd_poll_runtime.go:83 +0x32\ninternal/poll.(*pollDesc).waitRead(...)\n\tinternal/poll/fd_poll_runtime.go:88\ninternal/poll.(*FD).Read(0xc000be0680, {0xc000baf271, 0x1, 0x1})\n\tinternal/poll/fd_unix.go:167 +0x25a\nnet.(*netFD).Read(0xc000be0680, {0xc000baf271?, 0xc000bea0d8?, 0xc000856f68?})\n\tnet/fd_posix.go:55 +0x29\nnet.(*conn).Read(0xc00011a190, {0xc000baf271?, 0x0?, 0x38?})\n\tnet/net.go:183 +0x45\nnet/http.(*connReader).backgroundRead(0xc000baf260)\n\tnet/http/server.go:672 +0x3f\ncreated by net/http.(*connReader).startBackgroundRead\n\tnet/http/server.go:668 +0xca\n\ngoroutine 552 [IO wait]:\ninternal/poll.runtime_pollWait(0x7f638d577058, 0x72)\n\truntime/netpoll.go:302 +0x89\ninternal/poll.(*pollDesc).wait(0xc000be0800?, 0xc000baf4e1?, 0x0)\n\tinternal/poll/fd_poll_runtime.go"
20:56:29.158
elastic_agent
[elastic_agent][error] apm-server stderr: ":83 +0x32\ninternal/poll.(*pollDesc).waitRead(...)\n\tinternal/poll/fd_poll_runtime.go:88\ninternal/poll.(*FD).Read(0xc000be0800, {0xc000baf4e1, 0x1, 0x1})\n\tinternal/poll/fd_unix.go:167 +0x25a\nnet.(*netFD).Read(0xc000be0800, {0xc000baf4e1?, 0xc000bea218?, 0xc000855768?})\n\tnet/fd_posix.go:55 +0x29\nnet.(*conn).Read(0xc000a08668, {0xc000baf4e1?, 0xc0006f2190?, 0x8000080008?})\n\tnet/net.go:183 +0x45\nnet/http.(*connReader).backgroundRead(0xc000baf4d0)\n\tnet/http/server.go:672 +0x3f\ncreated by net/http.(*connReader).startBackgroundRead\n\tnet/http/server.go:668 +0xca"
20:56:29.158
elastic_agent
[elastic_agent][error] apm-server stderr: "\n"
20:57:11.638
elastic_agent
[elastic_agent][warn] Elastic Agent status changed to: 'degraded'
20:57:11.638
elastic_agent
[elastic_agent][info] 2022-08-17T18:57:11Z - message: Application: apm-server--8.3.3[24e48d9b-fb94-4ec5-bd88-45f1a9659f76]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'
20:58:11.643
elastic_agent
[elastic_agent][error] Elastic Agent status changed to: 'error'
20:58:11.644
elastic_agent
[elastic_agent][error] 2022-08-17T18:58:11Z - message: Application: apm-server--8.3.3[24e48d9b-fb94-4ec5-bd88-45f1a9659f76]: State changed to FAILED: Missed two check-ins - type: 'ERROR' - sub_type: 'FAILED'
23:15:57.399
elastic_agent
[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: status code: 0, fleet-server returned an error: , message: Post "https://10.11.148.50:18011/api/fleet/agents/24e48d9b-fb94-4ec5-bd88-45f1a9659f76/checkin": read tcp 172.17.0.2:39780->10.11.148.50:18011: read: connection timed out
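In case it helps anyone reading the trace above: the apm-server goroutine dump arrives split across several stderr entries, with newlines and tabs shown as literal `\n` and `\t`. Below is a minimal sketch for stitching the chunks back into a readable trace, assuming the log messages have been copied out as plain strings. The function name `stitch_stderr` and the regular expression are my own and are not part of any Elastic tooling.

```python
import re

# Matches entries like: [elastic_agent][error] apm-server stderr: "<chunk>"
# (pattern is an assumption based on the log lines pasted above)
STDERR_ENTRY = re.compile(
    r'^\[elastic_agent\]\[\w+\] apm-server stderr: "(.*)"$', re.DOTALL
)


def stitch_stderr(lines):
    """Join chunked apm-server stderr messages into one readable trace."""
    parts = []
    for line in lines:
        match = STDERR_ENTRY.match(line.strip())
        if not match:
            continue  # skip timestamps, dataset names, and other messages
        chunk = match.group(1)
        # The log viewer shows newlines and tabs as literal backslash escapes.
        parts.append(chunk.replace("\\n", "\n").replace("\\t", "\t"))
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        '[elastic_agent][error] apm-server stderr: "net.(*netFD).accept(0xc0001d6e00)\\n\\tnet/fd_unix.go"',
        "20:56:29.157",
        "elastic_agent",
        '[elastic_agent][error] apm-server stderr: ":172 +0x35\\n"',
    ]
    print(stitch_stderr(sample))
```

The chunks are concatenated in the order given, so the entries need to be passed in the same order they appear in the log viewer.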