Fleet UI is unusable due to timeout errors

ES, Kibana, Fleet Server: 7.17.
Elastic Agent versions: 7.17 & 7.15.2

I've been using Fleet for a few months on 7.15.2 and then 7.16, and things were working great, but after upgrading to 7.17 it's unusable. I cannot open the UI and keep getting "request timed out" for every page. If it somehow opens after dozens of attempts, the APIs time out instead.

It's not related to the package registry, since I'm able to access the Integrations page instantly. Every other aspect of our ELK stack works perfectly smoothly. We're only facing this issue with Fleet, and because of it we are seriously considering moving to manual Beats collection instead. Since that comes with a lot of maintenance overhead for upgrades, we would still prefer to keep using Fleet if someone could help with this timeout issue.

Timeouts for all pages:


[screenshot]

Policies don't get updated, whether small, big, new, or old:


[screenshot]

Is it related to this? @ruflin

I'm seeing the same errors on my agents; they are not able to connect to Fleet Server. There have been no proxy or networking-related changes since the upgrade, and the agents had been able to connect from the start.

elastic_agent logs from one of the out-of-date machines:

[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: status code: 503, fleet-server returned an error: ServiceUnavailable, message: server is stopping

[elastic_agent][error] Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post "https://x.x:8220/api/fleet/agents/3b8ec162-4a14-4e8c-8899-96d1b4a33513/checkin?": EOF
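
As a quick check (a hedged suggestion; the host and port come from the redacted log above, and -k skips TLS verification so it is only suitable for testing), Fleet Server's status endpoint can be queried directly from one of these machines to confirm it is reachable and healthy:

    curl -k https://x.x:8220/api/status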

@marclop Could the experimental: true setting that you had suggested cause this by any chance?

Hi @Yash_Gill,

Sorry to see you all are having trouble with Fleet after upgrading to 7.17.0. Based on the info you shared, it does seem your installation is having trouble installing the MongoDB package before your reverse proxy hits a 2-minute timeout. The same issue is likely affecting both the setup API and the package_policies APIs.

We do have some work to do on making package installs faster, but I'm not sure the issues that you are encountering here are the same ones. On my 8.1.1 cluster, I can install the same package in about 15s (without any associated integration policies). A few questions / suggestions:

  • Which version of MongoDB do you currently have installed? (You can see this in the Integrations app -> Installed Packages -> MongoDB -> Settings, or via the API sketch after this list.)
  • If this is currently < 1.3.1, can you upgrade the package with the Upgrade policies checkbox unchecked? If that works successfully, you will need to upgrade each individual integration policy (see the Integration policies tab of the MongoDB page). This may fix loading the Fleet app.
  • How many MongoDB integration policies do you have?
  • Is upgrading to 8.1.1 an option? We have fixed a few performance-related items in later releases which may help resolve the issue. We also execute Fleet's setup and upgrade process in the background, so it shouldn't be affected by reverse proxy timeouts anymore. 8.2 will also include some further performance fixes once it is released.
  • I'm not sure about the Fleet Server checkin errors you are seeing, but it is possible that fixing the MongoDB package upgrade issue may resolve that.
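
For the first question, the installed version can also be checked over the Fleet API. This is a hedged sketch (the <kibana domain> and credentials are placeholders); the response includes the package's install status and installed version:

    curl -u elastic:<password> -XGET '<kibana domain>/api/fleet/epm/packages/mongodb-1.3.1'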

@joshdover Thanks for the prompt reply.

To reply to your questions/points:

  • Which version of MongoDB do you currently have installed?
    • 1.3.1
  • How many MongoDB integration policies do you have?
    • 2 (both 1.3.1)
  • Is upgrading to 8.1.1 an option?
    • Yes, we are planning an upgrade but probably after a few weeks/months. Is this timeout a known issue affecting <8.x versions?
  • I'm not sure about the Fleet Server checkin errors you are seeing, but it is possible that fixing the MongoDB package upgrade issue may resolve that.
    • Actually, this doesn't just affect the MongoDB integration; it's the same across every package. Even something as basic as the Windows integration times out.

Is there any config change that could fix or work around this? I wasn't able to find this timeout issue in the forums or in GitHub issues.

Well, it is certainly odd then that Fleet is attempting to upgrade this package. I'm not aware of any related known issue, so I think something else may be happening here. Can we try this?

  • First, enable debug Fleet logs in your Kibana configuration:
    logging:
      loggers:
        - name: plugins.fleet
          level: debug
    
  • Restart Kibana
  • Next, manually reinstall the package with this API call (this should not do anything destructive):
    curl -XPOST --url <kibana domain>/api/fleet/epm/packages/mongodb-1.3.1 -H 'content-type: application/json' -H 'kbn-xsrf: 123' -d'{"force":true}' -u elastic:<password>
    
  • Attempt to open Fleet UI again
  • If the issue persists, share whatever Fleet debug logs look relevant.
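
A hedged example of pulling those Fleet entries out of the Kibana log once debug logging is enabled (the log file path is an assumption; it depends on how Kibana is installed and it may be logging to stdout/journald instead):

    grep "plugins.fleet" /var/log/kibana/kibana.log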

That should have no effect on Fleet or its responsiveness, since that setting only affects the Elasticsearch output in APM Server.

Thanks for the clarification, @marclop.
Just out of curiosity, I removed experimental: true in Fleet Settings and all the out-of-date policy issues went away. If I add it back, they return; if I remove it again, policies update instantly again. Not sure why this would happen.
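
For context, the only line involved is this one in the Elasticsearch output configuration YAML under Fleet Settings (a minimal sketch; nothing else in that box was changed):

    experimental: true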

Thanks @joshdover, I removed the MongoDB integration from both policies and tried what you suggested. I added the debug logger for the Fleet plugin in kibana.yml, but I don't see anything special in the logs after restarting Kibana. Could you give an example of a log entry to look for? I'll grep for it and send the file here.

Also tried the API call you suggested:
curl -XPOST --url <kibana domain>/api/fleet/epm/packages/mongodb-1.3.1 -H 'content-type: application/json' -H 'kbn-xsrf: 123' -d'{"force":true}' -u elastic:<password>
Response:
{"statusCode":503,"error":"Service Unavailable","message":"Request timed out"}

When loading the Fleet UI, the same API is called as a GET and it completes instantly:


[screenshots]

However, if I try the same thing by clicking "Install MongoDB Assets", I get the following response:


{"statusCode":500,"error":"Internal Server Error","message":"Cannot find asset mongodb-1.3.1/kibana/dashboard/mongodb-Metrics-MongoDB.json"}

I tried something different: I drastically increased the timeout configs in kibana.yml and was able to get it to install the assets as well as add the integration to the required policy.
It took 4.5 minutes to install MongoDB.

There's a new problem though: I'm unable to get any of the Mongo logs or metrics into ES, and I'm getting this in logs-* when I filter for mongo:
Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xc089246eb1873026, ext:873610542553, loc:(*time.Location)(0x56435c979100)}, Meta:{"raw_index":"logs-mongodb.log-mongodb_servers"}, Fields:{"agent":{"ephemeral_id":"bd68227e-9f51-4745-9a21-bab39e0135f0","hostname":"x.x.x.com","id":"46f0d193-0e17-4388-a18f-4c525a0d650f","name":"x.x.x.com","type":"filebeat","version":"7.15.2"},"data_stream":{"dataset":"mongodb.log","namespace":"mongodb_servers","type":"logs"},"ecs":{"version":"1.11.0"},"elastic_agent":{"id":"46f0d193-0e17-4388-a18f-4c525a0d650f","snapshot":false,"version":"7.15.2"},"event":{"dataset":"mongodb.log"},"host":{"architecture":"x86_64","containerized":false,"hostname":"x.x.x.com","id":"b2b676aeef00467186bfc275f8b20d2b","ip":["x.x.x.x","..."],"mac":["..."],"name":"x.x.x.com","os":{"codename":"focal","family":"debian","kernel":"5.4.0-80-generic","name":"Ubuntu","platform":"ubuntu","type":"linux","version":"20.04.2 LTS (Focal Fossa)"}},"input":{"type":"log"},"log":{"file":{"path":"/var/log/mongodb/mongod.log"},"offset":270444},"message":"{\"t\":{\"$date\":\"2021-06-22T06:57:20.217Z\"},\"s\":\"I\", \"c\":\"NETWORK\", \"id\":51800, \"ctx\":\"conn66\",\"msg\":\"client metadata\",\"attr\":{\"remote\":\"x.x.x.x:56972\",\"client\":\"conn66\",\"doc\":{\"driver\":{\"name\":\"mongo-java-driver|legacy\",\"version\":\"3.10.2\"},\"os\":{\"type\":\"Linux\",\"name\":\"Linux\",\"architecture\":\"amd64\",\"version\":\"5.4.0-74-generic\"},\"platform\":\"Java/AdoptOpenJDK/11.0.11+9\"}}}","tags":["mongodb-logs"]}, Private:file.State{Id:"native::46792787-2306", PrevId:"", Finished:false, Fileinfo:(*os.fileStat)(0xc0008c5d40), Source:"/var/log/mongodb/mongod.log", Offset:270825, Timestamp:time.Time{wall:0xc089246eaace0488, ext:873497744001, loc:(*time.Location)(0x56435c979100)}, TTL:-1, Type:"log", Meta:map[string]string(nil), FileStateOS:file.StateOS{Inode:0x2ca0053, Device:0x902}, IdentifierName:"native"}, TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=403): {"type":"security_exception","reason":"action [indices:data/write/bulk[s]] is unauthorized for API key id [x] of user [elastic/fleet-server] on indices [logs-mongodb.log-mongodb_servers,.ds-logs-mongodb.log-mongodb_servers-2022.03.30-000001], this action is granted by the index privileges [create_doc,create,delete,index,write,all]"}, dropping event!

Not sure why I'd get:
action [indices:data/write/bulk[s]] is unauthorized for API key id of user [elastic/fleet-server] on indices [logs-mongodb.log-mongodb_servers,.ds-logs-mongodb.log-mongodb_servers-2022.03.30-000001], this action is granted by the index privileges [create_doc,create,delete,index,write,all]

I installed the integration as recommended, and all the necessary assets (index templates, data streams, indices, etc.) are present in ES.
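
For what it's worth, this is a hedged example of the kind of check I mean for the data stream named in the error (the <es domain> and credentials are placeholders):

    curl -u elastic:<password> -XGET '<es domain>/_data_stream/logs-mongodb.log-mongodb_servers'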

Any idea what could be wrong?

That is certainly odd. There is a bug in the 7.17.1 APM Server that severely limits the throughput you get with it, but it is surprising that it would affect Fleet in any way.

One possibility would be that (if not on 7.17.1) the Elasticsearch cluster is under pressure from the amount of data being ingested, and that is causing other calls to Elasticsearch to time out, which may in turn bubble up through Kibana. However, I don't have any visibility into your Elasticsearch deployment or its current health and metrics.

If you have monitoring metrics, or are willing to enable stack monitoring and explore this issue further, we could certainly have a look and investigate what's happening.

@marclop
I can definitely share any monitoring metrics you need; we have everything enabled.
But there's really no change as such; we've been ingesting pretty much the same amount of data for months, and ES operates well below its overall capacity. For instance, when we once enabled trace logs in our apps last month, our ingestion went up 4x for a week, but ES was still easily able to handle it.

Also, I was so focused on getting the mongodb package installed from the local environment that I completely missed mentioning that removing the experimental: true setting fixes all the timeouts in the UI.
I'm able to access the Fleet UI over the internet instantly. It's back to how it was on the previous versions (7.15.2 & 7.16), with no lag or timeouts for any Fleet pages or APIs.

I understand that technically that config shouldn't play a factor at all, but I tried it again and I do get timeouts when I enable it.

@Yash_Gill Would you mind opening an issue in the APM Server GitHub repository? That way we can work with you while we try to determine what could be wrong.

It would be good to include the exact Elasticsearch, Kibana, Elastic Agent / Fleet, and APM Server versions, as well as the configurations, with any sensitive information scrubbed or anonymized.

Sure @marclop, sounds good, I'll create a GitHub issue. Should it cover the experimental: true timeout issue as well? I doubt many people are using that setting anyway; it seems like an edge case.

The main remaining issue for me right now is with the mongodb integration specifically: logs aren't being indexed in ES despite all the assets being installed through the integration.
action [indices:data/write/bulk[s]] is unauthorized for API key id of user [elastic/fleet-server] on indices [logs-mongodb.log-mongodb_servers,.ds-logs-mongodb.log-mongodb_servers-2022.03.30-000001], this action is granted by the index privileges [create_doc,create,delete,index,write,all

Could you also tag someone working on that package/area here while I create the issue?

This is very surprising to me. Are you saying that removing the experimental: true flag from the Elasticsearch output settings resolves the mongodb package timeout problems too, or just the out-of-date agent issues?

I wonder if this could be related to the experimental: true flag as well. What might be needed is to force a policy update on the affected agents, which would require re-generating API keys. A simple way to do this would be to add a new integration policy (of a different package, say nginx) to these agents, wait for them to finish updating, and then remove the integration policy.

If that doesn't work, re-enrolling the agents should fix the problem (but that's not a very nice solution).

Hi @joshdover, it's not wise to do this on prod, but I did it during low traffic and was able to reproduce the behavior multiple times.

Removing experimental: true fixes the out-of-date agent issues. I was still getting a timeout, but as shown in the screenshot, that's because the mongodb package took 4.5 minutes to install; Kibana never waited that long and would close the request before it finished. Increasing elasticsearch.requestTimeout to 10 minutes in kibana.yml allowed it to complete.
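
For reference, the kibana.yml change was along these lines (the value is in milliseconds; 600000 is 10 minutes):

    # allow long-running Fleet/package-install requests to complete
    elasticsearch.requestTimeout: 600000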

Thanks a lot for the suggestions:

  • Adding a new integration policy (of a different package, say nginx) to these agents, wait for them to finish updating, and then remove the integration policy

    • This did not work
  • If that doesn't work, re-enrolling the agents should fix the problem

    • This fixed the index privileges issue!

But I guess this mongodb package really doesn't want me to use it. I'm able to get it to work for my local Mongo instance, but I'm getting these kinds of error logs for prod (some info redacted):
Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Date(2022, time.March, 31, 14, 20, 36, 105483398, time.Local), Meta:{"raw_index":"logs-mongodb.log-production_linux_servers"}, Fields:{"agent":{"ephemeral_id":"bxxxxe7a-46c0-48cd-aca1-1543caa99661","hostname":"lbox-0","id":"82c349e4-74c4-4a33-8d6a-fd65c150bb8a","name":"lbox-0","type":"filebeat","version":"7.17.0"},"data_stream":{"dataset":"mongodb.log","namespace":"production_linux_servers","type":"logs"},"ecs":{"version":"1.12.0"},"elastic_agent":{"id":"8xxxxxe4-7xx4-4a33-8d6a-fdxxxxxbb8a","snapshot":false,"version":"7.17.0"},"event":{"dataset":"mongodb.log"},"host":{"architecture":"x86_64","containerized":false,"hostname":"lbox-0","id":"45eb2df2882c4780af7ccc66d6d30f21","ip":["19.x.x.x","fxx0::2xx:5xxx:fxxx:3xxx"],"mac":["0x:85:4d:45:55:53"],"name":"lbox-0","os":{"codename":"bionic","family":"debian","kernel":"4.15.0-151-generic","name":"Ubuntu","platform":"ubuntu","type":"linux","version":"18.04.4 LTS (Bionic Beaver)"}},"input":{"type":"log"},"log":{"file":{"path":"/var/log/mongodb/mongod.log"},"offset":341386656},"message":"2022-03-31T21:20:35.778Z I NETWORK [conn27270860] received client metadata from 192.1xx.2xx.1xx:5xxx4 conn27270860: { driver: { name: \"mongo-csharp-driver\", version: \"2.8.1.0\" }, os: { type: \"Windows\", name: \"Microsoft Windows 10.0.17763\", architecture: \"x86_64\", version: \"10.0.17763\" }, platform: \".NET Framework 4.8.4470.0\" }","tags":["mongodb-logs"]}, Private:file.State{Id:"native::3539860-2050", PrevId:"", Finished:false, Fileinfo:(*os.fileStat)(0xc000856680), Source:"/var/log/mongodb/mongod.log", Offset:341386986, Timestamp:time.Date(2022, time.March, 31, 14, 17, 56, 905013440, time.Local), TTL:-1, Type:"log", Meta:map[string]string(nil), FileStateOS:file.StateOS{Inode:0x360394, Device:0x802}, IdentifierName:"native"}, TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"illegal_argument_exception","reason":"data stream timestamp field [@timestamp] is missing"}}, dropping event!
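
One hedged way to dig into the "data stream timestamp field [@timestamp] is missing" rejection is to run the raw message through the integration's ingest pipeline with the simulate API. The pipeline name below assumes Fleet's usual <type>-<dataset>-<package version> naming and may differ in your cluster (GET _ingest/pipeline/logs-mongodb.log* lists what's installed), and the message value is a placeholder for the message field from the dropped event above:

    curl -u elastic:<password> -H 'Content-Type: application/json' \
      -XPOST '<es domain>/_ingest/pipeline/logs-mongodb.log-1.3.1/_simulate' -d '
    {
      "docs": [
        { "_source": { "message": "<message field from the dropped event>" } }
      ]
    }'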

Created issues for different integrations

@marclop @joshdover

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.