Since Elastic Stack 8.8.0 we have observed a reproducible issue: navigating Fleet, and especially modifying integrations, stalls the cluster with high CPU usage. In the following screenshot I tried to add the System integration, and the Elasticsearch process went to 100% CPU load for several minutes.
- 7 Elasticsearch instances
- 2 Kibana instances
- 2 Fleet instances
- AMD EPYC 7313
- 8 TB NVMe SSD
- 128 GB memory
It seems there is a bug in some component, but I need help tracking it down.
The warm and cold nodes appear to be hit by this issue. We have about 50 Elastic Agents and about 20 policies.
What I've found in the Kibana log:
2023-06-23T16:37:15.944604+02:00 XXX kibana: [2023-06-23T16:37:15.944+02:00][ERROR][plugins.taskManager] [WorkloadAggregator]: ResponseError: search_phase_execution_exception
2023-06-23T16:37:15.944672+02:00 XXX kibana: #011Root causes:
2023-06-23T16:37:15.944716+02:00 XXX kibana: #011#011parse_exception: operator not supported for date math [+12500ms]
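For context on the parse_exception: Elasticsearch date math only accepts the units y, M, w, d, h, H, m, and s; there is no ms unit, which is why an expression like `+12500ms` fails to parse. A minimal sketch of that unit check (my own regex, not Elasticsearch's actual parser):

```python
import re

# Units accepted by Elasticsearch date math operators: y, M, w, d, h, H, m, s.
# Note there is no "ms" (milliseconds) unit.
DATE_MATH_OP = re.compile(r"^[+-]\d+[yMwdhHms]$")

def is_valid_date_math_op(op: str) -> bool:
    """Check whether a single date-math operator like '+30s' would parse."""
    return bool(DATE_MATH_OP.match(op))

print(is_valid_date_math_op("+30s"))      # True
print(is_valid_date_math_op("+12500ms"))  # False: "ms" is not a date-math unit
```

So whatever component builds that `[+12500ms]` range (here it is the task manager's WorkloadAggregator) is emitting an interval Elasticsearch cannot parse.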
The Elasticsearch log file doesn't seem to contain anything interesting.
Furthermore, there seems to be an issue with the persistence of the integrations. It's not clear when this happens, but an integration can be installed and all the visualizations show correctly on the dashboards; then, after some time, the dashboards stop working and the following errors show up:
After reinstalling the integration it works again, until at some point it stops. I'm not sure exactly when this happens; I currently suspect one of the following:
- installing another integration
- restarting kibana process
Trying to force a reinstall fails with:
"error": "Internal Server Error",
I can confirm that after restarting Kibana the data view relationship breaks and the integration dashboards stop working.
Integrations are randomly reinstalled in other Kibana spaces. The random reinstall fixes the "data view not found" issue in the space where the integration landed, but this is not a solution, since there are situations where the dashboards should belong only to certain Kibana spaces.
When reinstalling the integration, it works in the chosen space until Kibana is restarted; then the dashboards only work in another, random space.
There were old objects from 2022 that had been installed by past integrations (Windows, System) but were not cleaned up by newer versions. These artifacts no longer seem to be used; apparently they were skipped during cleanup precisely because nothing referenced them.
Completely removing an integration and reinstalling it doesn't fix the issue.
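To find such leftovers, I filtered a saved-objects export by last-update year. A sketch of that check; the field names mirror what a Kibana saved-objects export contains (`type`, `id`, `updated_at`), but the sample data below is invented:

```python
from datetime import datetime

def find_stale_objects(objects: list, cutoff_year: int = 2023) -> list:
    """Return (type, id) of saved objects last updated before cutoff_year.

    `objects` mimics entries from a Kibana saved-objects export;
    the sample below is made up for illustration.
    """
    stale = []
    for obj in objects:
        updated = datetime.fromisoformat(obj["updated_at"].replace("Z", "+00:00"))
        if updated.year < cutoff_year:
            stale.append((obj["type"], obj["id"]))
    return stale

sample = [
    {"type": "dashboard", "id": "system-overview", "updated_at": "2022-03-01T10:00:00Z"},
    {"type": "index-pattern", "id": "logs-example", "updated_at": "2023-06-20T08:00:00Z"},
]
print(find_stale_objects(sample))  # [('dashboard', 'system-overview')]
```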
- Windows 1.24.0
- System 1.34.0
- Cisco ISE 1.9.0
Glad to help if any more information is required. It's a pain to work with integrations when suddenly everything starts to break and customers are on the line.
An idea: could there be interference when a data view already exists with the same name as the one the integration would create?
Integration in Space "S"
Data View in Space "S"
Kibana Space "L" Saved Objects: Many with the same name
Furthermore, memory usage seems to spike when working with Kibana, especially on the Fleet and Integrations pages. Client response time is very high too. There are three Kibana nodes, but none is under load. We increased Kibana's memory from 4 GB to 8 GB and see peaks around 5.5 GB of memory usage (a single user using Fleet/Integrations). As you can see in the graph, there is no user activity after the peaks. I'm not sure whether it is expected to use this much memory.
Upgrading to Elastic Stack 8.8.2 resolved the high-CPU issue.
We have the same problem: duplicated data views, and the dashboard could not locate its data view. The problems started with version 8.8.0.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.