Requests to Elasticsearch 8 in ECK freeze for an abnormally long time (caused by Kibana?)

Hi. We are currently working on implementing an Elastic Stack-based SIEM in Kubernetes, and I switched from standalone ES 7 to ECK with ES 8 for easier orchestration. The ECK configuration follows the quick start guide exactly, with no deviations. For storage I use an NFS-backed PVC - probably far from optimal, but it worked fine for ES 7. However, after setting up ES 8 I am facing a baffling problem.

First, compared to standalone ES 7 without security features (RBAC + TLS), requests to ES 8 take noticeably longer: 1-2 seconds per request versus under 1 second for ES 7. This is not a resource problem - the ES 8 pods are deployed with the same resources as the ES 7 pods were.

Second, when the cluster first starts, everything seems fine (increased delays notwithstanding), but after approximately 1 minute some PUT/POST requests to ES freeze completely, taking from 2 minutes to 10 minutes (!) to complete. This mostly seems to affect requests that add/delete ILM policies, roles, and users. The traffic is hard to inspect due to TLS, but a tcpdump capture suggests that ES acknowledges the request before the connection freezes; furthermore, not even TCP keepalives seem to be sent while it is frozen (the keepalive settings are at their defaults; the keepalive interval is 75 s on both ends of the connection). While the problem is present, CPU usage by the ES pods stays between 200m and 500m, well under the resource limit.
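For reference, this is roughly how I checked the keepalive settings and the per-connection timers (a sketch of my diagnosis steps; the pod IP 10.233.121.37 is specific to my cluster, and the actual packet capture was done separately with tcpdump on port 9200):

```shell
#!/bin/sh
# Kernel TCP keepalive defaults on the client host (Linux):
# idle time before the first probe, interval between probes, probe count.
sysctl net.ipv4.tcp_keepalive_time \
       net.ipv4.tcp_keepalive_intvl \
       net.ipv4.tcp_keepalive_probes

# Show timer state (including keepalive countdowns) for TCP connections
# to the ES pod; prints nothing if there are no matching connections.
ss -tno dst 10.233.121.37
```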

As long as such a request is frozen, similar concurrent requests usually freeze as well. However, GET requests, as well as malformed POST requests, get a response immediately. Checking cluster health at the time does not show anything notable - here is the output I got while the benchmark requests were completely frozen:

user@k8snode01:~/$ curl -k -u elastic:$PASSWORD https://10.233.121.37:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch-test-eck",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 9,
  "active_shards" : 9,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
user@k8snode01:~/$ curl -k -u elastic:$PASSWORD https://10.233.121.37:9200/_cluster/pending_tasks?pretty
{
  "tasks" : [ ]
}
user@k8snode01:~/$ curl -k -u elastic:$PASSWORD https://10.233.121.37:9200/_cat/shards?v'&'s=index
index                               shard prirep state   docs   store ip            node
.apm-agent-configuration            0     p      STARTED    0    225b 10.233.121.37 elasticsearch-test-eck-es-default-0
.apm-custom-link                    0     p      STARTED    0    225b 10.233.121.37 elasticsearch-test-eck-es-default-0
.ds-ilm-history-5-2023.03.27-000001 0     p      STARTED    6  18.4kb 10.233.121.37 elasticsearch-test-eck-es-default-0
.geoip_databases                    0     p      STARTED   42  40.6mb 10.233.121.37 elasticsearch-test-eck-es-default-0
.kibana-event-log-8.6.2-000001      0     p      STARTED    1     7kb 10.233.121.37 elasticsearch-test-eck-es-default-0
.kibana_8.6.2_001                   0     p      STARTED  251   2.6mb 10.233.121.37 elasticsearch-test-eck-es-default-0
.kibana_security_session_1          0     p      STARTED    0    225b 10.233.121.37 elasticsearch-test-eck-es-default-0
.kibana_task_manager_8.6.2_001      0     p      STARTED   27 188.3kb 10.233.121.37 elasticsearch-test-eck-es-default-0
.security-7                         0     p      STARTED  109   592kb 10.233.121.37 elasticsearch-test-eck-es-default-0

After around 10-20 minutes the problem disappears, and further requests go mostly without incident. If the cluster is restarted after this point (without deleting the data / PVCs), the problem does not seem to recur. I reproduced it with clusters of both 1 and 3 nodes.
This is a huge problem, as you can imagine. Beyond the obvious inconvenience and a roughly tenfold increase in deployment time, it also causes our initialization scripts to time out.

After testing this for a while, I found that the problem does not surface when Kibana (ECK-based) is not deployed; it seems to appear only after I deploy it. Kibana logins are also affected - logins with incorrect credentials are refused immediately, but logins with correct ones freeze and eventually fail with HTTP 504 (gateway timeout) behind the scenes.

Below are some log excerpts for a 1-node cluster. I used a makeshift curl-based benchmark to reproduce the problem and get proper timings: it repeatedly creates and deletes a role (benchmark-test-role) and measures the start/end time of each request:
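The benchmark is essentially the following loop (a simplified sketch, not the exact script; the role body is a minimal placeholder, and PASSWORD is assumed to be set in the environment):

```shell
#!/bin/sh
# Simplified sketch of the curl benchmark: create and delete a role in a
# loop, timing each request. The role body is a throwaway placeholder.
run_benchmark() {
  es_url=$1; runs=$2; i=0
  while [ "$i" -lt "$runs" ]; do
    i=$((i + 1))
    start=$(date +%s)
    code=$(curl -sk -o /dev/null -w '%{http_code}' -u "elastic:$PASSWORD" \
      -X PUT "$es_url/_security/role/benchmark-test-role" \
      -H 'Content-Type: application/json' -d '{"cluster":["monitor"]}')
    printf 'Run %03d: creating, HTTP %s, %d s; ' \
      "$i" "$code" $(($(date +%s) - start))
    start=$(date +%s)
    code=$(curl -sk -o /dev/null -w '%{http_code}' -u "elastic:$PASSWORD" \
      -X DELETE "$es_url/_security/role/benchmark-test-role")
    printf 'deleting, HTTP %s, %d s\n' "$code" $(($(date +%s) - start))
  done
}
# Example: run_benchmark https://10.233.121.37:9200 20
```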

Run 013: creating @ 10:10:59+00, HTTP 200, 1 s; deleting @ 10:11:00+00, HTTP 200, 1 s; stopped @ 10:11:01+00
Run 014: creating @ 10:11:02+00, HTTP 200, 2 s; deleting @ 10:11:04+00, HTTP 200, 1 s; stopped @ 10:11:05+00
Run 015: creating @ 10:11:06+00, HTTP 200, 1 s; deleting @ 10:11:07+00, HTTP 200, 1 s; stopped @ 10:11:08+00
Run 016: creating @ 10:11:09+00, HTTP 200, 4 s; deleting @ 10:11:13+00, HTTP 200, 146 s; stopped @ 10:13:39+00
Run 017: creating @ 10:13:40+00, HTTP 200, 588 s; deleting @ 10:23:28+00, HTTP 200, 416 s; stopped @ 10:30:24+00
Run 018: creating @ 10:30:24+00, HTTP 200, 1 s; deleting @ 10:30:25+00, HTTP 200, 1 s; stopped @ 10:30:26+00
Run 019: creating @ 10:30:26+00, HTTP 200, 1 s; deleting @ 10:30:27+00, HTTP 200, 1 s; stopped @ 10:30:28+00
Run 020: creating @ 10:30:29+00, HTTP 200, 1 s; deleting @ 10:30:30+00, HTTP 200, 1 s; stopped @ 10:30:31+00

According to these logs, the problem started at 10:11:13+00 and ended at 10:30:24+00.

Full Kibana logs, collected until the problem resolved:

[2023-03-27T10:09:42.193+00:00][INFO ][node] Kibana process configured with roles: [background_tasks, ui]
[2023-03-27T10:10:08.315+00:00][INFO ][plugins-service] Plugin "cloudChat" is disabled.
[2023-03-27T10:10:08.323+00:00][INFO ][plugins-service] Plugin "cloudExperiments" is disabled.
[2023-03-27T10:10:08.325+00:00][INFO ][plugins-service] Plugin "cloudFullStory" is disabled.
[2023-03-27T10:10:08.327+00:00][INFO ][plugins-service] Plugin "cloudGainsight" is disabled.
[2023-03-27T10:10:08.389+00:00][INFO ][plugins-service] Plugin "profiling" is disabled.
[2023-03-27T10:10:08.559+00:00][INFO ][http.server.Preboot] http server running at https://0.0.0.0:5601
[2023-03-27T10:10:08.637+00:00][INFO ][plugins-system.preboot] Setting up [1] plugins: [interactiveSetup]
[2023-03-27T10:10:08.724+00:00][WARN ][config.deprecation] The default mechanism for Reporting privileges will work differently in future versions, which will affect the behavior of this cluster. Set "xpack.reporting.roles.enabled" to "false" to adopt the future behavior before upgrading.
[2023-03-27T10:10:09.402+00:00][INFO ][plugins-system.standard] Setting up [127] plugins: [translations,monitoringCollection,licensing,globalSearch,globalSearchProviders,features,mapsEms,licenseApiGuard,usageCollection,taskManager,cloud,guidedOnboarding,telemetryCollectionManager,telemetryCollectionXpack,kibanaUsageCollection,share,screenshotMode,banners,newsfeed,fieldFormats,expressions,screenshotting,dataViews,embeddable,uiActionsEnhanced,charts,esUiShared,customIntegrations,home,searchprofiler,painlessLab,grokdebugger,management,advancedSettings,spaces,security,snapshotRestore,lists,encryptedSavedObjects,telemetry,licenseManagement,files,eventLog,actions,notifications,console,bfetch,data,watcher,reporting,fileUpload,ingestPipelines,alerting,unifiedSearch,unifiedFieldList,savedSearch,savedObjects,graph,savedObjectsTagging,savedObjectsManagement,presentationUtil,expressionShape,expressionRevealImage,expressionRepeatImage,expressionMetric,expressionImage,controls,eventAnnotation,dataViewFieldEditor,triggersActionsUi,transform,stackConnectors,stackAlerts,ruleRegistry,discover,fleet,indexManagement,remoteClusters,crossClusterReplication,indexLifecycleManagement,cloudSecurityPosture,discoverEnhanced,visualizations,canvas,visTypeXy,visTypeVislib,visTypeVega,visTypeTimeseries,rollup,visTypeTimelion,visTypeTagcloud,visTypeTable,visTypeMetric,visTypeHeatmap,visTypeMarkdown,dashboard,dashboardEnhanced,expressionXY,expressionTagcloud,expressionPartitionVis,visTypePie,expressionMetricVis,expressionLegacyMetricVis,expressionHeatmap,expressionGauge,lens,maps,dataVisualizer,cases,timelines,sessionView,kubernetesSecurity,threatIntelligence,osquery,observability,aiops,ml,synthetics,securitySolution,infra,upgradeAssistant,monitoring,logstash,enterpriseSearch,apm,visTypeGauge,dataViewManagement]
[2023-03-27T10:10:09.427+00:00][INFO ][plugins.taskManager] TaskManager is identified by the Kibana UUID: a6105a28-d930-490b-8b59-f2c69b42e4f9
[2023-03-27T10:10:09.809+00:00][INFO ][plugins.encryptedSavedObjects] Hashed 'xpack.encryptedSavedObjects.encryptionKey' for this instance: dxQcN1nN/Fx51+ZDZUMnnd1V47ICj00MlJt0ba1qnWM=
[2023-03-27T10:10:09.835+00:00][INFO ][plugins.notifications] Email Service Error: Email connector not specified.
[2023-03-27T10:10:11.006+00:00][WARN ][plugins.reporting.config] Found 'server.host: "0.0.0.0"' in Kibana configuration. Reporting is not able to use this as the Kibana server hostname. To enable PNG/PDF Reporting to work, 'xpack.reporting.kibanaServer.hostname: localhost' is automatically set in the configuration. You can prevent this message by adding 'xpack.reporting.kibanaServer.hostname: localhost' in kibana.yml.
[2023-03-27T10:10:11.378+00:00][INFO ][plugins.ruleRegistry] Installing common resources shared between all indices
[2023-03-27T10:10:11.469+00:00][INFO ][plugins.cloudSecurityPosture] Registered task successfully [Task: cloud_security_posture-stats_task]
[2023-03-27T10:10:13.388+00:00][INFO ][plugins.screenshotting.config] Chromium sandbox provides an additional layer of protection, and is supported for Linux Ubuntu 20.04 OS. Automatically enabling Chromium sandbox.
[2023-03-27T10:10:13.560+00:00][ERROR][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. security_exception: [security_exception] Reason: failed to authenticate service account [elastic/kibana] with token name [soc_kibana-test-eck_32231284-ca1a-49fe-9ed6-1022b4b65752]
[2023-03-27T10:10:16.067+00:00][INFO ][plugins.screenshotting.chromium] Browser executable: /usr/share/kibana/x-pack/plugins/screenshotting/chromium/headless_shell-linux_x64/headless_shell
[2023-03-27T10:10:51.216+00:00][INFO ][savedobjects-service] Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...
[2023-03-27T10:10:51.217+00:00][INFO ][savedobjects-service] Starting saved objects migrations
[2023-03-27T10:10:51.419+00:00][INFO ][savedobjects-service] [.kibana] INIT -> CREATE_NEW_TARGET. took: 118ms.
[2023-03-27T10:10:51.426+00:00][INFO ][savedobjects-service] [.kibana_task_manager] INIT -> CREATE_NEW_TARGET. took: 120ms.
[2023-03-27T10:10:56.099+00:00][INFO ][savedobjects-service] [.kibana] CREATE_NEW_TARGET -> MARK_VERSION_INDEX_READY. took: 4681ms.
[2023-03-27T10:10:56.916+00:00][INFO ][savedobjects-service] [.kibana_task_manager] CREATE_NEW_TARGET -> MARK_VERSION_INDEX_READY. took: 5490ms.
[2023-03-27T10:10:57.712+00:00][INFO ][savedobjects-service] [.kibana] MARK_VERSION_INDEX_READY -> DONE. took: 1613ms.
[2023-03-27T10:10:57.713+00:00][INFO ][savedobjects-service] [.kibana] Migration completed after 6413ms
[2023-03-27T10:10:58.903+00:00][INFO ][savedobjects-service] [.kibana_task_manager] MARK_VERSION_INDEX_READY -> DONE. took: 1987ms.
[2023-03-27T10:10:58.905+00:00][INFO ][savedobjects-service] [.kibana_task_manager] Migration completed after 7599ms
[2023-03-27T10:10:58.936+00:00][INFO ][status] Kibana is now unavailable
[2023-03-27T10:10:58.938+00:00][INFO ][plugins-system.preboot] Stopping all plugins.
[2023-03-27T10:10:58.943+00:00][INFO ][plugins-system.standard] Starting [127] plugins: [translations,monitoringCollection,licensing,globalSearch,globalSearchProviders,features,mapsEms,licenseApiGuard,usageCollection,taskManager,cloud,guidedOnboarding,telemetryCollectionManager,telemetryCollectionXpack,kibanaUsageCollection,share,screenshotMode,banners,newsfeed,fieldFormats,expressions,screenshotting,dataViews,embeddable,uiActionsEnhanced,charts,esUiShared,customIntegrations,home,searchprofiler,painlessLab,grokdebugger,management,advancedSettings,spaces,security,snapshotRestore,lists,encryptedSavedObjects,telemetry,licenseManagement,files,eventLog,actions,notifications,console,bfetch,data,watcher,reporting,fileUpload,ingestPipelines,alerting,unifiedSearch,unifiedFieldList,savedSearch,savedObjects,graph,savedObjectsTagging,savedObjectsManagement,presentationUtil,expressionShape,expressionRevealImage,expressionRepeatImage,expressionMetric,expressionImage,controls,eventAnnotation,dataViewFieldEditor,triggersActionsUi,transform,stackConnectors,stackAlerts,ruleRegistry,discover,fleet,indexManagement,remoteClusters,crossClusterReplication,indexLifecycleManagement,cloudSecurityPosture,discoverEnhanced,visualizations,canvas,visTypeXy,visTypeVislib,visTypeVega,visTypeTimeseries,rollup,visTypeTimelion,visTypeTagcloud,visTypeTable,visTypeMetric,visTypeHeatmap,visTypeMarkdown,dashboard,dashboardEnhanced,expressionXY,expressionTagcloud,expressionPartitionVis,visTypePie,expressionMetricVis,expressionLegacyMetricVis,expressionHeatmap,expressionGauge,lens,maps,dataVisualizer,cases,timelines,sessionView,kubernetesSecurity,threatIntelligence,osquery,observability,aiops,ml,synthetics,securitySolution,infra,upgradeAssistant,monitoring,logstash,enterpriseSearch,apm,visTypeGauge,dataViewManagement]
[2023-03-27T10:11:02.452+00:00][INFO ][plugins.fleet] Task Fleet-Usage-Sender-1.1.0 scheduled with interval 1h
[2023-03-27T10:11:02.587+00:00][INFO ][plugins.monitoring.monitoring] config sourced from: production cluster
[2023-03-27T10:11:08.763+00:00][INFO ][http.server.Kibana] http server running at https://0.0.0.0:5601
[2023-03-27T10:11:08.899+00:00][INFO ][status] Kibana is now degraded (was unavailable)
[2023-03-27T10:11:10.543+00:00][INFO ][plugins.monitoring.monitoring.kibana-monitoring] Starting monitoring stats collection
[2023-03-27T10:11:10.544+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2023-03-27T10:11:11.493+00:00][INFO ][plugins.fleet] Task Fleet-Usage-Logger-Task scheduled with interval 15m
[2023-03-27T10:11:21.397+00:00][INFO ][plugins.ruleRegistry] Installed common resources shared between all indices
[2023-03-27T10:11:21.400+00:00][INFO ][plugins.ruleRegistry] Installing resources for index .alerts-observability.uptime.alerts
[2023-03-27T10:11:21.409+00:00][INFO ][plugins.ruleRegistry] Installing resources for index .alerts-security.alerts
[2023-03-27T10:11:21.411+00:00][INFO ][plugins.ruleRegistry] Installing resources for index .preview.alerts-security.alerts
[2023-03-27T10:11:21.412+00:00][INFO ][plugins.ruleRegistry] Installing resources for index .alerts-observability.logs.alerts
[2023-03-27T10:11:21.413+00:00][INFO ][plugins.ruleRegistry] Installing resources for index .alerts-observability.metrics.alerts
[2023-03-27T10:11:21.414+00:00][INFO ][plugins.ruleRegistry] Installing resources for index .alerts-observability.apm.alerts
[2023-03-27T10:11:24.281+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-observability.uptime.alerts
[2023-03-27T10:11:24.285+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-observability.metrics.alerts
[2023-03-27T10:11:24.287+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-security.alerts
[2023-03-27T10:11:24.291+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-observability.logs.alerts
[2023-03-27T10:11:24.293+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .alerts-observability.apm.alerts
[2023-03-27T10:11:27.509+00:00][INFO ][plugins.ml] Task ML:saved-objects-sync-task: scheduled with interval 1h
[2023-03-27T10:11:34.057+00:00][INFO ][plugins.ruleRegistry] Installed resources for index .preview.alerts-security.alerts
[2023-03-27T10:11:34.073+00:00][INFO ][status] Kibana is now available (was degraded)
[2023-03-27T10:11:34.219+00:00][INFO ][plugins.fleet] Running Fleet Usage telemetry send task
[2023-03-27T10:11:34.292+00:00][INFO ][plugins.securitySolution.endpoint:metadata-check-transforms-task:0.0.1] no endpoint installation found
[2023-03-27T10:11:34.303+00:00][INFO ][plugins.reporting.store] Creating ILM policy for managing reporting indices: kibana-reporting
[2023-03-27T10:11:40.903+00:00][INFO ][plugins.fleet] Fleet Usage: {"agents_enabled":true,"agents":{"total_enrolled":0,"healthy":0,"unhealthy":0,"offline":0,"total_all_statuses":0,"updating":0},"fleet_server":{"total_all_statuses":0,"total_enrolled":0,"healthy":0,"unhealthy":0,"offline":0,"updating":0,"num_host_urls":0}}
[2023-03-27T10:11:48.964+00:00][INFO ][plugins.ml] Task ML:saved-objects-sync-task: 1 ML saved object synced
[2023-03-27T10:11:52.596+00:00][INFO ][plugins.fleet] Fleet setup completed
[2023-03-27T10:11:52.702+00:00][INFO ][plugins.securitySolution] Dependent plugin setup complete - Starting ManifestTask
[2023-03-27T10:11:52.980+00:00][ERROR][plugins.fleet] Failed to fetch latest version of synthetics from registry: '403 Forbidden' error response from package registry at https://epr.elastic.co/search?package=synthetics&prerelease=true&kibana.version=8.6.2
[2023-03-27T10:12:13.243+00:00][INFO ][plugins.synthetics] Installed synthetics index templates
[2023-03-27T10:13:11.045+00:00][ERROR][plugins.security.authorization] Error registering Kibana Privileges with Elasticsearch for kibana-.kibana: Request timed out
[2023-03-27T10:15:11.316+00:00][ERROR][plugins.security.authorization] Error registering Kibana Privileges with Elasticsearch for kibana-.kibana: Request timed out
[2023-03-27T10:26:45.416+00:00][INFO ][plugins.fleet] Fleet Usage: {"agents_enabled":true,"agents":{"total_enrolled":0,"healthy":0,"unhealthy":0,"offline":0,"total_all_statuses":0,"updating":0},"fleet_server":{"total_all_statuses":0,"total_enrolled":0,"healthy":0,"unhealthy":0,"offline":0,"updating":0,"num_host_urls":0}}

The Elasticsearch and ECK operator logs are too verbose to include inline, so here are the pastebin links: Elasticsearch; ECK operator. The notable timestamps are highlighted by lines beginning with '####'.

Does anyone know what exactly causes this, or how I can troubleshoot it? If this cluster freeze is unavoidable, is there a quick and reliable way to test the cluster for it (i.e., test whether some cluster objects exist), so that we can have our scripts wait for it to pass? Should I discard ECK Kibana in favor of the regular one?
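For context, the kind of check I have in mind is a cheap write probe with a short timeout - since GETs always respond, the probe has to be a write. Something like this (the role name and thresholds are placeholders I made up, not anything our scripts currently do):

```shell
#!/bin/sh
# Succeeds only if a small write (a role upsert) completes within 5 s;
# our init scripts could poll this before doing real work.
probe_es_writes() {
  es_url=$1
  code=$(curl -sk -m 5 -o /dev/null -w '%{http_code}' -u "elastic:$PASSWORD" \
    -X PUT "$es_url/_security/role/readiness-probe-role" \
    -H 'Content-Type: application/json' -d '{"cluster":["monitor"]}')
  [ "$code" = "200" ]
}
# Example: until probe_es_writes https://10.233.121.37:9200; do sleep 10; done
```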

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.