Kibana memory consumption growing linearly until it crashes

Hi there! I have an issue with a simple Elasticsearch / Kibana deployment on OpenShift. Even with no data being ingested into Elasticsearch and virtually no users (just me testing), Kibana’s memory consumption grows linearly until the container is OOMKilled. My configuration is quite minimal and I use the official container images, version 9.1.4. I have one Elasticsearch pod and one Kibana pod.

This is what I tried so far:

  • Give the container more memory. I tried from 2 GiB up to 6 GiB, but Kibana just keeps consuming memory until it crashes; it never seems to level off.
  • Set NODE_OPTIONS with --max-old-space-size=1024; also tried 2048 and 512 (see the example after this list).
  • Disabled some plugins and features:
telemetry.enabled: false
xpack.reporting.enabled: false
xpack.canvas.enabled: false
xpack.fleet.enabled: false
xpack.indicesMetadata.enabled: false
xpack.screenshotting.enabled: false
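
For reference, a minimal sketch of how the memory settings can be applied to the deployment (illustrative only; it assumes the deployment is called kibana, as in the commands further down, and that you have edit rights on it):

# cap the Node.js old-space heap via NODE_OPTIONS on the Kibana deployment
oc set env deployment/kibana NODE_OPTIONS="--max-old-space-size=1024"
# and/or raise the container memory limit
oc set resources deployment/kibana --limits=memory=4Gi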

The logs don’t show anything that I find interesting.

Using the /api/status endpoint, I can see that the heap usage is constant, but resident_set_size_in_bytes keeps growing very fast.

λ oc exec -it deploy/kibana -- curl -k https://localhost:5601/api/status -k |jq .metrics.process.memory; date
{
  "heap": {
    "total_in_bytes": 396181504,
    "used_in_bytes": 379005328,
    "size_limit": 562036736
  },
  "resident_set_size_in_bytes": 706420736,
  "array_buffers_in_bytes": 492831,
  "external_in_bytes": 4514730
}
Mon Nov 24 13:38:06     2025
λ oc exec -it deploy/kibana -- curl -k https://localhost:5601/api/status -k |jq .metrics.process.memory; date
{
  "heap": {
    "total_in_bytes": 408502272,
    "used_in_bytes": 388229616,
    "size_limit": 562036736
  },
  "resident_set_size_in_bytes": 808656896,
  "array_buffers_in_bytes": 566169,
  "external_in_bytes": 4588012
}
Mon Nov 24 13:45:15     2025
λ oc exec -it deploy/kibana -- curl -k https://localhost:5601/api/status -k |jq .metrics.process.memory; date
{
  "heap": {
    "total_in_bytes": 413745152,
    "used_in_bytes": 397290256,
    "size_limit": 562036736
  },
  "resident_set_size_in_bytes": 901521408,
  "array_buffers_in_bytes": 638061,
  "external_in_bytes": 4659680
}
Mon Nov 24 13:52:15     2025
λ oc exec -it deploy/kibana -- curl -k https://localhost:5601/api/status -k |jq .metrics.process.memory; date
{
  "heap": {
    "total_in_bytes": 447463424,
    "used_in_bytes": 430677584,
    "size_limit": 562036736
  },
  "resident_set_size_in_bytes": 1174245376,
  "array_buffers_in_bytes": 846778,
  "external_in_bytes": 4868845
}
Mon Nov 24 14:11:28     2025

I would appreciate any pointers towards investigating the root cause of this issue.

Cheers,

Fabio

So far I have observed the following:

  1. Kibana’s memory allocation grows linearly until the container crashes (OOMKilled).
  2. A heap dump shows large retention of the following objects: ServerHttp2Session, Http2Session, TLSSocket, TLSWrap (see the sketch after this list for one way such a snapshot can be captured).
  3. When the OpenShift route is deleted, memory allocation stops increasing. When the route is recreated, memory allocation grows again until the container crashes. It does not matter whether the route is actively invoked.
  4. Looking at /proc/net/tcp I don’t see stale connections; the connections seem to get closed properly, but the sessions are not purged from memory.
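
For completeness, one way such a heap snapshot can be captured (a sketch only, not necessarily how mine was produced; it assumes --heapsnapshot-signal is honoured via NODE_OPTIONS and that pgrep exists in the image):

# make the Kibana Node.js process write a .heapsnapshot file when it receives SIGUSR2
oc set env deployment/kibana NODE_OPTIONS="--heapsnapshot-signal=SIGUSR2"
# once memory has grown, trigger the snapshot (it is written to Kibana's working directory,
# and taking it can temporarily need a lot of extra memory)
oc exec -it deploy/kibana -- sh -c 'kill -USR2 $(pgrep -f node | head -n1)'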

Welcome to the forum. And sorry that no one answered your post the first time.

Did you double-check with e.g. tcpdump?
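
Something along these lines would do it, assuming you can run a capture somewhere the traffic is visible (on the Kibana pod or on the router node):

# watch connection setup/teardown on the Kibana port; -nn skips name/port resolution
tcpdump -i any -nn 'tcp port 5601'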

Before it crashes, is Kibana operable and properly connected to Elasticsearch? Any interesting logs on the Elasticsearch side?

Hey, thanks for having a look.

So far I have been able to confirm the following.

The memory leak only happens when TLS is enabled for the Kibana server and the OpenShift router is in front of it. The router performs 4 health checks per second (2 routers, every 500 ms) by establishing a TCP connection and immediately terminating it; it’s HAProxy with check inter 500. While those checks are running, I see the memory consumption going up until the container crashes.

Cases where the memory leak does not happen:

  • Kibana service with TLS enabled and no OpenShift route.
  • Kibana service with TLS disabled, with or without OpenShift route (termination: edge).
  • Kibana service with TLS enabled but with server.protocol: http1.

For now, we’ll go with server.protocol: http1, but you may want to have a look at why the ServerHttp2Session objects are never released in this case.

Unfortunately, I’m not able to attach images here; I always get an error message.

Cheers,

Fabio

Can you share the haproxy config please?

The team responsible for the OpenShift cluster informed me that check inter 500 is the relevant part; HAProxy is therefore the one making the connections.

Respectfully, that’s not helpful.

If I were flippant, I’d say “ask the team responsible for the openshift cluster” to figure it out then :slight_smile:

More seriously, on the face of it you could have uncovered a Kibana bug. That would be useful to confirm/find/fix. Kibana should not leak memory, certainly not to the point of OOM. But, if I’ve understood correctly, the leak only happens with HAProxy in a TLS-enabled flow, and seems related to the checks performed every 500 ms, right? What’s performing those checks, and how is it configured? The answer is HAProxy, and the configuration is, IMO, more than just check inter 500. I also suggested you validate that the connections are correctly closed with tcpdump. Did you do that?

EDIT: Not sure if it helps, but you can enable the debugger by sending SIGUSR1 to the Node.js process with kill, which will log something like:

Debugger listening on ws://127.0.0.1:9229/b818ce57-315e-4110-bd83-9b72a17fc3d1

You can then attach with (eg) Chrome DevTools and monitor the creation/deletion of sockets, sessions, etc.
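
For example, something like this (a sketch, assuming pgrep is available in the Kibana image and that you are allowed to port-forward to the inspector port):

# send SIGUSR1 to the Kibana Node.js process to start the inspector on 127.0.0.1:9229
oc exec -it deploy/kibana -- sh -c 'kill -USR1 $(pgrep -f node | head -n1)'
# then forward the port locally and attach Chrome DevTools via chrome://inspect
oc port-forward deploy/kibana 9229:9229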

I’m in a corporate environment where things are not so simple. They claim they have used this configuration for many services for many years and no one has ever complained, so the problem must be on Kibana’s side. So far I have no arguments to disagree with them.

I don’t have permission to run tcpdump or port-forward or to do anything else that requires privileged access, but I can look at /proc/net/tcp, where I see:

sh-5.1$ cat /proc/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:15E1 00000000:0000 0A 00000000:00000000 00:00000000 00000000 1000850000        0 350421419 1 0000000000000000 100 0 0 10 0
   1: 980213AC:A8DA A20111AC:23F0 01 00000000:00000000 02:00000012 00000000 1000850000        0 417975169 2 0000000000000000 20 4 28 10 -1
   2: 980213AC:A962 A20111AC:23F0 01 00000000:00000000 02:00000012 00000000 1000850000        0 350421457 2 0000000000000000 20 4 30 10 -1
   3: 980213AC:D6B0 A20111AC:23F0 01 00000000:00000000 02:0000004B 00000000 1000850000        0 418617447 3 0000000000000000 20 4 30 10 -1

The relevant part is the line with local_address 00000000:15E1 (15E1 is hex for 5601) and st 0A (LISTEN).

When a connection is established (for example from the HAProxy health checks), I can see, for a very short time, a connection with st 01 (ESTABLISHED) on that port. But even with many thousands of sessions in Kibana’s memory, /proc/net/tcp shows 0 established (or TIME_WAIT) connections.
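
For anyone following along, a quick way to count them is something like this (15E1 is 5601 in hex; state 0A = LISTEN, 01 = ESTABLISHED, 06 = TIME_WAIT; check /proc/net/tcp6 too if the listener is on IPv6):

# count non-LISTEN sockets on local port 5601 (field 2 = local address, field 4 = state)
awk '$2 ~ /:15E1$/ && $4 != "0A"' /proc/net/tcp | wc -l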

The problem only happens with TLS enabled in Kibana and HTTP/2 mode (the default). As soon as I start Kibana with TLS and server.protocol: http1, the leak is gone.

The haproxy config makes it establish a TCP connection and close it immediately, without performing a TLS handshake.

For us, the workaround with http1 is viable, but I’d be glad to help you guys.

@Fabio_Hecht also wished to share this screenshot:

First of all, I understand your point about having limited access in a corporate environment. Been there myself.

Well, the same works both ways: many people use Kibana in conjunction with HAProxy (I did so myself in the past), and have done so for many years. And AFAIK this (your experience) is not a known issue.

I don’t work for Elastic and have no skin in the game, just trying to help. My take is that you might have uncovered something in Kibana which, if confirmed, should be found and fixed. Or you might just have an effectively broken HAProxy config. Or indeed something else entirely.

You can open a GitHub bug on it.

I would not be the one working on that bug report, but usually someone attempts to reproduce the issue with a minimal configuration. In a parallel thread, someone found a bug in top hits aggregations that could not easily be reproduced by Elastic staff, so the same someone sent Elastic a couple of GB of data to show how to reproduce it, and the bug was identified quickly (same day!). In your case, they are likely going to want/need to see the full HAProxy and Kibana config, since by your own report it’s only when HAProxy is added to the mix that you have a problem.

One little thing: does the leak happen without any real user traffic? I.e., as I asked above, is the 500 ms health check alone enough to cause the leak?

Thanks, I will try to open the issue on GitHub this week.

I think someone could try to reproduce it with netcat or another tool that opens and immediately closes TCP connections. If that does not work, then try with a real HAProxy.

Can you please attach the other screenshot I sent you, the one with the heap dump analysis? Thanks.

And yes, the memory leaks day and night, and only when HAProxy is sending the health checks.

Cheers,

Fabio

Here's the other screenshot, sorry I missed it.

IMO they will need your HAProxy config to reproduce your issue, but maybe I am wrong.

#!/bin/sh
while true; do
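  # open a TCP connection to Kibana and close it straight away, roughly mimicking the router's TCP health check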
  nc -z 127.0.0.1 5601
  sleep 1
done

(replace IP with your own)

Should reproduce the issue, no? Does it in your case?

Did you? Any news/progress?

So far I haven’t been able to reproduce the issue with nc -z in a loop. I will try to collect some traces to check what those health checks are actually doing, and hopefully be able to reproduce the issue.