Kibana unexpectedly shutting down

We are currently running an Elasticsearch & Kibana (7.17.9) cluster with:

  • 3 master nodes and 6 data nodes

  • Kibana is installed on each of the 3 master nodes (total 3 instances)

The thing is that the Kibana servers (01-03) randomly shut down with an error (about once a year):

"ETIMEOUT at TCP.onStreamRead (node:internal/stream_base_commons:217:20)

so I have to restart the Kibana service manually. Elasticsearch kept running as normal, though.

We are running Kibana in the background. I was thinking about creating a bash script to detect a failure and restart Kibana automatically.
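Something along these lines is what I had in mind (a rough sketch only; the service name "kibana", port 5601 and the use of systemd are assumptions about our setup, and I would run it from cron every minute):

#!/usr/bin/env bash
# kibana-watchdog.sh - restart Kibana if its status API stops answering
# Assumes Kibana is managed by systemd as "kibana" and listens on port 5601
KIBANA_URL="http://localhost:5601/api/status"

if ! curl -sf --max-time 10 "$KIBANA_URL" > /dev/null; then
    echo "$(date '+%F %T') Kibana not responding, restarting" >> /var/log/kibana-watchdog.log
    systemctl restart kibana
fi

# crontab entry, checking once a minute:
# * * * * * /usr/local/bin/kibana-watchdog.sh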

If anyone has better suggestions or experience handling this kind of issue, please let me know

Thanks in advance

Welcome to the forum @levi19 !

Others might help more on the specific issue, but:

Do they all shut down at the same time, or just a random one of the 3?

But anyways, if someone suggests a theory and a fix, and I hope they do, it's going to be a while before you can be sure it actually helped. Maybe just a pragmatic approach is also helpful.

I am presuming you mean just restarting kibana is enough to get things back to BAU. You don't say anything about the host, but on Linux (and likely Windows) there are plenty of tools that can monitor whether something is running and restart it if not. E.g. with systemd on Linux you can configure the service to restart on failure, see here for an example doc.
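For instance, a systemd drop-in like this would do it (a minimal sketch, assuming kibana runs as a unit called kibana.service; adjust names to your setup):

# /etc/systemd/system/kibana.service.d/override.conf
# (create it with: systemctl edit kibana)
[Service]
Restart=on-failure
RestartSec=30

# then reload and restart:
#   systemctl daemon-reload && systemctl restart kibana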

EDIT: although I note you say the elasticsearch cluster was running fine when kibana crashed, please also check the elasticsearch logs for any warning, error, or otherwise unusual messages that might give a clue as to what went wrong with kibana.


Thank you for replying to me.
They all run on Linux servers.
Do they all shut down at the same time, or just a random one of the 3?
=> No, just one of them shuts down, at random.
One month ago, kibana-03 shut down.

Two weeks later, kibana-01 shut down, and kibana-02 just shut down yesterday.

I noticed they shut down around 07:03-07:04. Also, this is the time when the current elastic master (the one with the "*" icon) rotates its log (I don't know how to configure the time for rotating the log).

Following your suggestion, I checked the log of elasticsearch-master-02 (where kibana-02 is installed), and I found a line like this:

[2025-08-09T07:03:00,837][WARN][o.e.t.TransportService] [elastic-master-02] Received response for a request that has timed out, sent [13s/13006ms] ago, timed out [3s/3002ms] ago, action [internal:coordination/fault_detection/follower_check], node [{elastic-master-03}{8yuiuoang-1vmkotppklRE}.....

I'm not sure whether it's related or not…

Any suggestions would be highly appreciated

Check the logs of all the elasticsearch nodes please, you might find more similar messages.

But this sort of tallies with what you wrote: there are various timeouts that default to 10s and you are likely hitting one occasionally, though a bit more often than the "about once a year" - at least 3 times in the last month!
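For reference, the follower check in your log line is governed by settings along these lines (defaults quoted from memory, so please check the 7.17 docs before changing anything; they are static settings that go in elasticsearch.yml on each node):

# elasticsearch.yml - fault detection settings (7.x defaults, from memory)
# Raising these only papers over whatever is making the node slow to respond
cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.follower_check.retry_count: 3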

Mmmm. The old log is being gzipped? If the log files are decently big, this takes CPU power. I'm not sure how best you can control this. Maybe rotate more often, like hourly? This is controlled in log4j I think. Are the masters VMs or something virtual, where you could maybe give them more oomph?

I checked all the Elasticsearch nodes and found that one of the data nodes has a warning message like this:

[o.e.t.OutboundHandler] [elastic-data-02] sending transport message [Request{internal:coordination/fault_detection/follower_check}{false}{false}{false} of size 412 on [Netty4TcpChannel…

Hello @levi19

A few observations:

  1. You mentioned that Kibana crashes consistently between 07:03–07:04, which coincides with the Elasticsearch log rollover.
    Does this rollover occur daily across all three nodes? If so and Kibana does not crash every day, it suggests that the rollover itself may not be the direct cause. It might be worth checking if there’s a specific condition or load spike during rollover on the days Kibana fails.

  2. Since Kibana is running on the same nodes as your Elasticsearch masters, I’m curious about the following:
    When Kibana crashes, is the affected node currently the master?
    Do you observe a master node re-election immediately after the crash?
    For example, if node 01 is the master and Kibana crashes on that node, does the master role shift to node 02? This could indicate that the crash is impacting the node’s stability and triggering a master change.

  3. The error line you shared is helpful, but sometimes the surrounding log entries (a few lines before and after) can provide critical context.
    Could you please share a broader snippet from the Kibana logs around the time of the crash?

  4. You mentioned that the Kibana crashes on nodes 01, 02 and 03 show the same log type and error. Is the error pattern identical across all crashes?

Thanks!!

  1. Yes, this rollover occurs daily across all three nodes
  2. No, the cluster health is in good condition, there’s no sign of re-election
  3. The log looks like this:

{"type":"response","@timestamp":"2025-08-09T09:02:24+09:00","tags","pid":22903,"method":"get","statusCode":200,"req":{"url":"/api/status","method":"get","headers":{"host":"ip-kibana:5601","user-agent":"Elastic-Metricbeat/7.17.9 (linux; amd64; 47a345e45b4sdsgg4567jjk09; 2023-01-31 11:38:38 +0000 UTC)","x-elastic-product-origin":"beats","accept-encoding":"gzip"},"remoteAddress":"ip-metricbeat","user-agent":"Elastic-Metricbeat/7.17.9 (linux; amd64; 47a345e45b4sdsgg4567jjk09; 2023-01-31 11:38:38 +0000 UTC)"},"res":{"statusCode":200,"responseTime":5,"contentLength":24727},"message":"GET /api/status 200 5ms - 24.1KB"}

This repeats until the crash happens.

{"type":"log","@timestamp":"2025-08-09T09:03:24+09:00","tags":["info","plugins-service"],"pid":11324,"message":"Plugin "metricsEntities" is disabled"}
{"type":"log","@timestamp":"2025-08-09T09:03:24+09:00","tags":["info","http","server","Preboot"],"pid":11324,"message":"http server running at ``http://ip-kibana:5601``"} => The auto script has activated

=> As I’ve already configured the auto-restart script, it will restart right away (check every 1 min)

And then it will start some plugins, install some resources, indices…

{"type":"log","@timestamp":"2025-08-09T09:04:00+09:00","tags":["info","http","server","Kibana"],"pid":11324,"message":"http server running at http://ip-kibana:5601"}
{"type":"log","@timestamp":"2025-08-09T09:04:20+09:00","tags":["info","plugins","reporting","chromium"],"pid":11324,"message":"Browser executable /app/kibana-7.17.9-linux....x86/plugins/reporting/chromum/headless_shell-linux_x64/headless_shell"}
  4. Is the error pattern identical across all crashes? => Yes, it is

Thank you


Just to check, these 3x instances running es-master and kibana are not all running on the same physical host?

Do you have monitoring of the masters, so you can see (e.g.) the CPU load at reasonable resolution around the time kibana crashes? How big are the log files?
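If you don't have proper monitoring, even sysstat on the masters is enough for a rough look (assuming the sysstat package is installed and collecting; archive paths vary by distro):

# CPU usage between 07:00 and 07:10 from today's sysstat data
sar -u -s 07:00:00 -e 07:10:00
# same window from an earlier day's archive, e.g. the 9th
sar -u -s 07:00:00 -e 07:10:00 -f /var/log/sa/sa09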

And you wrote that the crashes happen right around the time the current elected master rotates its log.

If this is a consistent pattern, then surely it cannot be coincidence? AFAIK by default (I simply can't recall for 7.x) elasticsearch log rotation happens daily and old logs are compressed with gzip, but this is all controlled in the log4j2.properties file. E.g. take away the .gz from the filePattern settings and it won't compress the associated logs.
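From memory the relevant appender in the default log4j2.properties looks roughly like this (a sketch only, so compare against the actual file shipped with your 7.17.9 install before editing):

# log4j2.properties (rough sketch, not copied from a real 7.17 install)
# The %d{...} date pattern controls how often the file rolls over,
# and the .gz suffix is what triggers compression of the rolled file.
appender.rolling.type = RollingFile
appender.rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}_server.json
# daily rollover, gzip-compressed (roughly the default):
appender.rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}-%d{yyyy-MM-dd}-%i.json.gz
# hourly rollover, uncompressed, would look more like:
# appender.rolling.filePattern = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}-%d{yyyy-MM-dd-HH}-%i.json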
