Node crashing in a loop

Hey there,

So, I currently have a cluster for time series data running with security enabled.

It has 5 nodes:

  • 1 dedicated master-eligible node, let's call it node1
  • 2 master-eligible, data and ingest nodes, tagged hot, let's call them node2 and node3
  • 2 data and ingest nodes, tagged cold, let's call them node4 and node5

Around 20 Beats send data to node2 or node3; both are always configured as outputs. It's a dev environment, so I know it's bad practice, but the cluster is also monitoring itself.

Everything was working fine until recently, but now node2 is continuously crashing in a loop (for reference, it holds around 70 shards).

The pattern is the following:

  • the node crashes and restarts
  • the cluster loses it, then finds it again and starts reallocating shards onto it
  • once all shards are back and everything looks fine, it crashes again
  • and so on

The problem seems to come from a parsing issue:

Caused by: org.elasticsearch.ElasticsearchParseException: failed to parse date field [2020-05-07T11:36:12.395Z] with format [epoch_millis]: [failed to parse date field [2020-05-07T11:36:12.395Z] with format [epoch_millis]]
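From what I understand, this error means the ISO-8601 timestamp in a range query is being parsed against a field whose mapping only allows epoch_millis. A minimal way to reproduce that kind of mismatch (hypothetical index and mapping, not necessarily what my cluster has):

PUT epoch-test
{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date", "format": "epoch_millis" }
    }
  }
}

GET epoch-test/_search
{
  "query": {
    "range": { "@timestamp": { "gte": "2020-05-07T11:36:12.395Z" } }
  }
}

The search then fails with the same "failed to parse date field [...] with format [epoch_millis]" message; adding "format": "strict_date_optional_time" to the range clause makes it pass.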

I only found occurrences of this reported by other people using older versions of Filebeat, but the only Filebeat that was sending data to this cluster is up to date and now turned off (and, for some reason, it always sent data to a "ghost" standalone cluster instead of this one). So I really don't understand what could be causing this now when everything was working fine before, and I can't tie the start of the issue to an upgrade or configuration change, although that's definitely possible.

Any idea of what could be causing these crashes? I'm running out of ideas...

For information, here are the full Elasticsearch logs (cleaned up by me) from the problematic node when this happens:

EDIT: Can't write it here directly as it's too long... So still looking for a workaround.

Thanks in advance for your answers!

Cheers,

jsu

This shouldn't be caused by the formatting error.

What do the logs before the "crash" show?


Hey @warkolm, thanks for the answer!

Here, I'll try to paste the useful part (as I couldn't paste the whole thing before).

[2020-05-07T11:38:44,619][DEBUG][o.e.a.s.TransportSearchAction] [node2] [ilm-history-1-000001][0], node[HVFVdzG2QneJ38vvF-Ekhw], [R], s[STARTED], a[id=6JJO3updTyegFh9B3UWxLA]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[.apm-agent-configuration, .kibana, .kibana_1, .kibana_2, .kibana_3, .kibana_4, .kibana_task_manager, .kibana_task_manager_1, .kibana_task_manager_2, .kibana_task_manager_3, .kibana_task_manager_4, .ml-annotations-6, .ml-annotations-read, .ml-annotations-write, .ml-anomalies-.write-asdf, .ml-anomalies-.write-sadf, .ml-anomalies-asdf, .ml-anomalies-sadf, .ml-anomalies-shared, .ml-config, .ml-notifications, .ml-state, .ml-state-write, .monitoring-alerts-7, .monitoring-beats-7-2020.05.01, .monitoring-beats-7-2020.05.02, .monitoring-beats-7-2020.05.03, .monitoring-beats-7-2020.05.04, .monitoring-beats-7-2020.05.05, .monitoring-beats-7-2020.05.06, .monitoring-beats-7-2020.05.07, .monitoring-es-7-2020.05.01, .monitoring-es-7-2020.05.02, .monitoring-es-7-2020.05.03, .monitoring-es-7-2020.05.04, .monitoring-es-7-2020.05.05, .monitoring-es-7-2020.05.06, .monitoring-es-7-2020.05.07, .monitoring-kibana-7-2020.05.01, .monitoring-kibana-7-2020.05.02, .monitoring-kibana-7-2020.05.03, .monitoring-kibana-7-2020.05.04, .monitoring-kibana-7-2020.05.05, .monitoring-kibana-7-2020.05.06, .monitoring-kibana-7-2020.05.07, .reporting-2020.04.26, .reporting-2020.05.03, .siem-signals-default, .siem-signals-default-000001, .siem-signals-default-000002, .siem-signals-default-000003, .slm-history-1, .slm-history-1-000001, .slm-history-1-000002, .slm-history-1-000003, .slm-history-1-000004, .tasks, .triggered_watches, .watches, auditbeat, auditbeat-000001, auditbeat-000002, auditbeat-000003, auditbeat-000004, auditbeat-000005, auditbeat-000006, auditbeat-000007, auditbeat-000008, indices_e, indices_e-000001, indices_e-000002, indices_e-000003, indices_e-000004, indices_e-000005, indices_e-000006, indices_e-000007, indices_e-000008, indices_f, indices_a-000002, indices_a-000003, indices_a-000004, indices_a-000005, indices_a-000006, indices_a-000007, indices_a-000008, indices_b, indices_b-000001, indices_b-000002, indices_b-000003, indices_b-000004, indices_b-000005, indices_b-000006, indices_b-000007, indices_c, indices_c-000002, indices_c-000003, indices_c-000004, indices_c-000005, indices_c-000006, indices_c-000007, indices_c-000008, indices_d, indices_d-000002, indices_d-000003, indices_d-000004, indices_d-000005, indices_d-000006, indices_d-000007, indices_d-000008, filebeat, filebeat-000001, filebeat-000002, filebeat-000003, filebeat-000004, filebeat-000005, filebeat-000006, filebeat-000007, filebeat-000008, ilm-history-1, ilm-history-1-000001, ilm-history-1-000002, ilm-history-1-000003, metricbeat, metricbeat-000002, metricbeat-000003, metricbeat-000004, metricbeat-000005, metricbeat-000006, metricbeat-000007, metricbeat-000008, metricbeat-000009, metricbeat-000010, metricbeat-000011, metricbeat-demo, metricbeat-demo-000001, metricbeat-demo-000002, metricbeat-demo-000003, metricbeat-demo-000004, metricbeat-demo-000005, packetbeat, packetbeat-000001, packetbeat-000002, packetbeat-000003, packetbeat-000004, packetbeat-000005, packetbeat-000006, packetbeat-000007, packetbeat-000008, rollover_test_1, rollover_test_1-000001, rollover_test_1-000002, shrink-auditbeat-000001, shrink-auditbeat-000002, shrink-auditbeat-000003, shrink-auditbeat-000004, shrink-auditbeat-000005, shrink-auditbeat-000006, shrink-auditbeat-000007, shrink-indices_e-000001, shrink-indices_e-000002, shrink-indices_e-000003, shrink-indices_e-000004, shrink-indices_e-000005, shrink-indices_e-000006, shrink-indices_e-000007, shrink-indices_a-000002, shrink-indices_a-000003, shrink-indices_a-000004, shrink-indices_a-000005, shrink-indices_a-000006, shrink-indices_a-000007, shrink-indices_b-000001, shrink-indices_b-000002, shrink-indices_b-000003, shrink-indices_b-000004, shrink-indices_b-000005, shrink-indices_b-000006, shrink-indices_c-000002, shrink-indices_c-000003, shrink-indices_c-000004, shrink-indices_c-000005, shrink-indices_c-000006, shrink-indices_c-000007, shrink-indices_d-000001, shrink-indices_d-000002, shrink-indices_d-000003, shrink-indices_d-000004, shrink-indices_d-000005, shrink-indices_d-000006, shrink-indices_d-000007, shrink-filebeat-000001, shrink-filebeat-000002, shrink-filebeat-000003, shrink-filebeat-000004, shrink-filebeat-000005, shrink-filebeat-000006, shrink-filebeat-000007, shrink-metricbeat-000001, shrink-metricbeat-000002, shrink-metricbeat-000003, shrink-metricbeat-000004, shrink-metricbeat-000005, shrink-metricbeat-000006, shrink-metricbeat-000007, shrink-metricbeat-000008, shrink-metricbeat-000009, shrink-metricbeat-000010, shrink-metricbeat-demo-000001, shrink-metricbeat-demo-000002, shrink-metricbeat-demo-000003, shrink-metricbeat-demo-000004, shrink-packetbeat-000001, shrink-packetbeat-000002, shrink-packetbeat-000003, shrink-packetbeat-000004, shrink-packetbeat-000005, shrink-packetbeat-000006, shrink-packetbeat-000007, shrink-rollover_test_1-000001, shrink-rollover_test_1-000002, shrink-winlogbeat-000001, shrink-winlogbeat-000002, shrink-winlogbeat-000003, shrink-winlogbeat-000004, shrink-winlogbeat-000005, shrink-winlogbeat-000006, shrink-winlogbeat-000007, winlogbeat, winlogbeat-000001, winlogbeat-000002, winlogbeat-000003, winlogbeat-000004, winlogbeat-000005, winlogbeat-000006, winlogbeat-000007, winlogbeat-000008], indicesOptions=IndicesOptions[ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=Scroll{keepAlive=1m}, maxConcurrentShardRequests=0, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={"size":1000,"query":{"bool":{"must":[{"range":{"@timestamp":{"from":"2020-05-07T11:36:12.395Z","to":"2020-05-07T11:38:42.395Z","include_lower":true,"include_upper":true,"boost":1.0}}}],"adjust_pure_negative":true,"boost":1.0}}}}] lastShard [true] org.elasticsearch.transport.RemoteTransportException: [node5][111.22.10.122:9300][indices:data/read/search[phase/query]]

Caused by: org.elasticsearch.ElasticsearchParseException: failed to parse date field [2020-05-07T11:36:12.395Z] with format [epoch_millis]: [failed to parse date field [2020-05-07T11:36:12.395Z] with format [epoch_millis]]

Caused by: java.lang.IllegalArgumentException: failed to parse date field [2020-05-07T11:36:12.395Z] with format [epoch_millis]

Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: date_time_parse_exception: Failed to parse with all enclosed parsers


[2020-05-07T11:38:48,398][WARN ][r.suppressed             ] [node2] path: /_monitoring/bulk, params: {system_id=kibana, system_api_version=6, interval=10000ms}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
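For what it's worth, when that block shows up it should be possible to check whether the node still knows about an elected master and what state the cluster is in with something like this (just a generic check, not output from my cluster):

GET _cat/master?v
GET _cluster/health?pretty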

Is there more before this you can share?

You can also use gist/pastebin/etc. if there's not enough space here 🙂


Ok sure.

As I'm not sure what will be helpful, here is the whole last .log file, without all the stack traces.

[2020-05-12T08:34:20,639][WARN ][o.e.g.PersistedClusterStateService] [node2] writing cluster state took [36718ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [138] unchanged indices

That seems way more relevant.

What is the output of _cat/nodes?v and _cat/allocation?v?


Well, my understanding so far was that the timeout was already a sign of the node being unavailable.

Looking around, I thought it was possible that the parsing issue could flood the node's logs and hurt its performance, which could then lead to timeouts and crashes.

So, here's the output of _cat/nodes?v&s=name:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
111.22.10.100           31          97   8    0.26    0.41     0.44 m         *      node1
111.22.10.105           37          95  12    3.62    4.26     5.36 dilm      -      node2
111.22.10.118           42          97  15    1.57    3.09     3.52 dilm      -      node3
111.22.10.120           69          96  21    0.86    0.76     0.81 di        -      node4
111.22.10.122           46          93  14    1.27    1.07     0.90 di        -      node5

and here's the _cat/allocation?v&s=node:

shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node
    30                                                                                       UNASSIGNED
    35       19.1gb   164.3gb      179gb    343.3gb           47 111.22.10.105 172.16.10.105 node2
    81      176.1gb   206.3gb      137gb    343.3gb           60 111.22.10.118 172.16.10.118 node3
    83      373.6gb   407.4gb     83.5gb      491gb           82 111.22.10.120 172.16.10.120 node4
    83      351.1gb   384.7gb    106.3gb      491gb           78 111.22.10.122 172.16.10.122 node5

node2 crashed again not long ago, and shards are currently being reallocated onto it.

Thanks a lot for all the help so far!

This node looks overloaded to the point of collapse. An excerpt from the logs:

[2020-05-12T10:44:00,469][WARN ][o.e.g.PersistedClusterStateService] [node2] writing cluster state took [33900ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [138] unchanged indices
[2020-05-12T10:44:08,615][INFO ][o.e.i.e.I.EngineMergeScheduler] [node2] [auditbeat-000008][0] now throttling indexing: numMergesInFlight=8, maxNumMerges=7
[2020-05-12T10:45:04,356][WARN ][o.e.g.PersistedClusterStateService] [node2] writing cluster state took [19344ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [138] unchanged indices
[2020-05-12T10:46:05,078][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node2] collector [node_stats] timed out when collecting data
[2020-05-12T10:46:26,963][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node2] collector [node_stats] timed out when collecting data
[2020-05-12T10:48:20,556][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node2] collector [node_stats] timed out when collecting data
[2020-05-12T10:48:41,292][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node2] collector [node_stats] timed out when collecting data
[2020-05-12T10:49:03,119][WARN ][o.e.t.TransportService   ] [node2] Received response for a request that has timed out, sent [14799ms] ago, timed out [2548ms] ago, action [internal:coordination/fault_detection/leader_check], node [{node1}{OSTSZK4sTWS8GNRudXxehQ}{4qkI2Ik5RvKMr-pEG32P-Q}{111.22.10.100}{111.22.10.100:9300}{m}{xpack.installed=true}], id [8231296]
[2020-05-12T10:52:00,572][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node2] collector [node_stats] timed out when collecting data
[2020-05-12T10:52:10,574][ERROR][o.e.x.m.c.n.NodeStatsCollector] [node2] collector [node_stats] timed out when collecting data
[2020-05-12T10:52:36,823][WARN ][o.e.g.PersistedClusterStateService] [node2] writing cluster state took [14745ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [138] unchanged indices
[2020-05-12T10:53:08,099][INFO ][o.e.i.e.I.EngineMergeScheduler] [node2] [auditbeat-000008][0] stop throttling indexing: numMergesInFlight=6, maxNumMerges=7
[2020-05-12T10:53:55,369][WARN ][o.e.g.PersistedClusterStateService] [node2] writing cluster state took [76158ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [138] unchanged indices
[2020-05-12T10:55:04,263][WARN ][o.e.g.PersistedClusterStateService] [node2] writing cluster state took [62288ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [138] unchanged indices
[2020-05-12T10:56:00,339][WARN ][o.e.m.j.JvmGcMonitorService] [node2] [gc][656669] overhead, spent [3.4s] collecting in the last [4s]
[2020-05-12T10:56:05,597][WARN ][o.e.g.PersistedClusterStateService] [node2] writing cluster state took [61418ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [2] indices and skipped [137] unchanged indices
[2020-05-12T10:56:28,616][WARN ][o.e.g.PersistedClusterStateService] [node2] writing cluster state took [22924ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [138] unchanged indices

This small excerpt contains multiple reports that it took over a minute to write the cluster state to disk (probably only a few MB), a GC pause of over 3 seconds, and some throttling of indexing because merging cannot keep up. Maybe its disk is inadequate or failing, or maybe it's sharing storage with another heavyweight process, or maybe you're just asking too much of it.
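If you want to corroborate that from within Elasticsearch first, the node stats API exposes filesystem usage and, on Linux, cumulative disk I/O counters; checking something along these lines a couple of minutes apart gives a rough idea of how busy the device is (just a pointer, the exact fields vary a bit by version):

GET _nodes/node2/stats/fs?human

But really, iostat or similar on the host itself will tell you more.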


Hi @DavidTurner, thanks for the answer!

Indeed, it does look like it's collapsing. What I don't understand is that the server runs only Elasticsearch, isn't used for anything else, and its metrics always seem to stay within acceptable bounds.

Here are the metrics for the last 4 hours, during which 3 crashes occurred:

So, would it be correct to assume that the node is crashing due to something Elasticsearch-related? If so, what could I check to find the root cause?

To reiterate, it's taking over a minute to write a small number of tiny files to disk. That doesn't sound Elasticsearch-related to me; it sounds like a failing disk.

Hi @DavidTurner,

Sorry for the late reply, but following your advice we completely replaced node2 and moved it elsewhere, and I was waiting for everything to settle down before giving any feedback.

Unfortunately, looking at what's been happening since then, it didn't solve our problem: now both node2 and node3 are crashing (the Elasticsearch nodes, that is; the servers themselves are not crashing at all).

Here are the metrics of the last 12h for node2:


and for node3:


Any idea how to continue to investigate what's happening?

Thanks again for all the help!

What sort of hardware does this run on?


It's a Linux VM on local servers, and none of the other VMs are showing any problems.

Here are the logs from node3 (there are hundreds of lines before and after this part, but they all show the same "failed to execute pipeline" error):

and here are the logs from node2:

Your nodes definitely look congested: your thread pools are filling up and rejecting requests, then you get into really bad GC cycles, and then cluster state updates time out and things fall over.
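If you want to watch that happening, the cat thread pool API shows the active, queued and rejected counts per pool (adjust the columns to taste):

GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected&s=node_name

Steadily climbing rejected counters on the write or search pools are a good sign the nodes can't keep up.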

Can you post the output of _cat/nodes?v and _cat/allocation?v? David might have something else to offer.


These nodes look to have some really serious performance issues:

[2020-05-15T08:26:08,471][WARN ][o.e.g.PersistedClusterStateService] [node3] writing cluster state took [91083ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [0] indices and skipped [144] unchanged indices

In this cluster state update nothing changed except the version (and maybe some things that don't get written out), so it's a very small update, and yet it took over 90 seconds to write it to disk. I think you'll need to work with your local sysadmin folks to investigate that further; I don't see a way to get Elasticsearch to work acceptably if your disks are this slow.


Thanks for the quick answers and all the help so far.

So, here's the output of GET _cat/nodes?v&s=name:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
111.22.10.100           21          94  12    0.16    0.29     0.23 m         *      node1
111.22.10.129           60          97   7    1.49    4.29     4.04 dilm      -      node2
111.22.10.118           42          97   8   14.70   11.43    10.86 dilm      -      node3
111.22.10.120           40          96   4    0.11    0.08     0.10 di        -      node4
111.22.10.122           55          96   3    0.27    0.20     0.24 di        -      node5

And the output from GET _cat/allocation?v&s=node:

shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node
    67                                                                                       UNASSIGNED
    43      143.7gb   179.4gb    163.9gb    343.3gb           52 111.22.10.129 111.22.10.129 node2
     4        118gb     175gb    168.3gb    343.3gb           50 111.22.10.118 111.22.10.118 node3
    76      181.8gb   214.5gb    276.4gb      491gb           43 111.22.10.120 111.22.10.120 node4
    86      175.8gb   208.6gb    282.3gb      491gb           42 111.22.10.122 111.22.10.122 node5

Thanks for your answer!

I've also been doing that in parallel, but without much success so far. I'll go back to them, then.

But one question I still have: could the slowness of the servers actually be caused by something Elasticsearch-related? Or is that something you can completely rule out based on the logs and other information I sent earlier?

I don't want to say it's completely impossible, because computers are hard and bugs do happen, but I can't think of a way that Elasticsearch could be doing this to itself. I've seen other cases that look a lot like this and each one has turned out to be infrastructural in the end.

It's not an easy thing to check for certain, however. My best idea would be to break out strace and verify that Elasticsearch is waiting a long time on some IO-related syscalls, and also that it's not making an unreasonable number of such syscalls, but this is pretty low-level stuff and strace isn't particularly lightweight so might itself be disruptive to your cluster.
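A slightly less invasive first step, if you'd rather stay at the Elasticsearch level, is the hot threads API with the wait and block views; it only samples a few seconds, but if the busiest threads are consistently sitting in filesystem-related code it points the same way as the strace approach:

GET _nodes/node2/hot_threads?type=wait
GET _nodes/node2/hot_threads?type=block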


Well, thanks again for your time and all the help then.

I'll go back to them and try to fix this on the server side, and I'll report back here if and when we find a solution.

Cheers!


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.