Elastic node hangs or stops after a few hours of load


#1

Hello,

After troubleshooting this problem for days, I decided to ask the Elasticsearch community for help. We are running a 5-node Elasticsearch cluster on Oracle Linux 7.4. All machines have 64 GB RAM, and each one has 10 disks of about 500 GB. The data is sent to Elasticsearch through a 3-node Graylog cluster.
The setup was working just fine when we had less load, around 1,000 msg/sec, mostly DNS-type messages. But when we added extra traffic of around 15,000 msg/sec (from firewalls), the trouble started.
I created a new index for these messages configured with 4 shards per index and 1 replica. It worked, but Elasticsearch could not process messages quickly enough, so a long queue was building up in Graylog.
I increased the configuration to 8 shards per index with 1 replica, which improved throughput, and Elasticsearch could process messages in time. But after this had been running for something like 5-10 hours, individual Elasticsearch processes just hung (the process was still there, but not responding, and only kill -9 stopped it) or they simply stopped out of the blue. Yesterday I also set replicas to 0. That seemed to work a bit longer, about 18 hours, but then the Elasticsearch process stopped again on one of the nodes.
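For reference, shard and replica settings of the kind described above can be applied to each new index through an index template along these lines (a sketch only: the template name and index pattern are invented for illustration; syntax is the 5.x `_template` API):

```shell
# Hypothetical template so that every new firewall index is created with
# 8 shards and 0 replicas (name "firewall-template" and pattern "firewall_*"
# are placeholders, not taken from the actual cluster).
curl -XPUT 'http://localhost:9200/_template/firewall-template' \
  -H 'Content-Type: application/json' -d '{
  "template": "firewall_*",
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 0
  }
}'
```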

I am sorry for the very long post, but I wanted to give as many details as possible. Thanks for any ideas.

All the time, even when processing is OK, there are messages like these in the Elasticsearch logs:

  1. GC messages: mostly INFO, rarely WARN.
    [2018-02-06T09:53:00,560][INFO ][o.e.m.j.JvmGcMonitorService] [RhfJyb1] [gc][18] overhead, spent [297ms] collecting in the last [1s]
    [2018-02-06T10:01:16,006][INFO ][o.e.m.j.JvmGcMonitorService] [RhfJyb1] [gc][503] overhead, spent [289ms] collecting in the last [1.1s]

Each Elasticsearch process is using a 30 GB heap, so that cannot be increased.

  2. "All shards failed" errors:
    [2018-02-06T00:01:12,260][DEBUG][o.e.a.s.TransportSearchAction] [RhfJyb1] All shards failed for phase: [query]
    org.elasticsearch.ElasticsearchException$1: value source config is invalid; must have either a field context or a script or marked as unwrapped
    at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:618) ~[elasticsearch-5.6.6.jar:5.6.6]
    at ..

Caused by: java.lang.IllegalStateException: value source config is invalid; must have either a field context or a script or marked as unwrapped
at org.elasticsearch.search.aggregations.support.ValuesSourceConfig.toValuesSource(ValuesSourceConfig.java:227) ~[elasticsearch-5.6.6.jar:5.6.6]

  3. DEBUG messages: TransportSearchAction and RemoteTransportException
    [2018-02-06T00:01:12,235][DEBUG][o.e.a.s.TransportSearchAction] [RhfJyb1] [srx_49][6], node[aJ5VQPYbQZmCuVMkbSPQLQ], [P], s[STARTED], a[id=ur1bjvESQOixmyGuCeWtMg]
    : Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[srx_49], indicesOptions=IndicesOptions[id=39, ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true], types=[message], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=25, batchedReduceSize=512, preFilterShardSize=64, source={
    "from" : 0,
    "query" : {
      "bool" : {
        "must" : [
          { "match_all" : { "boost" : 1.0 } }
        ],
        "filter" : [
          {
            "bool" : {
              "must" : [
                {
                  "range" : {
                    "timestamp" : {
                      "from" : "2018-02-05 23:00:12.227",
                      "to" : "2018-02-05 23:01:12.227",
                      "include_lower" : true,
                      "include_upper" : true,
                      "boost" : 1.0
                    }
                  }
                },
                {
                  "query_string" : {
                    "query" : "streams:59dce05bac68b10ce707c2c1",
                    "fields" : [ ],
                    "use_dis_max" : true,
                    "tie_breaker" : 0.0,
                    "default_operator" : "or",
                    "auto_generate_phrase_queries" : false,
                    "max_determinized_states" : 10000,
                    "enable_position_increments" : true,
                    "fuzziness" : "AUTO",
                    "fuzzy_prefix_length" : 0,
                    "fuzzy_max_expansions" : 50,
                    "phrase_slop" : 0,
                    "escape" : false,
                    "split_on_whitespace" : true,
                    "boost" : 1.0
                  }
                }
              ],
              "disable_coord" : false,
              "adjust_pure_negative" : true,
              "boost" : 1.0
            }
          }
        ],
        "disable_coord" : false,
        "adjust_pure_negative" : true,
        "boost" : 1.0
      }
    },
    "aggregations" : {
      "gl2_filter" : {
        "filter" : { "match_all" : { "boost" : 1.0 } },
        "aggregations" : {
          "gl2_terms" : {
            "terms" : {
              "size" : 100,
              "min_doc_count" : 1,
              "shard_min_doc_count" : 0,
              "show_term_doc_count_error" : false,
              "order" : [
                { "_count" : "desc" },
                { "_term" : "asc" }
              ]
            }
          }
        }
      },
      "missing" : {
        "missing" : { }
      }
    }
    }}] lastShard [true]
    org.elasticsearch.transport.RemoteTransportException: [aJ5VQPY][10.16.10.208:9300][indices:data/read/search[phase/query]]
    Caused by: java.lang.IllegalStateException: value source config is invalid; must have either a field context or a script or marked as unwrapped
    at org.elasticsearch.search.aggregations.support.ValuesSourceConfig.toValuesSource(ValuesSourceConfig.java:227) ~[elasticsearch-5.6.6.jar:5.6.6]
    at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:51) ~[elasticsearch-5.6.6.jar:5.6.6]

  4. This warning in the ._deprecation.log file: [2018-02-06T10:00:07,585][WARN ][o.e.d.i.m.TypeParsers ] Expected a boolean [true/false] for property [index] but got [not_analyzed]

  5. I installed the ElasticHQ tool, which monitors cluster status. I will try to attach a screenshot from it here.

  6. I also did a heap dump of one of the Elasticsearch processes, but have yet to analyze it. I also have ...


(Alexander Reelsen) #2

Your terms aggregation specifies neither a field to aggregate on nor a script, hence this error message:

 value source config is invalid; must have either a field context or a script or marked as unwrapped

Can you try adding a field?
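For illustration only, a terms aggregation that names a field looks like this (the index and field name here are placeholders, not taken from your setup; the field should be a non-analyzed/keyword field):

```shell
# Minimal sketch of a terms aggregation with an explicit "field"
# ("graylog_deflector" and "source" are placeholder names).
curl -XGET 'http://localhost:9200/graylog_deflector/_search?size=0' \
  -H 'Content-Type: application/json' -d '{
  "aggregations": {
    "gl2_terms": {
      "terms": { "field": "source", "size": 100 }
    }
  }
}'
```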


#3

spinscale, thanks a lot for your answer. To be honest, I had never heard of aggregations in the Elasticsearch context, but I see there is a lot of documentation about them and I am reading it now. I don't have any customisations for aggregations; everything is default. Regarding your hint, where should I add the field? Can you give me a pointer or example? :-)


(Alexander Reelsen) #4

Well, then you need to find out who is firing the query you pasted above. Someone is hitting your cluster with it, and you need to know who (or which application). It does not look like a Kibana query to me, but I may be wrong here, so don't rule that out completely.


(Alexander Reelsen) #5

In addition, I am not sure this is the cause of your nodes hanging.

What exactly do you mean by the system not responding? Can you run curl against it, or does that hang as well? Have you checked the system's log files for a high amount of garbage collection? Is only one system affected, or all of them?
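As a starting point, the next time it happens you could capture something like the following from the affected node (standard cluster APIs; adjust host/port to your setup):

```shell
# Does the node still answer HTTP at all?
curl -s 'http://localhost:9200/_cluster/health?pretty'

# What are the busiest threads doing? Worth capturing while the node hangs.
curl -s 'http://localhost:9200/_nodes/hot_threads'
```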


#6

I would say the query must be coming from Graylog, which is connected to this Elasticsearch cluster. But we are still in a kind of test mode, so practically no queries are run by users; our first goal is stable storage of all incoming messages, and we will move to queries later. For more detail: this cluster, with a very similar setup, was running for about 3 months, but with a lower load of about 1,000 msg/sec. Now the load is ten times that or more.

Since yesterday I made one other change: on all Elasticsearch nodes I executed sudo swapoff -a,
and now it really doesn't use any swap at all, as also shown by ElasticHQ. Since that change there are somewhat fewer GC INFO and WARN messages. Yesterday we got just one GC event, at 19:44.

[2018-02-06T19:44:27,446][WARN ][o.e.m.j.JvmGcMonitorService] [RhfJyb1] [gc][35315] overhead, spent [1.6s] collecting in the last [2.5s]
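To make this survive a reboot, swap can also be disabled persistently, or Elasticsearch can be told to lock its heap in RAM; a sketch, assuming the usual default paths:

```shell
# Keep swap off across reboots by commenting out swap entries in /etc/fstab
# (edits the file in place and keeps a .bak backup; review before rebooting).
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab

# Alternatively, lock the Elasticsearch heap in memory: in
# /etc/elasticsearch/elasticsearch.yml set
#   bootstrap.memory_lock: true
# and allow the service unlimited locked memory (LimitMEMLOCK=infinity
# in the systemd unit, or "elasticsearch - memlock unlimited" in limits.conf).
```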

Three hours later this node's process (the master) stopped responding, even to curl.

First, below are the messages from the log of another node (.208) around the time the master (.204) stopped responding.
After that are messages from the new master (.207), which continue for a minute or two longer but then also stop, basically at 22:45. It looks like the move of the master role from .204 to .207 did not fully succeed, because I still see .204 mentioned as "master" in the new master's log files:

.208 log:


[2018-02-06T22:41:48,616][WARN ][o.e.d.z.ZenDiscovery ] [aJ5VQPY] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{Xu4ewOU}{Xu4ewOU_QwatqMKEXE8ofw}{OLo23LSVTp6bt1hGO2jnYA}..

[2018-02-06T22:43:53,262][INFO ][o.e.c.s.ClusterService ] [aJ5VQPY] removed {{RhfJyb1}{RhfJyb1WR-aHEUL-D-FJ-w}{Cr5o0JPLQOyNP4g_xT7sSg}{10.16.10.204}{10.16.10.204:9300},}, reason: zen-disco-receive(from master [master {Xu4ewOU}{Xu4ewOU_QwatqMKEXE8ofw}{OLo23LSVTp6bt1hGO2jnYA}{10.16.10.207}{10.16.10.207:9300} committed version [2900]])

[2018-02-06T22:43:59,739][DEBUG][o.e.a.a.i.g.TransportGetIndexAction] [aJ5VQPY] timed out while retrying [indices:admin/get] after failure (timeout [30s])[10.16.10.204:9300][indices:admin/get] disconnected
[2018-02-06T22:43:59,739][WARN ][r.suppressed ] path: /graylog_deflector/_settings, params: {index=graylog_deflector, type=_settings}
org.elasticsearch.discovery.MasterNotDiscoveredException: NodeDisconnectedException[[RhfJyb1][10.16.10.204:9300][indices:admin/get] disconnected]


Log from the new master candidate, .207:


[2018-02-06T22:42:19,559][DEBUG][o.e.a.a.i.g.TransportGetIndexAction] [Xu4ewOU] no known master node, scheduling a retry
[2018-02-06T22:43:52,482][INFO ][o.e.c.r.a.AllocationService] [Xu4ewOU] Cluster health status changed from [GREEN] to [RED] (reason: [{RhfJyb1}{RhfJyb1WR-aHEUL-D-FJ-w}{Cr5o0JPLQOyNP4g_xT7sSg}{10.16.10.204}{10.16.10.204:9300} failed to ping, tried [3] times, each with maximum [30s] timeout]).

[2018-02-06T22:43:52,488][INFO ][o.e.c.s.ClusterService ] [Xu4ewOU] removed {{RhfJyb1}{RhfJyb1WR-aHEUL-D-FJ-w}{Cr5o0JPLQOyNP4g_xT7sSg}{10.16.10.204}{10.16.10.204:9300},}, reason: zen-disco-node-failed({RhfJyb1}{RhfJyb1WR-aHEUL-D-FJ-w}{Cr5o0JPLQOyNP4g_xT7sSg}{10.16.10.204}
{10.16.10.204:9300}), reason(failed to ping, tried [3] times, each with maximum [30s] timeout)[{RhfJyb1}{RhfJyb1WR-aHEUL-D-FJ-w}{Cr5o0JPLQOyNP4g_xT7sSg}{10.16.10.204}{10.16.10.204:9300} failed to ping, tried [3] times, each with maximum [30s] timeout]
...
[2018-02-06T22:43:58,731][WARN ][o.e.c.a.s.ShardStateAction] [Xu4ewOU] [graylog_65][1] received shard failed for shard id [[graylog_65][1]], allocation id [jzCXwGxQSNysMZPk8zqBNA], primary term [2], message [mark copy as stale]
....
[2018-02-06T22:43:59,732][WARN ][o.e.c.a.s.ShardStateAction] [Xu4ewOU] [win_467][2] received shard failed for shard id [[win_467][2]], allocation id [S220jxkUQ_WhNEm_DYmhrA], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [win_467][2], node[RhfJyb1WR-aHEUL-D-FJ-w], [R], s[STARTED], a[id=S220jxkUQ_WhNEm_DYmhrA]], failure [NodeDisconnectedException[[RhfJyb1][10.16.10.204:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [RhfJyb1][10.16.10.204:9300][indices:data/write/bulk[s][r]] disconnected
...
[2018-02-06T22:44:06,810][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [Xu4ewOU] failed to execute on node [RhfJyb1WR-aHEUL-D-FJ-w]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [RhfJyb1][10.16.10.204:9300][cluster:monitor/nodes/stats[n]] request_id [724611] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:961) [elasticsearch-5.6.6.jar:5.6.6]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.6.jar:5.6.6]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_144]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_144]
at java.lang.Thread.run(Unknown Source) [?:1.8.0_144]
[2018-02-06T22:44:07,180][DEBUG][o.e.a.a.c.h.TransportClusterHealthAction] [Xu4ewOU] connection exception while trying to forward request with action name [cluster:monitor/health] to master node [{RhfJyb1}{RhfJyb1WR-aHEUL-D-FJ-w}{Cr5o0JPLQOyNP4g_xT7sSg}{10.16.10.204}{10.16.10.204:9300}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [RhfJyb1][192.168.10.204:9300][cluster:monitor/health] disconnected]
....
[2018-02-06T22:44:07,184][DEBUG][o.e.a.a.i.s.TransportIndicesStatsAction] [Xu4ewOU] failed to execute [indices:monitor/stats] on node [RhfJyb1WR-aHEUL-D-FJ-w]
org.elasticsearch.transport.NodeDisconnectedException: [RhfJyb1][10.16.10.204:9300][indices:monitor/stats[n]] disconnected
...
<and the last 2 messages in the new master .207 log>
[2018-02-06T22:44:08,059][WARN ][o.e.c.a.s.ShardStateAction] [Xu4ewOU] [win_467][3] received shard failed for shard id [[win_467][3]], allocation id [3uirSXTjSiO_J6v1e4_7Zw], primary term [2], message [mark copy as stale]
[2018-02-06T23:25:24,257][INFO ][o.e.c.m.MetaDataMappingService] [Xu4ewOU] [win_467/d3aNKFmTRMmfn4WHyfW_NQ] update_mapping [message]


#7

A similar hang just happened again; we can reproduce the problem quite often. I tried to connect to the process using jmap or jcmd, but it fails with "Unable to open socket file: target process not responding or HotSpot VM not loaded". I get the same error when using the -histo option. jmap -F seemed to work better, but after many minutes it returned another error:

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.tools.jmap.JMap.runTool(JMap.java:201)
at sun.tools.jmap.JMap.main(JMap.java:130)
Caused by: java.lang.ExceptionInInitializerError
...
Caused by: java.lang.RuntimeException: field "_bmStartWord" not found in type CMSBitMap
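A few fallbacks worth trying for those attach failures: the JVM attach mechanism only works when the tool runs as the same user as the Elasticsearch JVM, and even when attach is dead the JVM can sometimes still dump its threads on SIGQUIT (pid 12345 and the "elasticsearch" user below are placeholders for your actual values):

```shell
# Run the tool as the JVM's own user, otherwise attach fails with
# "Unable to open socket file":
sudo -u elasticsearch jstack 12345 > threads.txt

# SIGQUIT asks the JVM to print a thread dump to its own stdout/stderr;
# it does not kill the process:
kill -3 12345

# Forced-mode stack dump as a last resort (slow, like jmap -F):
sudo -u elasticsearch jstack -F 12345 > threads-forced.txt
```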


#8

It seems like the hanging process creates a kind of deadlock for the whole cluster. After the process on one of the nodes started hanging (and didn't respond to curl), Elasticsearch stopped processing messages and the logs on all other nodes stopped as well (even though the other nodes still answered curl). As soon as I killed the hanging process with kill -9, the cluster started processing again and the logs on the other nodes continued. Here is an example from the master node:

[2018-02-09T11:44:17,480][WARN ][o.e.c.a.s.ShardStateAction] [RhfJyb1] [smsc_87][1] received shard failed for shard id [[smsc_87][1]], allocation id [3ma2HSiXTROdvaapMNfI7g], primary term [2], message [mark copy as stale]
[2018-02-09T14:15:59,530][INFO ][o.e.c.m.MetaDataCreateIndexService] [RhfJyb1] [srx_66] creating index, cause [api], templates [srx-template], shards [8]/[0], mappings [message]
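Since the other nodes still answer curl while the cluster is stalled, their indexing thread pools can be inspected from any healthy node; something along these lines (standard `_cat` API, host/port are placeholders):

```shell
# Bulk (indexing) thread pool state on every node; a full queue or a growing
# "rejected" count on a single node points at where indexing is stuck.
curl -s 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected'
```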


#9

I found information that Elasticsearch is only supported on "Oracle Enterprise Linux 6/7 with RHEL Kernel only". I checked our systems and we have a different kernel, the "Unbreakable Enterprise Kernel". Are there any known problems for users running on this kernel and Oracle Linux?
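For anyone checking the same thing, the running kernel flavour shows up in the release string:

```shell
# UEK kernels contain "uek" in the release string; the Red Hat Compatible
# Kernel looks like the stock RHEL release (e.g. 3.10.0-693.el7.x86_64).
uname -r
```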


#10

After changing the kernel, Elasticsearch has been running without a hang for more than 2 days, so it looks like the kernel was the reason the Elasticsearch process hung.


(system) #11

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.