Elasticsearch 6.1.2 keeps going down: failed to update watch record

alerting

(manish joshi) #1

Hi All,

My Elasticsearch cluster, a 3-node data cluster with all nodes master-eligible, keeps going down at random. The logs show that it could not write to a watcher index.

Any help would be greatly appreciated.

[2018-01-31T20:03:36,645][ERROR][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to update watch record [KMWzyCa-SgKusHgnc9MrxA_xpack_license_expiration_290aa911-2e8b-441b-8872-219cfe05394d-2018-01-31T20:02:35.261Z]

The cluster is on 6.1.2 with the X-Pack and repository-s3 plugins installed. The nodes keep going down at random. Could someone help me with this? I have attached the log below.

[2018-01-31T18:32:40,701][INFO ][o.e.c.s.ClusterApplierService] [onepay-es-master3] added {{onepay-es-master2}{BNP894N1RfmGNhQZzIrgWQ}{aTCLugUdTnqmTs6zmNVVQQ}{onepay-es-m2.aws-cb-dqa.cb4good.com}{10.203.153.210:9300}{ml.machine_memory=16656961536, ml.max_open_jobs=20, ml.enabled=true},}, reason: apply cluster state (from master [master {onepay-es-master3}{YzuDNdK2RluPOw3tm-rv5A}{AHE6-ZGHSw21gAtQLEp1fw}{onepay-es-m3.aws-cb-dqa.cb4good.com}{10.203.156.83:9300}{ml.machine_memory=16656961536, ml.max_open_jobs=20, ml.enabled=true} committed version [174] source [zen-disco-node-join[{onepay-es-master2}{BNP894N1RfmGNhQZzIrgWQ}{aTCLugUdTnqmTs6zmNVVQQ}{onepay-es-m2.aws-cb-dqa.cb4good.com}{10.203.153.210:9300}{ml.machine_memory=16656961536, ml.max_open_jobs=20, ml.enabled=true}]]])
[2018-01-31T18:32:40,753][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [onepay-es-master3] updating number_of_replicas to [2] for indices [.security-6]
[2018-01-31T18:32:41,110][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [onepay-es-master3] [.security-6/xIXZLdJ8QPiCofjD923lyQ] auto expanded replicas to [2]
[2018-01-31T18:32:52,100][INFO ][o.e.c.r.a.AllocationService] [onepay-es-master3] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[onepaybt-1][0]] ...]).
[2018-01-31T18:34:35,823][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_xpack_license_expiration]
[2018-01-31T18:35:36,105][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_elasticsearch_version_mismatch]
[2018-01-31T18:35:36,357][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_elasticsearch_cluster_status]
[2018-01-31T18:53:19,123][INFO ][o.e.c.s.IndexScopedSettings] [onepay-es-master3] updating [index.search.slowlog.threshold.query.trace] from [-1] to [100ms]
[2018-01-31T18:53:19,142][INFO ][o.e.c.s.IndexScopedSettings] [onepay-es-master3] updating [index.search.slowlog.level] from [TRACE] to [trace]
[2018-01-31T18:53:19,221][INFO ][o.e.c.s.IndexScopedSettings] [onepay-es-master3] updating [index.search.slowlog.threshold.query.trace] from [-1] to [100ms]
[2018-01-31T18:53:19,222][INFO ][o.e.c.s.IndexScopedSettings] [onepay-es-master3] updating [index.search.slowlog.level] from [TRACE] to [trace]
[2018-01-31T19:08:35,785][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_logstash_version_mismatch]
[2018-01-31T19:14:35,641][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_kibana_version_mismatch]
[2018-01-31T19:14:35,741][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_elasticsearch_cluster_status]
[2018-01-31T19:17:35,767][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_elasticsearch_version_mismatch]
[2018-01-31T19:35:36,100][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_logstash_version_mismatch]
[2018-01-31T19:49:36,204][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_kibana_version_mismatch]
[2018-01-31T20:00:38,401][INFO ][o.e.m.j.JvmGcMonitorService] [onepay-es-master3] [gc][27154] overhead, spent [290ms] collecting in the last [1s]
[2018-01-31T20:01:38,189][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_kibana_version_mismatch]
[2018-01-31T20:02:45,744][ERROR][o.e.x.m.c.i.IndexStatsCollector] [onepay-es-master3] collector [index-stats] timed out when collecting data
[2018-01-31T20:02:56,166][ERROR][o.e.x.m.c.m.JobStatsCollector] [onepay-es-master3] collector [job_stats] timed out when collecting data
[2018-01-31T20:03:06,043][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_kibana_version_mismatch]
[2018-01-31T20:03:06,043][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_elasticsearch_cluster_status]
[2018-01-31T20:03:06,183][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [onepay-es-master3] collector [cluster_stats] timed out when collecting data
[2018-01-31T20:03:06,460][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_xpack_license_expiration]
[2018-01-31T20:03:06,460][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_logstash_version_mismatch]
[2018-01-31T20:03:07,285][WARN ][o.e.x.w.e.ExecutionService] [onepay-es-master3] failed to execute watch [KMWzyCa-SgKusHgnc9MrxA_elasticsearch_version_mismatch]
[2018-01-31T20:03:10,933][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [onepay-es-master3] failed to execute on node [_cwIruk_QhGmksnVxRiB2w]


(Alexander Reelsen) #2

Can you explain what you mean by 'going down'? Is the process killed, or is it in a ghost state? Can you also include all logs up to where the startup logs appear, to make sure nothing is missed?

Have you checked that the OOM killer is not killing the process (this can be checked via dmesg), just to rule that out?
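For example, something like this (just a sketch; the exact wording of the kernel message can vary) should show whether the kernel killed the process:

dmesg -T | grep -i -E 'out of memory|killed process'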


(manish joshi) #3

The process is killed by Linux.

This is what I see when running dmesg:
[82557.278581] Out of memory: Kill process 3068 (java) score 952 or sacrifice child
[82557.282325] Killed process 3068 (java) total-vm:20779748kB, anon-rss:15461568kB, file-rss:0kB, shmem-rss:0kB


(Alexander Reelsen) #4

How much memory have you configured? How much memory is available? Are there other services running?

The OOM killer usually just kills the process with the most memory allocated, which in this case is Elasticsearch. However, you need to find out what is consuming the rest of the memory. Are you running anything other than ES on that system?
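To check the configured heap and the memory available on the host, something like this works, assuming a package install with the default paths (paths and the endpoint below are just the usual defaults, adjust as needed):

grep -E -- '^-Xm[sx]' /etc/elasticsearch/jvm.options        # configured heap (-Xms/-Xmx)
free -m                                                     # total/available memory on the host
curl -s 'localhost:9200/_nodes/jvm?pretty' | grep heap_max  # heap the nodes actually started with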


(manish joshi) #5

The only thing running is Elasticsearch. However, what I noticed is that the node which is the current master goes down first. I think I have 12 GB of memory allocated to Elasticsearch. Does it have something to do with the watcher index?


(Alexander Reelsen) #6

If your node gets killed by the operating system, it is unlikely that this has anything to do with the watcher index.

You need to monitor that system and find out what else is eating up your memory (or if it is a memory leak within Elasticsearch).

Which Elasticsearch version is this?


(manish joshi) #7

Elasticsearch 6.1.2. I have nothing but Elasticsearch installed on the EC2 instance.


(Alexander Reelsen) #8

If you log into that system and call curl locally, does nothing get returned either? Just to rule out the network.
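For example (a plain health check against the local HTTP port; add -u user:password if X-Pack security is enabled):

curl -s 'http://localhost:9200/_cluster/health?pretty'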

Is there anything new logged once the system is in this broken state?

Have you checked the dmesg output and syslogs for anything suspicious?

It is pretty rare that the process just hangs without doing anything, thus the questions.


(manish joshi) #9

Yes, nothing gets returned when I do a curl. Nothing new gets logged, and dmesg just gives me the same output that I posted earlier in this thread. I'm also currently getting this error:

org.elasticsearch.script.ScriptException: runtime error
at org.elasticsearch.painless.PainlessScript.convertToScriptException(PainlessScript.java:101) ~[?:?]
at org.elasticsearch.painless.PainlessScript$Script.execute(ctx.vars.email_recipient = (ctx.payload.kibana_settings.hits.total > 0) ? ctx.payload.kibana_settings.hits.hits[0]._source.kibana_settings.xpack.default_admin_email : null;ctx.vars.is_new = ctx.vars.fails_check && !ctx.vars.not_resolved;ctx.vars.is_resolve ...:1070) ~[?:?]

I'm not executing any queries, and this instance is not being used. I just created the cluster and restored it from a snapshot; no inserts or reads are happening on this cluster.


(Alexander Reelsen) #10

OK, so can you add monitoring to your node and find out which process is eating the memory, and whether it is Elasticsearch or something else?

You could use something like

pidstat -r -p $PID_OF_ES  5

to monitor the memory consumption of Elasticsearch and write it to a file. It should stay roughly constant at what you configured as the heap size; if it keeps increasing, we have found the culprit.

pidstat is part of the sysstat package
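If it is not installed yet, it is usually available from the distribution's package manager (the package names below are assumptions about your distro):

sudo yum install -y sysstat      # RHEL / Amazon Linux
sudo apt-get install -y sysstat  # Debian / Ubuntu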


(Alexander Reelsen) #11

Hey,

actually, remove the -p parameter so that you log all processes; that should make it easier to find the offender, if there is one.
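So, roughly (just a sketch; the 5-second interval and the output file name are arbitrary):

nohup pidstat -r 5 > pidstat-memory.log 2>&1 &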

--Alex


(Alexander Reelsen) #12

Hey,

any update here so far?


(manish joshi) #13

Yep, so we had some JVM tuning done. I see that Elasticsearch is using about 40% of memory; under load the usage goes up, but garbage collection kicks in before it runs out of memory. I think what we were missing in our cluster was proper JVM tuning.
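For reference, a sketch of the kind of heap settings this tuning touches, assuming the stock /etc/elasticsearch/jvm.options from the package install (values are illustrative, not necessarily what we ended up with):

# /etc/elasticsearch/jvm.options -- keep -Xms and -Xmx equal,
# and at no more than half of the machine's physical RAM
-Xms8g
-Xmx8g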

Will post updates if I come across anything else.

Regards,
Manish


(Alexander Reelsen) #14

I don't think that JVM tuning will solve your issue here. You might address the symptoms, but not the underlying issue; hopefully it holds up long enough while running.


(Luo Huan Feng) #15

I have encountered the same problem... after rebooting the cluster with:
/etc/init.d/elasticsearch restart


(system) #16

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.