Watcher Version conflict exception in Elasticsearch

r.ganeshbabu · January 7, 2016, 5:27am

Hi All,

I am having an ES cluster with 3 master nodes, 3 data nodes & 1 client node with ES version 1.7.3. I configured watcher for alerts mail notifications. I am getting email based on the status of the cluster health but In master log I can see some version conflict exception where thrown for watcher,

[2016-01-05 23:58:23,490][WARN ][watcher.execution ] [dayrfghde230_MASTER] failed to update watch status [cluster_health_watch_37102-2016-01-06T04:58:23.483Z]
org.elasticsearch.index.engine.VersionConflictEngineException: [.watches][0] [watch][cluster_health_watch]: version conflict, current [10156], provided [10155]
at org.elasticsearch.index.engine.Engine.getFromSearcher(Engine.java:231)
at org.elasticsearch.index.engine.InternalEngine.get(InternalEngine.java:246)
at org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:586)
at org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:193)
at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:104)
at org.elasticsearch.action.update.UpdateHelper.prepare(UpdateHelper.java:78)
at org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:176)
at org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:170)
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction$1.run(TransportInstanceSingleOperationAction.java:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Below is the watcher email configuration setup from master yml,
watcher.actions.email.service.account:
watcher_account:
profile: standard
smtp:
auth: false
starttls.enable: false
host: smarthost.enterprise.com
port: 25
user:
password:

What is the reason for getting version conflict warning exceptions in log?

Please guide me any suggestions,

Thanks,
Ganeshbabu R

spinscale · January 8, 2016, 12:52pm

Hey,

just wildly guessing here for now. Can you please check your logs?

is it possible that you had a master reelection happening while a watch was running? Is it possible that the master was maybe stuck in a GC (which results in the JVM hanging and not doing anything) and during that time a master reelection happened and the watch was executed on the newly elected master node and only then the other former master node came back to live and finished its execution?

As I said, this is just an outside guess based on the few information available. Would be nice if you told us, if there is anything in your log files, so we can dig deeper.

--Alex

r.ganeshbabu · January 11, 2016, 11:34am

Sorry for late reply @spinscale. Ya I can see in the data log Garbage Collection were happened continuously, Below is the log details,

[2016-01-07 06:04:58,446][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4334433][115392] duration [1.3s], collections [1]/[1.7s], total [1.3s]/[2.4h], memory [25.7gb]->[25gb]/[29.7gb], all_pools {[young] [1.4gb]->[8.8mb]/[2.1gb]}{[survivor] [127.2mb]->[274.5mb]/[274.5mb]}{[old] [24.2gb]->[24.7gb]/[27.3gb]}
[2016-01-07 06:30:04,117][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4335936][115510] duration [1.7s], collections [1]/[2.7s], total [1.7s]/[2.4h], memory [26.9gb]->[25.6gb]/[29.7gb], all_pools {[young] [2gb]->[21.1mb]/[2.1gb]}{[survivor] [257.5mb]->[274.5mb]/[274.5mb]}{[old] [24.5gb]->[25.3gb]/[27.3gb]}
[2016-01-07 13:47:34,282][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4362169][117458] duration [1.8s], collections [1]/[1.8s], total [1.8s]/[2.4h], memory [26.3gb]->[25.1gb]/[29.7gb], all_pools {[young] [2.1gb]->[17.4mb]/[2.1gb]}{[survivor] [98.7mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.8gb]/[27.3gb]}
[2016-01-07 13:49:33,147][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4362287][117464] duration [1.2s], collections [1]/[1.7s], total [1.2s]/[2.4h], memory [26.6gb]->[25.4gb]/[29.7gb], all_pools {[young] [2gb]->[21.7mb]/[2.1gb]}{[survivor] [274.5mb]->[274.5mb]/[274.5mb]}{[old] [24.3gb]->[25.1gb]/[27.3gb]}
[2016-01-07 14:23:06,623][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4364298][117581] duration [1.6s], collections [1]/[2.1s], total [1.6s]/[2.4h], memory [26.5gb]->[25.2gb]/[29.7gb], all_pools {[young] [2.1gb]->[17.9mb]/[2.1gb]}{[survivor] [256.2mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.9gb]/[27.3gb]}
[2016-01-07 16:21:16,204][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4371382][117864] duration [1.8s], collections [1]/[2.6s], total [1.8s]/[2.4h], memory [26.2gb]->[24.8gb]/[29.7gb], all_pools {[young] [2.1gb]->[15.7mb]/[2.1gb]}{[survivor] [243.5mb]->[274.5mb]/[274.5mb]}{[old] [23.9gb]->[24.6gb]/[27.3gb]}
[2016-01-07 16:24:19,262][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4371564][117879] duration [807ms], collections [1]/[1.5s], total [807ms]/[2.4h], memory [25.3gb]->[24.5gb]/[29.7gb], all_pools {[young] [1gb]->[4.7mb]/[2.1gb]}{[survivor] [221.7mb]->[274.5mb]/[274.5mb]}{[old] [24gb]->[24.2gb]/[27.3gb]}
[2016-01-07 16:25:14,060][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4371618][117882] duration [921ms], collections [1]/[1.7s], total [921ms]/[2.4h], memory [26.2gb]->[24.6gb]/[29.7gb], all_pools {[young] [2gb]->[6.3mb]/[2.1gb]}{[survivor] [243.9mb]->[274.5mb]/[274.5mb]}{[old] [23.9gb]->[24.3gb]/[27.3gb]}
[2016-01-07 16:52:58,336][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4373281][117971] duration [764ms], collections [1]/[1.4s], total [764ms]/[2.4h], memory [25.9gb]->[24.6gb]/[29.7gb], all_pools {[young] [1.7gb]->[38.2mb]/[2.1gb]}{[survivor] [47.4mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.3gb]/[27.3gb]}
[2016-01-07 16:53:47,266][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4373329][117977] duration [966ms], collections [1]/[1.9s], total [966ms]/[2.4h], memory [26gb]->[24.7gb]/[29.7gb], all_pools {[young] [1.7gb]->[2.2mb]/[2.1gb]}{[survivor] [50.9mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.5gb]/[27.3gb]}
[2016-01-07 16:55:10,028][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4373411][117981] duration [1.6s], collections [1]/[1.7s], total [1.6s]/[2.4h], memory [26.3gb]->[25gb]/[29.7gb], all_pools {[young] [2.1gb]->[15.2mb]/[2.1gb]}{[survivor] [92.1mb]->[274.5mb]/[274.5mb]}{[old] [24gb]->[24.7gb]/[27.3gb]}
[2016-01-07 16:56:18,483][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4373479][117985] duration [920ms], collections [1]/[1.4s], total [920ms]/[2.4h], memory [25.8gb]->[24.6gb]/[29.7gb], all_pools {[young] [1.5gb]->[5.1mb]/[2.1gb]}{[survivor] [176.4mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.3gb]/[27.3gb]}

Thanks,
Ganeshbabu R

spinscale · January 12, 2016, 8:47am

Hey,

those GCs dont seem to run long enough to cause a master node reelection on first view, see the durations, never more than two seconds. Do you see a master node reelection in the logs?

--Alex

r.ganeshbabu · January 12, 2016, 11:54am

No Alex, I didn't see any master reelection were happened and I thoroughly checked the log files.

spinscale · January 13, 2016, 8:27am

Heya,

is there any chance you can share the config and the log files from all your cluster nodes around 2016-01-05 23:58 (plus minus a few minutes), so we can debug what is the underlying cause?

Generally speaking it looks as if a watch was executed, the execution stopped (this could happen because of a GC that stops the JVM or maybe because the watch itself was hanging, i.e. due to a network problem when executing a HTTP request). Then watch was successfully executed somewhere else and wrote into the watch status. This blocked watch was unblocked, tried to write into the statys as well, but found out there already has an entry been added inbetween, thus the version exception. This means your watches were running all the time.

--Alex

r.ganeshbabu · January 13, 2016, 10:27am

Thanks for your response @spinscale

How can i upload the log & config files in this forum?

I can see upload button but it only allows image file(jpeg, png).

Please let me know.

Regards,
Ganeshbabu R

spinscale · January 14, 2016, 8:51am

Hey,

This forum does not support such uploads. Using some pastebin service or create a gist might be best.

--Alex

Topic		Replies	Views
Watcher Version Error Elasticsearch elastic-stack-alerting	2	1604	July 6, 2017
Watcher VersionConflictEngineException Elasticsearch	1	990	April 17, 2017
Getting lots of watcher errors since upgrade to 6.3.1 Elasticsearch elastic-stack-alerting	18	1501	August 20, 2018
Failed to start watcher. please wait for the cluster to become ready or try to start Watcher manully Elasticsearch elastic-stack-alerting	8	4032	July 6, 2017
Illegal_state_exception is preventing deletion of watches Elasticsearch elastic-stack-alerting	21	4726	July 6, 2017

Watcher Version conflict exception in Elasticsearch

Related topics