Watcher Version conflict exception in Elasticsearch

alerting

(ganeshbabu) #1

Hi All,

I am having an ES cluster with 3 master nodes, 3 data nodes & 1 client node with ES version 1.7.3. I configured watcher for alerts mail notifications. I am getting email based on the status of the cluster health but In master log I can see some version conflict exception where thrown for watcher,

[2016-01-05 23:58:23,490][WARN ][watcher.execution ] [dayrfghde230_MASTER] failed to update watch status [cluster_health_watch_37102-2016-01-06T04:58:23.483Z]
org.elasticsearch.index.engine.VersionConflictEngineException: [.watches][0] [watch][cluster_health_watch]: version conflict, current [10156], provided [10155]
at org.elasticsearch.index.engine.Engine.getFromSearcher(Engine.java:231)
at org.elasticsearch.index.engine.InternalEngine.get(InternalEngine.java:246)
at org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:586)
at org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:193)
at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:104)
at org.elasticsearch.action.update.UpdateHelper.prepare(UpdateHelper.java:78)
at org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:176)
at org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:170)
at org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction$1.run(TransportInstanceSingleOperationAction.java:187)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

Below is the watcher email configuration setup from master yml,
watcher.actions.email.service.account:
watcher_account:
profile: standard
smtp:
auth: false
starttls.enable: false
host: smarthost.enterprise.com
port: 25
user:
password:

What is the reason for getting version conflict warning exceptions in log?

Please guide me any suggestions,

Thanks,
Ganeshbabu R


(Alexander Reelsen) #2

Hey,

just wildly guessing here for now. Can you please check your logs?

is it possible that you had a master reelection happening while a watch was running? Is it possible that the master was maybe stuck in a GC (which results in the JVM hanging and not doing anything) and during that time a master reelection happened and the watch was executed on the newly elected master node and only then the other former master node came back to live and finished its execution?

As I said, this is just an outside guess based on the few information available. Would be nice if you told us, if there is anything in your log files, so we can dig deeper.

--Alex


(ganeshbabu) #3

Sorry for late reply @spinscale. Ya I can see in the data log Garbage Collection were happened continuously, Below is the log details,

[2016-01-07 06:04:58,446][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4334433][115392] duration [1.3s], collections [1]/[1.7s], total [1.3s]/[2.4h], memory [25.7gb]->[25gb]/[29.7gb], all_pools {[young] [1.4gb]->[8.8mb]/[2.1gb]}{[survivor] [127.2mb]->[274.5mb]/[274.5mb]}{[old] [24.2gb]->[24.7gb]/[27.3gb]}
[2016-01-07 06:30:04,117][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4335936][115510] duration [1.7s], collections [1]/[2.7s], total [1.7s]/[2.4h], memory [26.9gb]->[25.6gb]/[29.7gb], all_pools {[young] [2gb]->[21.1mb]/[2.1gb]}{[survivor] [257.5mb]->[274.5mb]/[274.5mb]}{[old] [24.5gb]->[25.3gb]/[27.3gb]}
[2016-01-07 13:47:34,282][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4362169][117458] duration [1.8s], collections [1]/[1.8s], total [1.8s]/[2.4h], memory [26.3gb]->[25.1gb]/[29.7gb], all_pools {[young] [2.1gb]->[17.4mb]/[2.1gb]}{[survivor] [98.7mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.8gb]/[27.3gb]}
[2016-01-07 13:49:33,147][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4362287][117464] duration [1.2s], collections [1]/[1.7s], total [1.2s]/[2.4h], memory [26.6gb]->[25.4gb]/[29.7gb], all_pools {[young] [2gb]->[21.7mb]/[2.1gb]}{[survivor] [274.5mb]->[274.5mb]/[274.5mb]}{[old] [24.3gb]->[25.1gb]/[27.3gb]}
[2016-01-07 14:23:06,623][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4364298][117581] duration [1.6s], collections [1]/[2.1s], total [1.6s]/[2.4h], memory [26.5gb]->[25.2gb]/[29.7gb], all_pools {[young] [2.1gb]->[17.9mb]/[2.1gb]}{[survivor] [256.2mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.9gb]/[27.3gb]}
[2016-01-07 16:21:16,204][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4371382][117864] duration [1.8s], collections [1]/[2.6s], total [1.8s]/[2.4h], memory [26.2gb]->[24.8gb]/[29.7gb], all_pools {[young] [2.1gb]->[15.7mb]/[2.1gb]}{[survivor] [243.5mb]->[274.5mb]/[274.5mb]}{[old] [23.9gb]->[24.6gb]/[27.3gb]}
[2016-01-07 16:24:19,262][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4371564][117879] duration [807ms], collections [1]/[1.5s], total [807ms]/[2.4h], memory [25.3gb]->[24.5gb]/[29.7gb], all_pools {[young] [1gb]->[4.7mb]/[2.1gb]}{[survivor] [221.7mb]->[274.5mb]/[274.5mb]}{[old] [24gb]->[24.2gb]/[27.3gb]}
[2016-01-07 16:25:14,060][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4371618][117882] duration [921ms], collections [1]/[1.7s], total [921ms]/[2.4h], memory [26.2gb]->[24.6gb]/[29.7gb], all_pools {[young] [2gb]->[6.3mb]/[2.1gb]}{[survivor] [243.9mb]->[274.5mb]/[274.5mb]}{[old] [23.9gb]->[24.3gb]/[27.3gb]}
[2016-01-07 16:52:58,336][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4373281][117971] duration [764ms], collections [1]/[1.4s], total [764ms]/[2.4h], memory [25.9gb]->[24.6gb]/[29.7gb], all_pools {[young] [1.7gb]->[38.2mb]/[2.1gb]}{[survivor] [47.4mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.3gb]/[27.3gb]}
[2016-01-07 16:53:47,266][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4373329][117977] duration [966ms], collections [1]/[1.9s], total [966ms]/[2.4h], memory [26gb]->[24.7gb]/[29.7gb], all_pools {[young] [1.7gb]->[2.2mb]/[2.1gb]}{[survivor] [50.9mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.5gb]/[27.3gb]}
[2016-01-07 16:55:10,028][WARN ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4373411][117981] duration [1.6s], collections [1]/[1.7s], total [1.6s]/[2.4h], memory [26.3gb]->[25gb]/[29.7gb], all_pools {[young] [2.1gb]->[15.2mb]/[2.1gb]}{[survivor] [92.1mb]->[274.5mb]/[274.5mb]}{[old] [24gb]->[24.7gb]/[27.3gb]}
[2016-01-07 16:56:18,483][INFO ][monitor.jvm ] [dayrhefgmh002_PROD_DATA] [gc][young][4373479][117985] duration [920ms], collections [1]/[1.4s], total [920ms]/[2.4h], memory [25.8gb]->[24.6gb]/[29.7gb], all_pools {[young] [1.5gb]->[5.1mb]/[2.1gb]}{[survivor] [176.4mb]->[274.5mb]/[274.5mb]}{[old] [24.1gb]->[24.3gb]/[27.3gb]}

Thanks,
Ganeshbabu R


(Alexander Reelsen) #4

Hey,

those GCs dont seem to run long enough to cause a master node reelection on first view, see the durations, never more than two seconds. Do you see a master node reelection in the logs?

--Alex


(ganeshbabu) #5

No Alex, I didn't see any master reelection were happened and I thoroughly checked the log files.


(Alexander Reelsen) #6

Heya,

is there any chance you can share the config and the log files from all your cluster nodes around 2016-01-05 23:58 (plus minus a few minutes), so we can debug what is the underlying cause?

Generally speaking it looks as if a watch was executed, the execution stopped (this could happen because of a GC that stops the JVM or maybe because the watch itself was hanging, i.e. due to a network problem when executing a HTTP request). Then watch was successfully executed somewhere else and wrote into the watch status. This blocked watch was unblocked, tried to write into the statys as well, but found out there already has an entry been added inbetween, thus the version exception. This means your watches were running all the time.

--Alex


(ganeshbabu) #7

Thanks for your response @spinscale

How can i upload the log & config files in this forum?

I can see upload button but it only allows image file(jpeg, png).

Please let me know.

Regards,
Ganeshbabu R


(Alexander Reelsen) #8

Hey,

This forum does not support such uploads. Using some pastebin service or create a gist might be best.

--Alex


(system) #9