Watcher throwing thread pool capacity error

alerting

(Sumit Arora) #1

Hello Folks,

We're running the Watcher beta version with a few watches configured. After running for a few days, we started getting a "failed to run triggered watch [someID] due to thread pool capacity" error, which stopped watcher execution entirely. None of the watches are getting triggered due to this issue. Help in this regard would be greatly appreciated.

Thanks
Sumit


(Martijn Van Groningen) #2

Hi Sumit,

A few questions:

  • How many watches have you configured?
  • How often are these watches configured to trigger?
  • What is the nature of your watches? For example, are they configured to send emails?
  • Can you share the output of the watch info api:

curl 'localhost:9200/_watcher?pretty'

Then we can see which version you're running.

  • Can you share the output of the watch stats api:

curl 'localhost:9200/_watcher/stats?_all&pretty&emit_stacktraces'

If the problem is still occurring we can see what watch is potentially causing this problem.

Martijn


(Sumit Arora) #3

Thanks Martijn for the quick reply.

Please find inline answers to your questions below.

  • How many watches have you configured? Seven watches in total.
  • How often are these watches configured to trigger? One about every 15s (the one which threw the error initially, see below) and the remaining 6 watches every 1m.
  • What is the nature of your watches? For example, are they configured to send emails? Yes, all watches are configured to send mail.
  • Can you share the output of the watch info api:

GET /_watcher?pretty
{
  "version": {
    "name": "1.0.0-Beta1",
    "number": "1.0.0.Beta1",
    "build_hash": "b4bc48a200d93296b4731641f6dbdf81feb46cbf",
    "build_timestamp": "2015-05-20T12:29:28Z",
    "build_snapshot": false
  }
}

Can you share the output of the watch stats api:

curl 'localhost:9200/_watcher/stats?_all&pretty&emit_stacktraces'

{
  "watcher_state": "started",
  "watch_count": 7,
  "execution_queue": {
    "size": 1000,
    "max_size": 40
  }
}

Below is the watch that started giving the problem initially; we got the error through the watch history index. After this we could see failed states for all the other watches.

{
  "_index": ".watch_history-2015.07.01",
  "_type": "watch_record",
  "_id": "82b8ab81-a5e8-4ebd-ba89-32f91ea79a41_419-2015-07-01T01:18:55.196Z",
  "_score": null,
  "_source": {
    "watch_id": "82b8ab81-a5e8-4ebd-ba89-32f91ea79a41",
    "trigger_event": {
      "schedule": {
        "triggered_time": "2015-07-01T01:18:55.196Z",
        "scheduled_time": "2015-07-01T01:18:55.139Z"
      }
    },
    "input": {
      "search": {
        "request": {
          "search_type": "query_then_fetch",
          "indices": [
            "workflow_test_connectors"
          ],
          "types": [
            "processingworkflowbatchrequests"
          ],
          "body": {
            "aggs": {
              "status_count": {
                "filter": {
                  "term": {
                    "workflowRequestId": "82b8ab81"
                  }
                },
                "aggs": {
                  "status_num": {
                    "terms": {
                      "field": "status"
                    }
                  }
                }
              }
            }
          }
        }
      }
    },
    "condition": {
      "always": {}
    },
    "state": "failed",
    "message": "failed to run triggered watch [82b8ab81-a5e8-4ebd-ba89-32f91ea79a41] due to thread pool capacity"
  },
  "fields": {
    "trigger_event.schedule.triggered_time": [
      1435713535196
    ],
    "execution_result.execution_time": [
      null
    ],
    "trigger_event.schedule.scheduled_time": [
      1435713535139
    ]
  },
  "sort": [
    1435713535196
  ]
}

Thanks,
Sumit


(Martijn Van Groningen) #4

Unfortunately the watch stats api doesn't give the output I hoped to see. Can you upgrade to the latest Watcher release (1.0)? The watch stats api has been updated between the beta1 and the 1.0 release to include more information.

If you upgrade, be sure to follow the upgrade instructions (deleting existing watches etc.), otherwise Watcher can't start properly.

I think some watch executions are stuck, and this blocks other watch executions. If you upgrade to 1.0 the watch stats api can confirm this. Also, the 1.0 release is a bit more aggressive towards operations that hang for a long time / forever (for example, a search input that takes too long to execute will be timed out).


#5

Hi @mvg, I work with @sumithub and thought I would try to get a response from you before the weekend.

In regards to upgrading, I'm sure that will be fine; however, there is one watch that can't be deleted from one of the nodes. Is there a way to force delete a watch or something? I don't have the error message we get when trying to delete that watch, but we can add it to the thread on Monday if you need to see it.


(Martijn Van Groningen) #6

You can force delete a watch that is currently being used with the force option:
DELETE _watcher/watch/my-watch?force

Alternatively, you can stop all the nodes, uninstall Watcher on all nodes, start the nodes again, and delete all Watcher-related indices.
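For the last step, the cleanup could look roughly like the sketch below. The index names are an assumption on my part and may differ per Watcher version, so verify what actually exists on your cluster first:

```
# list the Watcher-related indices present on the cluster
curl 'localhost:9200/_cat/indices/.watch*'

# then delete them (assumed names; adjust to what the previous call returned)
curl -XDELETE 'localhost:9200/.watches'
curl -XDELETE 'localhost:9200/.watch_history-*'
```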


(Sumit Arora) #7

Hi Martijn,

After upgrading to the 1.0.0 version and deleting all old watcher indices, the thread pool error is gone.

Now we're faced with another exception for the watch below. The error is repeated every minute.

Error:

failed to execute action [node_event/send_email]
org.elasticsearch.common.mustache.MustacheException: Index: 0, Size: 0@[query-template:1]
at org.elasticsearch.common.mustache.codes.DefaultCode.get(DefaultCode.java:107)
at org.elasticsearch.common.mustache.codes.ValueCode.execute(ValueCode.java:55)
at org.elasticsearch.common.mustache.codes.DefaultMustache.run(DefaultMustache.java:30)
at org.elasticsearch.common.mustache.codes.DefaultCode.execute(DefaultCode.java:126)
at org.elasticsearch.common.mustache.codes.DefaultCode.execute(DefaultCode.java:115)
at org.elasticsearch.watcher.support.template.xmustache.XMustacheScriptEngineService$MustacheExecutableScript.run(XMustacheScriptEngineService.java:183)
at org.elasticsearch.watcher.support.template.xmustache.XMustacheTemplateEngine.render(XMustacheTemplateEngine.java:51)
at org.elasticsearch.watcher.actions.email.service.EmailTemplate.render(EmailTemplate.java:121)
at org.elasticsearch.watcher.actions.email.ExecutableEmailAction.execute(ExecutableEmailAction.java:60)
at org.elasticsearch.watcher.actions.ActionWrapper.execute(ActionWrapper.java:104)
at org.elasticsearch.watcher.execution.ExecutionService.executeInner(ExecutionService.java:379)
at org.elasticsearch.watcher.execution.ExecutionService.execute(ExecutionService.java:271)
at org.elasticsearch.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:417)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at org.elasticsearch.common.collect.Iterables.get(Iterables.java:728)
at org.elasticsearch.watcher.support.template.xmustache.XMustacheFactory$CollectionMap.get(XMustacheFactory.java:137)
at org.elasticsearch.watcher.support.template.xmustache.XMustacheFactory$CollectionMap.containsKey(XMustacheFactory.java:146)
at org.elasticsearch.common.mustache.reflect.ReflectionObjectHandler.findWrapper(ReflectionObjectHandler.java:125)
at org.elasticsearch.common.mustache.reflect.ReflectionObjectHandler.find(ReflectionObjectHandler.java:72)
at org.elasticsearch.common.mustache.reflect.GuardedBinding.getWrapper(GuardedBinding.java:94)
at org.elasticsearch.common.mustache.reflect.GuardedBinding.createAndGet(GuardedBinding.java:79)
at org.elasticsearch.common.mustache.reflect.GuardedBinding.get(GuardedBinding.java:74)
at org.elasticsearch.common.mustache.codes.DefaultCode.get(DefaultCode.java:105)
... 15 more

Output from stats:
GET /_watcher/stats?pretty
{
  "watcher_state": "started",
  "watch_count": 6,
  "execution_thread_pool": {
    "queue_size": 0,
    "max_size": 40
  }
}

PUT _watcher/watch/node_event/execute/_ack

{
  "_status": {
    "last_checked": "2015-07-14T23:06:43.406Z",
    "last_met_condition": "2015-07-14T23:06:43.406Z",
    "actions": {
      "send_email": {
        "ack": {
          "timestamp": "2015-07-14T23:05:43.039Z",
          "state": "awaits_successful_execution"
        },
        "last_execution": {
          "timestamp": "2015-07-14T23:06:43.406Z",
          "successful": false,
          "reason": "MustacheException[Index: 0, Size: 0@[query-template:1]]; nested: IndexOutOfBoundsException[Index: 0, Size: 0]; "
        }
      }
    }
  }
}

The watch is from the Elasticsearch use case example:

PUT _watcher/watch/node_event
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": [
          ".marvel-*"
        ],
        "search_type": "query_then_fetch",
        "body": {
          "query": {
            "filtered": {
              "query": {
                "bool": {
                  "should": [
                    {
                      "match": {
                        "event": "node_left"
                      }
                    },
                    {
                      "match": {
                        "event": "node_joined"
                      }
                    }
                  ]
                }
              },
              "filter": {
                "range": {
                  "@timestamp": {
                    "from": "{{ctx.trigger.scheduled_time}}||-60s",
                    "to": "{{ctx.trigger.triggered_time}}"
                  }
                }
              }
            }
          },
          "fields": [
            "event",
            "message",
            "cluster_name"
          ],
          "sort": [
            {
              "@timestamp": {
                "order": "desc"
              }
            }
          ]
        }
      }
    }
  },
  "throttle_period": "1m",
  "condition": {
    "script": {
      "inline": "ctx.payload.hits.size() > 0 "
    }
  },
  "actions": {
    "send_email": {
      "email": {
        "to": "team@abc.com",
        "subject": "{{ctx.payload.hits.hits.0.fields.event}} the cluster",
        "body": "DEMO CLUSTER {{ctx.payload.hits.hits.0.fields.message}} the cluster {{ctx.payload.hits.hits.0.fields.cluster_name}} "
      }
    }
  }
}

Hope the above information helps to debug the issue.

Thanks,
Sumit


(Martijn Van Groningen) #8

This error is confusing; an IndexOutOfBoundsException should never bubble up like this, so that should be fixed.

Are you sure that the following expressions yield results?

  • ctx.payload.hits.hits.0.fields.event
  • ctx.payload.hits.hits.0.fields.message
  • ctx.payload.hits.hits.0.fields.cluster_name

Also, I think you need to use _source instead of fields. The fields notation can only be used if you specifically tell the _search api to return stored fields; otherwise you should use _source, which is included by default.
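Applying that suggestion, the email action templates would reference _source rather than fields. This is only a sketch of how I read the advice (it assumes the "fields" clause is dropped from the search body so full _source documents come back):

```json
"actions": {
  "send_email": {
    "email": {
      "to": "team@abc.com",
      "subject": "{{ctx.payload.hits.hits.0._source.event}} the cluster",
      "body": "DEMO CLUSTER {{ctx.payload.hits.hits.0._source.message}} the cluster {{ctx.payload.hits.hits.0._source.cluster_name}}"
    }
  }
}
```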


(Martijn Van Groningen) #9

If you change your condition script to ctx.payload.hits.total > 0,
then I expect no history entries with errors.

Your current condition also evaluates to true when there are no hits. This allows the action to be executed, but the templates can't resolve their values, which causes an IndexOutOfBoundsException. The IndexOutOfBoundsException should be fixed in Watcher so that a better error is returned.
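In the watch body, the suggested change would look like this (a sketch; the rest of the watch stays as posted above):

```json
"condition": {
  "script": {
    "inline": "ctx.payload.hits.total > 0"
  }
}
```

With this condition the action only runs when the search actually returned hits, so the hits.hits.0 templates always have a document to resolve against.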


(Sumit Arora) #10

Thanks Martijn, the issue is now resolved after applying the changes you suggested.


(Owulff) #11

I had the same problem with Watcher 1.0.1:
failed to execute action [node_event/send_email]
org.elasticsearch.common.mustache.MustacheException: Index: 0, Size: 0@[query-template:1]
at org.elasticsearch.common.mustache.codes.DefaultCode.get(DefaultCode.java:107)
at org.elasticsearch.common.mustache.codes.ValueCode.execute(ValueCode.java:55)
at org.elasticsearch.common.mustache.codes.DefaultMustache.run(DefaultMustache.java:30)

After I changed the condition to "ctx.payload.hits.total > 0", the exception was gone.

Could you update this in the docs?
https://www.elastic.co/guide/en/watcher/current/watching-marvel-data.html#watching-nodes
