Can't shutdown Elasticsearch when a watch is stuck

Enniu_51 · October 9, 2015, 4:49am

Es log:

[2015-10-09 12:29:25,149][WARN ][watcher.watch            ] [crawler_service_001] failed to acquire lock on watch [my-watch] (waited for [30 seconds]). It is possible that for some reason this watch execution is stuck
[2015-10-09 12:37:48,938][INFO ][action.admin.cluster.node.shutdown] [crawler_service_001] [cluster_shutdown]: requested, shutting down in [1s]
[2015-10-09 12:37:49,941][INFO ][watcher                  ] [crawler_service_001] stopping watch service...
^C
[2015-10-09 12:38:49,919][INFO ][node                     ] [crawler_service_001] stopping ...

I put a watch, and I find it stuck, then I try to shutdown the Elasticsearch by:

 curl -XPOST 'http://localhost:9200/_shutdown'

But Elasticsearch can't shutdown, it's just stopping(last more than 10 minutes, from the time I post), what should I do now?

warkolm · October 9, 2015, 4:50am

Have you tried using service elasticsearch shutdown?

Enniu_51 · October 9, 2015, 4:55am

Sorry but I just don't understand what does service elasticsearch shutdown mean... Could give me something like curl -X '...'? Or tell me some more details?

warkolm · October 9, 2015, 5:13am

It's a linux command, which won't apply on Windows.

Enniu_51 · October 9, 2015, 5:16am

I googled about service, but I didn't start elasticsearch as service. I just ES_HOME/bin/elasticsearch.

warkolm · October 9, 2015, 5:26am

Ok then this doesn't matter.

What's the current status of the watch right now?

Enniu_51 · October 9, 2015, 5:30am

I run this before I try to shutdown Elasticsearch:

curl -XGET 'http://localhost:9200/_watcher/stats/_all?emit_stacktraces&pretty'
{
  "watcher_state" : "started",
  "watch_count" : 0,
  "execution_thread_pool" : {
    "queue_size" : 17,
    "max_size" : 10
  },
  "current_watches" : [ {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_2-2015-10-09T03:32:00.458Z",
    "triggered_time" : "2015-10-09T03:32:00.458Z",
    "execution_time" : "2015-10-09T03:32:00.458Z",
    "execution_phase" : "actions",
    "executed_actions" : [ ],
    "stack_trace" : [ "java.net.SocketInputStream.socketRead0(Native Method)", "java.net.SocketInputStream.socketRead(SocketInputStream.java:116)", "java.net.SocketInputStream.read(SocketInputStream.java:170)", "java.net.SocketInputStream.read(SocketInputStream.java:141)", "com.sun.mail.util.TraceInputStream.read(TraceInputStream.java:124)", "java.io.BufferedInputStream.fill(BufferedInputStream.java:246)", "java.io.BufferedInputStream.read(BufferedInputStream.java:265)", "com.sun.mail.util.LineInputStream.readLine(LineInputStream.java:89)", "com.sun.mail.smtp.SMTPTransport.readServerResponse(SMTPTransport.java:2285)", "com.sun.mail.smtp.SMTPTransport.openServer(SMTPTransport.java:2032)", "com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:697)", "javax.mail.Service.connect(Service.java:364)", "org.elasticsearch.watcher.actions.email.service.Account.send(Account.java:99)", "org.elasticsearch.watcher.actions.email.service.InternalEmailService.send(InternalEmailService.java:84)", "org.elasticsearch.watcher.actions.email.service.InternalEmailService.send(InternalEmailService.java:78)", "org.elasticsearch.watcher.actions.email.ExecutableEmailAction.execute(ExecutableEmailAction.java:67)", "org.elasticsearch.watcher.actions.ActionWrapper.execute(ActionWrapper.java:104)", "org.elasticsearch.watcher.execution.ExecutionService.executeInner(ExecutionService.java:379)", "org.elasticsearch.watcher.execution.ExecutionService.execute(ExecutionService.java:271)", "org.elasticsearch.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:417)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)", "java.lang.Thread.run(Thread.java:745)" ]
  } ],
  "queued_watches" : [ {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_12-2015-10-09T03:42:00.138Z",
    "triggered_time" : "2015-10-09T03:42:00.137Z",
    "execution_time" : "2015-10-09T03:42:00.138Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_13-2015-10-09T03:43:00.154Z",
    "triggered_time" : "2015-10-09T03:43:00.154Z",
    "execution_time" : "2015-10-09T03:43:00.154Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_14-2015-10-09T03:44:00.170Z",
    "triggered_time" : "2015-10-09T03:44:00.170Z",
    "execution_time" : "2015-10-09T03:44:00.170Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_15-2015-10-09T03:45:00.187Z",
    "triggered_time" : "2015-10-09T03:45:00.186Z",
    "execution_time" : "2015-10-09T03:45:00.187Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_16-2015-10-09T03:46:00.207Z",
    "triggered_time" : "2015-10-09T03:46:00.207Z",
    "execution_time" : "2015-10-09T03:46:00.207Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_17-2015-10-09T03:47:00.223Z",
    "triggered_time" : "2015-10-09T03:47:00.223Z",
    "execution_time" : "2015-10-09T03:47:00.223Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_18-2015-10-09T03:48:00.241Z",
    "triggered_time" : "2015-10-09T03:48:00.241Z",
    "execution_time" : "2015-10-09T03:48:00.241Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_19-2015-10-09T03:49:00.273Z",
    "triggered_time" : "2015-10-09T03:49:00.272Z",
    "execution_time" : "2015-10-09T03:49:00.273Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_20-2015-10-09T03:50:00.291Z",
    "triggered_time" : "2015-10-09T03:50:00.291Z",
    "execution_time" : "2015-10-09T03:50:00.291Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_21-2015-10-09T03:51:00.308Z",
    "triggered_time" : "2015-10-09T03:51:00.308Z",
    "execution_time" : "2015-10-09T03:51:00.308Z"
  } ...(delete some queued watches because post length limit exceed)]
}

but now, I run the same as before, it just get:

curl -XGET 'http://localhost:9200/_watcher/stats/_all?emit_stacktraces&pretty'
curl: (7) couldn't connect to host

warkolm · October 9, 2015, 5:34am

It looks like there is a problem with your SMTP setup. What does my-watch look like?

You might need to kill the process to get ES to stop.

Enniu_51 · October 9, 2015, 5:36am

public boolean deployWatch() {
        WatchSourceBuilder watchSourceBuilder = WatchSourceBuilders.watchBuilder();
        watchSourceBuilder.trigger(TriggerBuilders.schedule(Schedules.cron("0 0/1 * * * ?")));
        SearchRequest request = Requests.searchRequest("logstash*").source(searchSource()
                .query(filteredQuery(matchQuery("error_code", "*"), boolFilter()
                        .must(rangeFilter("date").gt("{{ctx.trigger.scheduled_time}}"))
                        .must(rangeFilter("date").lt("{{ctx.execution_time}}")))));
        watchSourceBuilder.input(new SearchInput(request, null));
        watchSourceBuilder.condition(new ScriptCondition(Script.inline("ctx.payload.hits.total > 0").build()));
        EmailTemplate.Builder emailBuilder = EmailTemplate.builder();
        emailBuilder.to("xxx@yyy.com");
        emailBuilder.subject("Error recently encountered");
        EmailAction.Builder emailActionBuilder = EmailAction.builder(emailBuilder.build());
        watchSourceBuilder.addAction("email_someone", emailActionBuilder.build());
        PutWatchResponse putWatchResponse = watcherClient.preparePutWatch("my-watch")
                .setSource(watchSourceBuilder)
                .get();
        if (!putWatchResponse.isCreated()) {
            return false;
        }
        return true;
    }

I use java client to put that watch. It get stuck after post two emails.

Enniu_51 · October 9, 2015, 5:43am

I used to use kill pid or (if previous not work)kill -9 pid to terminate Elasticsearch, but in the last time, it broke the lucene index of some shard... Since that time, I never use kill pid to shutdown elaticsearch...

Enniu_51 · October 9, 2015, 5:54am

Okay, I use kill -9 and restart. Nothing go wrong and my-watch is deleted(I tried force delete before shutdown). But is there any better way than kill -9 to handle things like this?

dadoonet · October 9, 2015, 6:03am

Not answering directly but FYI I just reported this issue to the watcher project.

Enniu_51 · October 9, 2015, 6:05am

Could you give me the link of that issue?

dadoonet · October 9, 2015, 6:17am

No. It's not a public URL.

mvg · October 9, 2015, 8:05am

Watcher should just shutdown regardless of how your watches are configured. When watchers stops it will not execute any new watches and remove any queued watch executions and then wait unto 30 seconds for the already executed watches to complete. If by then not all watch executions have been completed, watcher will continue to shutdown.

If Watcher gets stuck again, can you share a hot threads output here?
(curl 'localhost:9200/_cluster/nodes/hotthreads'

Enniu_51 · October 9, 2015, 8:09am

Okay, I will.

Enniu_51 · October 9, 2015, 9:49am

Something similarly occurs:

curl 'localhost:9200/_cluster/nodes/hotthreads'

::: [crawler_service_001][2Uv6aeMXTY2-QU-wPSHT5g][crawlerservice2][inet[/10.0.40.156:9300]]
   Hot threads at 2015-10-09T09:42:54.691Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.0% (116.9micros out of 500ms) cpu usage by thread 'elasticsearch[crawler_service_001][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       java.lang.Thread.run(Thread.java:745)

Then I try to shutdown elasticsearch:

curl -XPOST 'http://localhost:9200/_shutdown'

LOG:

[2015-10-09 17:46:41,904][INFO ][action.admin.cluster.node.shutdown] [crawler_service_001] [cluster_shutdown]: requested, shutting down in [1s]
[2015-10-09 17:46:42,906][INFO ][watcher                  ] [crawler_service_001] stopping watch service...
^C[2015-10-09 17:46:56,950][INFO ][node                     ] [crawler_service_001] stopping ...
^C^C

ES gets stuck again when watch queue is blocked..

mvg · October 9, 2015, 2:03pm

Is that the entire hot threads dump? I would expect it to contain more stack traces of running threads. Can you also share the hot threads dump after you instructed ES to shutdown?

Enniu_51 · October 12, 2015, 12:22am

How to get entire hot threads dump? curl 'localhost:9200/_cluster/nodes/hotthreads' is not enough?

Topic		Replies	Views
Unable to update/delete/execute watch Elasticsearch elastic-stack-alerting	16	3956	July 6, 2017
A question for the "Stopping watcher service" message Elasticsearch	2	25	August 9, 2024
Elasticsearch Watcher Stops processing after recovering from full disk Elasticsearch	2	300	October 3, 2018
Unable start elasticsearch Elasticsearch	5	958	January 3, 2022
Can't create a new Watch Elasticsearch elastic-stack-alerting	2	1015	July 6, 2017

Can't shutdown Elasticsearch when a watch is stuck

Related topics