Can't shutdown Elasticsearch when a watch is stuck


(Sayakiss) #1

Es log:

[2015-10-09 12:29:25,149][WARN ][watcher.watch            ] [crawler_service_001] failed to acquire lock on watch [my-watch] (waited for [30 seconds]). It is possible that for some reason this watch execution is stuck
[2015-10-09 12:37:48,938][INFO ][action.admin.cluster.node.shutdown] [crawler_service_001] [cluster_shutdown]: requested, shutting down in [1s]
[2015-10-09 12:37:49,941][INFO ][watcher                  ] [crawler_service_001] stopping watch service...
^C
[2015-10-09 12:38:49,919][INFO ][node                     ] [crawler_service_001] stopping ...

I put a watch, and I find it stuck, then I try to shutdown the Elasticsearch by:

 curl -XPOST 'http://localhost:9200/_shutdown'

But Elasticsearch can't shutdown, it's just stopping(last more than 10 minutes, from the time I post), what should I do now?


(Mark Walkom) #2

Have you tried using service elasticsearch shutdown?


(Sayakiss) #3

Sorry but I just don't understand what does service elasticsearch shutdown mean... Could give me something like curl -X '...'? Or tell me some more details?


(Mark Walkom) #4

It's a linux command, which won't apply on Windows.


(Sayakiss) #5

I googled about service, but I didn't start elasticsearch as service. I just ES_HOME/bin/elasticsearch.


(Mark Walkom) #6

Ok then this doesn't matter.

What's the current status of the watch right now?


(Sayakiss) #7

I run this before I try to shutdown Elasticsearch:

curl -XGET 'http://localhost:9200/_watcher/stats/_all?emit_stacktraces&pretty'
{
  "watcher_state" : "started",
  "watch_count" : 0,
  "execution_thread_pool" : {
    "queue_size" : 17,
    "max_size" : 10
  },
  "current_watches" : [ {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_2-2015-10-09T03:32:00.458Z",
    "triggered_time" : "2015-10-09T03:32:00.458Z",
    "execution_time" : "2015-10-09T03:32:00.458Z",
    "execution_phase" : "actions",
    "executed_actions" : [ ],
    "stack_trace" : [ "java.net.SocketInputStream.socketRead0(Native Method)", "java.net.SocketInputStream.socketRead(SocketInputStream.java:116)", "java.net.SocketInputStream.read(SocketInputStream.java:170)", "java.net.SocketInputStream.read(SocketInputStream.java:141)", "com.sun.mail.util.TraceInputStream.read(TraceInputStream.java:124)", "java.io.BufferedInputStream.fill(BufferedInputStream.java:246)", "java.io.BufferedInputStream.read(BufferedInputStream.java:265)", "com.sun.mail.util.LineInputStream.readLine(LineInputStream.java:89)", "com.sun.mail.smtp.SMTPTransport.readServerResponse(SMTPTransport.java:2285)", "com.sun.mail.smtp.SMTPTransport.openServer(SMTPTransport.java:2032)", "com.sun.mail.smtp.SMTPTransport.protocolConnect(SMTPTransport.java:697)", "javax.mail.Service.connect(Service.java:364)", "org.elasticsearch.watcher.actions.email.service.Account.send(Account.java:99)", "org.elasticsearch.watcher.actions.email.service.InternalEmailService.send(InternalEmailService.java:84)", "org.elasticsearch.watcher.actions.email.service.InternalEmailService.send(InternalEmailService.java:78)", "org.elasticsearch.watcher.actions.email.ExecutableEmailAction.execute(ExecutableEmailAction.java:67)", "org.elasticsearch.watcher.actions.ActionWrapper.execute(ActionWrapper.java:104)", "org.elasticsearch.watcher.execution.ExecutionService.executeInner(ExecutionService.java:379)", "org.elasticsearch.watcher.execution.ExecutionService.execute(ExecutionService.java:271)", "org.elasticsearch.watcher.execution.ExecutionService$WatchExecutionTask.run(ExecutionService.java:417)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)", "java.lang.Thread.run(Thread.java:745)" ]
  } ],
  "queued_watches" : [ {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_12-2015-10-09T03:42:00.138Z",
    "triggered_time" : "2015-10-09T03:42:00.137Z",
    "execution_time" : "2015-10-09T03:42:00.138Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_13-2015-10-09T03:43:00.154Z",
    "triggered_time" : "2015-10-09T03:43:00.154Z",
    "execution_time" : "2015-10-09T03:43:00.154Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_14-2015-10-09T03:44:00.170Z",
    "triggered_time" : "2015-10-09T03:44:00.170Z",
    "execution_time" : "2015-10-09T03:44:00.170Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_15-2015-10-09T03:45:00.187Z",
    "triggered_time" : "2015-10-09T03:45:00.186Z",
    "execution_time" : "2015-10-09T03:45:00.187Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_16-2015-10-09T03:46:00.207Z",
    "triggered_time" : "2015-10-09T03:46:00.207Z",
    "execution_time" : "2015-10-09T03:46:00.207Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_17-2015-10-09T03:47:00.223Z",
    "triggered_time" : "2015-10-09T03:47:00.223Z",
    "execution_time" : "2015-10-09T03:47:00.223Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_18-2015-10-09T03:48:00.241Z",
    "triggered_time" : "2015-10-09T03:48:00.241Z",
    "execution_time" : "2015-10-09T03:48:00.241Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_19-2015-10-09T03:49:00.273Z",
    "triggered_time" : "2015-10-09T03:49:00.272Z",
    "execution_time" : "2015-10-09T03:49:00.273Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_20-2015-10-09T03:50:00.291Z",
    "triggered_time" : "2015-10-09T03:50:00.291Z",
    "execution_time" : "2015-10-09T03:50:00.291Z"
  }, {
    "watch_id" : "my-watch",
    "watch_record_id" : "my-watch_21-2015-10-09T03:51:00.308Z",
    "triggered_time" : "2015-10-09T03:51:00.308Z",
    "execution_time" : "2015-10-09T03:51:00.308Z"
  } ...(delete some queued watches because post length limit exceed)]
}

but now, I run the same as before, it just get:

curl -XGET 'http://localhost:9200/_watcher/stats/_all?emit_stacktraces&pretty'
curl: (7) couldn't connect to host

(Mark Walkom) #8

It looks like there is a problem with your SMTP setup. What does my-watch look like?

You might need to kill the process to get ES to stop.


(Sayakiss) #9
public boolean deployWatch() {
        WatchSourceBuilder watchSourceBuilder = WatchSourceBuilders.watchBuilder();
        watchSourceBuilder.trigger(TriggerBuilders.schedule(Schedules.cron("0 0/1 * * * ?")));
        SearchRequest request = Requests.searchRequest("logstash*").source(searchSource()
                .query(filteredQuery(matchQuery("error_code", "*"), boolFilter()
                        .must(rangeFilter("date").gt("{{ctx.trigger.scheduled_time}}"))
                        .must(rangeFilter("date").lt("{{ctx.execution_time}}")))));
        watchSourceBuilder.input(new SearchInput(request, null));
        watchSourceBuilder.condition(new ScriptCondition(Script.inline("ctx.payload.hits.total > 0").build()));
        EmailTemplate.Builder emailBuilder = EmailTemplate.builder();
        emailBuilder.to("xxx@yyy.com");
        emailBuilder.subject("Error recently encountered");
        EmailAction.Builder emailActionBuilder = EmailAction.builder(emailBuilder.build());
        watchSourceBuilder.addAction("email_someone", emailActionBuilder.build());
        PutWatchResponse putWatchResponse = watcherClient.preparePutWatch("my-watch")
                .setSource(watchSourceBuilder)
                .get();
        if (!putWatchResponse.isCreated()) {
            return false;
        }
        return true;
    }

I use java client to put that watch. It get stuck after post two emails.


(Sayakiss) #10

I used to use kill pid or (if previous not work)kill -9 pid to terminate Elasticsearch, but in the last time, it broke the lucene index of some shard... Since that time, I never use kill pid to shutdown elaticsearch...


(Sayakiss) #11

Okay, I use kill -9 and restart. Nothing go wrong and my-watch is deleted(I tried force delete before shutdown). But is there any better way than kill -9 to handle things like this?


(David Pilato) #12

Not answering directly but FYI I just reported this issue to the watcher project.


(Sayakiss) #13

Could you give me the link of that issue?


(David Pilato) #14

No. It's not a public URL.


(Martijn Van Groningen) #15

Watcher should just shutdown regardless of how your watches are configured. When watchers stops it will not execute any new watches and remove any queued watch executions and then wait unto 30 seconds for the already executed watches to complete. If by then not all watch executions have been completed, watcher will continue to shutdown.

If Watcher gets stuck again, can you share a hot threads output here?
(curl 'localhost:9200/_cluster/nodes/hotthreads'


(Sayakiss) #16

Okay, I will.


(Sayakiss) #17

Something similarly occurs:

curl 'localhost:9200/_cluster/nodes/hotthreads'

::: [crawler_service_001][2Uv6aeMXTY2-QU-wPSHT5g][crawlerservice2][inet[/10.0.40.156:9300]]
   Hot threads at 2015-10-09T09:42:54.691Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
    0.0% (116.9micros out of 500ms) cpu usage by thread 'elasticsearch[crawler_service_001][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
       org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
       org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
       java.lang.Thread.run(Thread.java:745)

Then I try to shutdown elasticsearch:

curl -XPOST 'http://localhost:9200/_shutdown'

LOG:

[2015-10-09 17:46:41,904][INFO ][action.admin.cluster.node.shutdown] [crawler_service_001] [cluster_shutdown]: requested, shutting down in [1s]
[2015-10-09 17:46:42,906][INFO ][watcher                  ] [crawler_service_001] stopping watch service...
^C[2015-10-09 17:46:56,950][INFO ][node                     ] [crawler_service_001] stopping ...
^C^C

ES gets stuck again when watch queue is blocked..


(Martijn Van Groningen) #18

Is that the entire hot threads dump? I would expect it to contain more stack traces of running threads. Can you also share the hot threads dump after you instructed ES to shutdown?


(Sayakiss) #19

How to get entire hot threads dump? curl 'localhost:9200/_cluster/nodes/hotthreads' is not enough?


(system) #20