Illegal_state_exception is preventing deletion of watches

Hi.

In searching for a solution to my problem, I found this post from last November:

This is my problem exactly. But the solution isn't included in the post, and when I tried using the force option as suggested by @mvg ...
curl -XDELETE 'http://myServer/_watcher/watch/myWatch?force'
... I got the same illegal_state_exception / not started error.

So, @mcku, if you could let me know what your solution was, or if others can tell me how to fix this, I'd appreciate it. I saw that you mentioned being careful with trigger intervals -- I don't imagine that's my problem, since this was running for months before the problem popped up (none of my watches fire more often than once per minute). The server did run out of space recently, and we had to do a bunch of cleanup to get things running again, so I'm wondering if something related to that caused this problem. Some of the watch executions also seem to be happening at irregular intervals, so that might be related. Thanks.

Hey,

did you stop watcher before trying to do this? Can you make sure it is started?
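To check the state and (re)start it, something along these lines should work (assuming the node is reachable on localhost:9200):

curl 'localhost:9200/_watcher/stats?pretty'
curl -XPUT 'localhost:9200/_watcher/_start'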

--Alex

Thanks for your reply, Alex. The answer to your question is pretty bizarre: My watches had been running (though at strangely irregular intervals as I mentioned earlier), and in fact the reason that I tried to delete one of them was because it was alerting me in situations when I didn't want it to. So, clearly Watcher was running, and thus it never occurred to me to check whether it was running. (Being kind of a novice, I assumed that the "not started" part of the error was referring to the watch, not Watcher -- which still didn't make any sense, but anyway...) But after receiving your reply, I checked Watcher's status, and sure enough, it said it's not running:
{ "watcher_state" : "stopped", "watch_count" : 18, "execution_thread_pool" : { "queue_size" : 0, "max_size" : 20 }, "manually_stopped" : false }
So I tried starting it, and that gave me:
failed to start watcher. please wait for the cluster to become ready or try to start Watcher manually
So then I restarted ES (which has been working fine), and I got that same "failed to start" message again. And yet my watches are still running (although irregularly).

Also -- and I don't know if this is related -- the license information during startup said that our license expires on August 10, so somehow it reverted from our license, which is good until next April, to a 30-day trial license. I was able to reinstall our license successfully, but things aren't working any better.

So I'm lost. I keep thinking that maybe some stuff got corrupted when the server ran out of space recently.

Also, I now see that we're sometimes getting errors saying that a watch failed to run due to thread pool capacity. I can't seem to find any useful information about that.

Thanks for any further help you can provide.

Hey,

that does sound weird. Running out of space might mean that some information could not be stored, so things possibly went out of sync (wildly speculating). Could you also have suffered a split-brain scenario due to a configuration problem, and thus ended up with two clusters with different licenses that failed to join again properly without applying your valid license?

Have you ensured there is only a single master node?
Which Elasticsearch version are you using?
Can you call _watcher/stats/_all, which also shows current executions? (There should be none; see the example calls below.)
Also, can you supply the stack trace from the master node that occurred when you tried to start Watcher?
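For reference, the calls would look something like this (assuming the node is reachable on localhost:9200):

curl 'localhost:9200/_cat/master?v'
curl 'localhost:9200/_watcher/stats/_all?pretty'

The first one should list exactly one master node.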

--Alex

Thanks, Alex. Some answers to your questions...

We have only one server, so presumably there shouldn't be a multiple cluster/node problem. I checked _nodes and saw only our one node, so at least there doesn't seem to be any weirdness going on there! :)

We're using version 2.0.0 for ES and Watcher.

Here's the _watcher/stats/_all output:
{ "watcher_state" : "stopped", "watch_count" : 18, "execution_thread_pool" : { "queue_size" : 0, "max_size" : 20 }, "current_watches" : [ ], "queued_watches" : [ ], "manually_stopped" : false }And I checked some of my watches, and the last time they executed was Monday evening, so everything seems to have died completely now -- not even executing at irregular intervals.

Here's the stack trace that was printed along with the Watcher failure message:
failed to start watcher. please wait for the cluster to become ready or try to start Watcher manually
StrictDynamicMappingException[mapping set to strict, dynamic introduction of [messages] within [watch_record] is not allowed]
    at org.elasticsearch.index.mapper.DocumentParser.parseArray(DocumentParser.java:393)
    at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:259)
    at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:128)
    at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:79)
    at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:318)
    at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:503)
    at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:494)
    at org.elasticsearch.action.support.replication.TransportReplicationAction.prepareIndexOperationOnPrimary(TransportReplicationAction.java:1052)
    at org.elasticsearch.action.support.replication.TransportReplicationAction.executeIndexRequestOnPrimary(TransportReplicationAction.java:1060)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:170)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.performOnPrimary(TransportReplicationAction.java:579)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1.doRun(TransportReplicationAction.java:452)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Thanks again.

Hey,

have you maybe deleted your index templates? Is your .watch_history index mapping or the watch history template gone (check with the template API)? When did your out-of-disk-space problem happen -- before that? I'm still puzzled by the error message above.
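To check, something like this should do (assuming localhost:9200):

curl 'localhost:9200/_template/watch_history?pretty'
curl 'localhost:9200/.watch_history*/_mapping?pretty'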

Also, is it possible for you to upgrade? There has been a fair share of changes and bugfixes for Watcher between 2.0.0 and 2.3.4.

--Alex

Well Alex, maybe we're getting somewhere now. The result for _template includes:
"watches" : { "order" : 2147483647, "template" : ".watches", "settings" : { "index" : { "mapper" : { "dynamic" : "false" }, "number_of_shards" : "1" } }, "mappings" : { [leaving out all the mapping stuff] }, "aliases" : { } }, "watch_history" : { "order" : 2147483647, "template" : ".watch_history*", "settings" : { "index" : { "mapper" : { "dynamic" : "false" }, "number_of_shards" : "1" } }, "mappings" : { [leaving out all the mapping stuff] }, "aliases" : { } }, "triggered_watches" : { "order" : 2147483647, "template" : ".triggered_watches", "settings" : { "index" : { "mapper" : { "dynamic" : "false" }, "number_of_shards" : "1", "refresh_interval" : "-1" } }, "mappings" : { [leaving out all the mapping stuff] }, "aliases" : { } }
So the .watch_history template is there. But .watch_history/_mapping returned an index_not_found_exception, while .watches/_mapping and .triggered_watches/_mapping returned mappings. So I re-created the .watch_history index and restarted. But I'm still getting the same errors -- "failed to start watcher" on startup, and illegal_state_exception / "not started" when I try to delete a watch.

And yes, I discovered the inability to delete watches after the server ran out of space and we recovered from that.

Any other ideas? Thanks.

Hey,

can you paste the full output of the watch history mapping? That entry looks weird because dynamic is set to false, but it should be set to strict -- which can result in the error above.

After that, you can try to delete the watch history template by executing

curl -XDELETE localhost:9200/_template/watch_history

Shortly after this you can call

curl localhost:9200/_template/watch_history

and you should see that the template has been created again (Watcher checks whether the template is missing and adds it back automatically).

Did you maybe upgrade from Watcher 1.x? I'm wondering where this template comes from (again, anything is harder to tell once you've run out of disk space).

--Alex

Well, it seems that I can't give you the output that you want. I tried to post a reply containing the watch history mapping, but I got an error saying that posts are limited to 5K characters. So I tried to upload the mapping as a file, along with another file containing a diff, explained below, but I got an error saying that it only allows uploads of image files. So I tried renaming the files with a .jpg extension to fool it into uploading them, but it's too smart for that -- it checked to see whether the files were really in jpeg format and said it couldn't upload them because they appeared to be corrupt. If I have to, I could break the output up into many pieces and post them as a series of replies, but if you have a way for me to get the entire contents to you, let me know.

Meanwhile, here's a discussion of what I did and what happened...

I did what you suggested. After capturing the watch_history mapping, I deleted the template, and I almost immediately got an alert (I've been getting that alert occasionally (and correctly) during all of this, at odd intervals, but something about the deletion or re-creation seems to have triggered it).

After the watch history template was re-created, I diffed it with the earlier one, and there were quite a few differences. But one thing in particular didn't change -- the dynamic setting that you said should be strict was still false after the template was re-created.

After re-creation, Watcher still reported that it was stopped, and when I tried to start it, I still got the same old "failed to start" error.

As for whether I upgraded from Watcher 1.x, to be honest, I don't remember. I set all this up last fall, and I might have started with 1.x and then upgraded, or I might have started with 2.0.0.

Thanks again.

Hey,

you can just put the output into a gist instead of trying to post it here. Can you paste the output of this command here as well (no worries, it is short)?

curl 'localhost:9200/_template?filter_path=**.template&pretty'

I'd like to rule out that there is more than one template fiddling with your indices.

Also, I think we are talking about two different dynamic settings: one is set to false and belongs to the watch history index settings, and one is set to strict and belongs to the watch_record type mapping, but both are part of the watch_history template.
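Roughly, the template looks like this (a trimmed sketch, not the full mapping):

{
  "watch_history" : {
    "settings" : {
      "index" : { "mapper" : { "dynamic" : "false" } }
    },
    "mappings" : {
      "watch_record" : { "dynamic" : "strict", [rest of the mapping] }
    }
  }
}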

--Alex

Okay, the gist is here:
[https://gist.github.com/anonymous/790feba7f403d2afea7aa2bbb60d1d5e]

It appears that you're correct about the false-vs-strict difference.

The _template output is:
{ "triggered_watches" : { "template" : ".triggered_watches" }, "watch_history" : { "template" : ".watch_history*" }, "watches" : { "template" : ".watches" } }
Thanks.

Hey,

this gets weirder every time I read your descriptions -- and I don't mean your descriptions are weird; I'd really like to get to the bottom of things here. Two things we can do now:

First, was the exception you pasted above the whole thing, or only part of it?

Second, can you add this to the logging.yml file, then start Elasticsearch, and then try to start Watcher again? You need to add the watcher: TRACE entry; the other entries should already be there.

es.logger.level: INFO
rootLogger: ${es.logger.level}, console, file
logger:
  # log action execution errors for easier debugging
  action: DEBUG
  watcher: TRACE

Now we should see more information in the log file when you start Watcher. Can you paste that into a gist, please?

Thanks a lot for your patience.

--Alex

Hi --

Yes, the exception I pasted earlier was the whole thing.

When I turned on TRACE for Watcher and restarted, I got a flood of the same few messages repeatedly, several hundred per minute, so I turned it back off and restarted again. I grepped through them to find the set of unique messages, and I'm pasting that set here:
[2016-07-24 23:39:59,465][TRACE][watcher.support ] [Rom the Spaceknight] not adding index template [triggered_watches], because it already exists
[2016-07-24 23:39:59,466][TRACE][watcher.support ] [Rom the Spaceknight] not adding index template [watch_history], because it already exists
[2016-07-24 23:39:59,466][TRACE][watcher.support ] [Rom the Spaceknight] not adding index template [watches], because it already exists
[2016-07-24 23:39:59,468][DEBUG][watcher.watch ] [Rom the Spaceknight] not all primary shards active for index [.watches], so we cannot start
[2016-07-24 23:39:59,469][DEBUG][watcher ] [Rom the Spaceknight] not starting watcher. because the cluster isn't ready yet to run watcher
Thanks.

Hey,

ok, that gives us more info. I suspect that your shard is corrupted (which may have been a side effect of your out-of-disk-space issue earlier).

Can you first check if the .watches index is maybe just closed (which would have been done manually)? Just use the cat API to find out:

GET _cat/indices?v

If the index is marked as closed, you can try to open it again. If not, it's highly likely that the data is corrupted, so we should try to find out. You can shut down your node, configure index.shard.check_on_startup: true in your Elasticsearch config file, and restart. If you then see a message like "index check failure but can't fix it" in your logs, you can try setting index.shard.check_on_startup: fix and restart, which -- under certain circumstances -- tries to recreate segment files to make the index work again. This won't work for the translog, but it is worth a try.
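For example, opening a closed index and enabling the startup check would look something like this (assuming localhost:9200 and the default config locations):

curl -XPOST 'localhost:9200/.watches/_open'

and in elasticsearch.yml:

index.shard.check_on_startup: true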

Also, it makes sense to increase logging for shard recovery by setting index.shard: DEBUG, so you get some more info in your logs.
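In logging.yml, that would be something like:

logger:
  index.shard: DEBUG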

--Alex

Okay, the .watches index wasn't closed. This is the output I got from the cat API:
yellow open .watches 1 1 18 60 34kb 34kb

I added the two settings you suggested to elasticsearch.yml and restarted. I didn't get any log messages about an index check, and I still got the "watcher failed to start" message, so nothing seems to have changed.

Thanks.

Hey,

did you set the index.shard: DEBUG in the logging configuration? Did you then see a message like check index [success] in the logs?

--Alex

Oops, sorry -- I put the debug line in elasticsearch.yml instead of logging.yml! Now that I've put it in logging.yml, I got this output:
[.watches][0] check index [success]
But I'm still getting the "watcher failed to start" message, and its status is still stopped.

Thanks.

Hey,

is it somehow possible that you can share your data/ directory, so I can try to reproduce this locally with Elasticsearch 2.0.0? Of course not publicly; my email is $firstname.$lastname@elastic.co.

--Alex

I can ask my management if it'd be okay to share our data with you -- I don't expect that it would be a problem. I think the bigger issue is that our indices currently have 55GB of data, so simply giving you a copy of the whole data directory isn't practical. Do you think it would make sense to tar up a skeleton version of the directory with just one or two indices in it? Our smallest daily indices are about 100MB each (most are much bigger than that). Looks like they compress down to about half that.

Thanks.

Hey,

actually, I think the most important part is the .watches index; I'm not sure I need anything else for a first test.

Directories needed from the data directory:

  • data/$CLUSTER/nodes/0/indices/.watches
  • data/$CLUSTER/nodes/0/indices/.triggered_watches
  • data/$CLUSTER/nodes/0/_state/
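Something like this should package them up (hypothetical paths -- adjust $CLUSTER and the node directory to your setup):

cd /path/to/elasticsearch
tar czf watcher-data.tar.gz \
    data/$CLUSTER/nodes/0/indices/.watches \
    data/$CLUSTER/nodes/0/indices/.triggered_watches \
    data/$CLUSTER/nodes/0/_state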

--Alex