Synonyms update causes cluster to go red and become unrecoverable

We have a 4-node cluster running elasticsearch-6.5.2. We recently did a synonyms update which caused the cluster state to turn red for two of our four indexes. The initial error we found was a shard failing to allocate:

 "Failed shard on node [x]: failed to create index, failure IllegalArgumentException[Failed to build synonyms]; nested: NotSerializableExceptionWrapper[parse_exception: Invalid synonyms rule at line 2]; nested: IllegalArgumentException[term: termination of pregnancy analyzed to a token (pregnancy) with position increment != 1 (got: 2)]"

The allocate_explanation was "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy".

We understand why the above error occurred: the analysis chain applies the stopwords filter first and the synonyms filter last, so a synonym rule containing a stopword loses a token and fails to parse. We had agreed a procedure whereby no stopwords were to be entered into the synonyms file, but one was entered by accident.
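A procedure like the one described (no stopwords in the synonyms file) could be backed by a small pre-deployment check. The following is a minimal sketch, not the poster's actual tooling: the stopword set and the function name `find_stopword_rules` are hypothetical, and it assumes Solr-style synonym rules and plain whitespace tokenization, which may differ from what the real analyzer does.

```python
# Hypothetical stopword list; in practice it should mirror the list
# configured on the index's stop filter.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}


def find_stopword_rules(synonym_lines, stopwords=STOPWORDS):
    """Return (line_number, term, stopword) for each synonym term containing a stopword."""
    problems = []
    for lineno, raw in enumerate(synonym_lines, start=1):
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Solr-style rules look like "a, b => c" or "a, b, c".
        for side in line.split("=>"):
            for term in side.split(","):
                # Assumption: whitespace splitting approximates the tokenizer.
                for word in term.lower().split():
                    if word in stopwords:
                        problems.append((lineno, term.strip(), word))
    return problems
```

Running this over the lines of the synonyms file before restarting any node would flag a rule such as "termination of pregnancy" because of the stopword "of".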

What we don't understand is why the synonyms update caused the state to turn red and not recover. Since this happened on the live instance, we quickly rebuilt the indexes and deleted the old ones when we realised they weren't recoverable. The logs contain some interesting entries about "failed to list shard for shard_store on node"; please see: https://gist.github.com/imranazad/7436c43bb7ca87a1ce1f64b988d22a83

Just to add context: when we update the synonyms file, we restart the elasticsearch service via the command line.

So what made the indexes unrecoverable? I'm not convinced it was a direct result of the synonyms update. Yes, that would have stopped the shard allocation, but it shouldn't have made the indexes unrecoverable.

Hi @imranazad

Sorry to hear you are having trouble with restarting your cluster. I just tried re-creating the scenario you described locally with 6.8.3. I can see the synonym parsing IllegalArgumentException that you already diagnosed correctly (btw, the "lenient" parameter can also help by skipping invalid synonym rules; the problem there is that you don't notice they're not working until much later). However, the result for me is that the index in question is "closed" after restart and cannot be opened, but I don't get a red state.
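For reference, here is a rough sketch of index settings with the "lenient" parameter set on the synonym filter, assuming the Elasticsearch version in use supports it. The filter names, analyzer name, and synonyms path are hypothetical and not taken from the original cluster; the Python dict is simply serialized to the JSON body that would be sent when creating the index.

```python
import json

# Index settings sketch: a stop filter followed by a synonym filter.
# All names below (my_stop, my_synonyms, my_analyzer, the file path)
# are hypothetical placeholders.
settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_stop": {"type": "stop", "stopwords": "_english_"},
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms_path": "analysis/synonyms.txt",
                    # Skip rules that fail to parse instead of failing
                    # index creation; skipped rules are silently inactive.
                    "lenient": True,
                },
            },
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_stop", "my_synonyms"],
                }
            },
        }
    }
}

# JSON request body for index creation.
body = json.dumps(settings, indent=2)
```

The trade-off is the one noted above: with "lenient" enabled a bad rule no longer blocks the index, but nothing tells you that the rule is being ignored.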

Do you remember the update procedure? How many nodes do you have, and did you take them all down or perform a rolling restart similar to the procedure described here (just without the version upgrade)?

The scenario I tried was just a single local node, stopped that one, added an invalid line to the synonyms file and restarted, leading to a "closed" index but no red state.

At this point I would also doubt the synonyms being the only culprit, but to be sure we'd need a bit more insight into the logs.

Hey Christoph,

Thanks so much for getting back to me, really appreciate you trying to reproduce the issue.

We figured out in the end what the issue was. In a panic we rolled back the synonyms file and decided to rebuild the indexes, whereas all we really needed to do was roll back the synonyms file and then wait for the cluster to recover by itself.
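The "roll back and wait" approach could be sketched as a small polling loop against the cluster health API. This is only an illustration under assumptions: the base URL, retry counts, and the function names `fetch_health` and `wait_for_status` are hypothetical, and the `fetch` and `sleep` arguments are injectable so the loop can be tried without a live cluster.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:9200"):
    """Fetch the cluster health document (base_url is an assumed default)."""
    with urllib.request.urlopen(base_url + "/_cluster/health") as resp:
        return json.load(resp)


def wait_for_status(fetch=fetch_health, wanted=("green",), attempts=30,
                    delay=10.0, sleep=time.sleep):
    """Poll until the cluster status is in `wanted`; return it, or None on timeout."""
    for attempt in range(attempts):
        status = fetch()["status"]
        if status in wanted:
            return status
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

After restoring the previous synonyms file and restarting the nodes, something like `wait_for_status(wanted=("green", "yellow"))` would report whether the cluster came back on its own. If shards stay unassigned because their allocation retries were exhausted, the cluster reroute API with `retry_failed=true` may be worth trying before rebuilding anything.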

Thanks again, this can be closed.

