5.6.5: cluster.routing.allocation.enable changes to none on its own

ES version : 5.6.5
OS: CentOS 7

Suddenly, shards are no longer allocated automatically when an index is created.
The log shows no ERROR entries, only WARN.

[2017-12-14T15:45:27,714][WARN ][rest.suppressed          ] path: /test-a/test/_search, params: {index=test-a, type=test}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:272) ~[elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:130) ~[elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:241) ~[elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:107) ~[elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.action.search.InitialSearchPhase.lambda$performPhaseOnShard$4(InitialSearchPhase.java:205) ~[elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:184) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
[2017-12-14T15:47:07,168][WARN ][rest.suppressed          ] path: /test-a/_optimize, params: {index=test-a, type=_optimize}
org.elasticsearch.action.UnavailableShardsException: [test-a][4] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[test-a][4]] containing [index {[test-a][_optimize][AWBTxehA5yvZ8feaMwFL], source[_na_]}]]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryBecauseUnavailable(TransportReplicationAction.java:892) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retryIfUnavailable(TransportReplicationAction.java:728) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.doRun(TransportReplicationAction.java:681) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onTimeout(TransportReplicationAction.java:846) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:311) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:238) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.cluster.service.ClusterService$NotifyTimeout.run(ClusterService.java:1056) [elasticsearch-5.6.5.jar:5.6.5]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.5.jar:5.6.5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]

After a restart, allocation works again, but shard 4 is still not allocated.
The following INFO lines appeared during the restart:

[2017-12-14T15:53:03,946][INFO ][org.elasticsearch.node.Node] initialized
[2017-12-14T15:53:03,947][INFO ][org.elasticsearch.node.Node] starting ...
[2017-12-14T15:53:04,081][INFO ][org.elasticsearch.transport.TransportService] publish_address {xxx.xxx.xxx.xxx:9300}, bound_addresses {0.0.0.0:9300}
[2017-12-14T15:53:04,096][INFO ][org.elasticsearch.bootstrap.BootstrapChecks] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-12-14T15:53:07,147][INFO ][org.elasticsearch.cluster.service.ClusterService] new_master {Trv01-9300}{jyBEH2kqT_C0tpHHxIbWvg}{TK_KjL8GS4itkGcANep3GQ}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300}, reason: zen-disco-elected-as-master ([0] nodes joined)
[2017-12-14T15:53:07,161][INFO ][org.elasticsearch.http.netty4.Netty4HttpServerTransport] publish_address {xxx.xxx.xxx.xxx:9200}, bound_addresses {0.0.0.0:9200}
[2017-12-14T15:53:07,161][INFO ][org.elasticsearch.node.Node] started
[2017-12-14T15:53:07,302][INFO ][org.elasticsearch.common.settings.ClusterSettings] updating [cluster.routing.allocation.enable] from [ALL] to [none]  <-- ★ For some reason ES updated this on its own, and no later INFO line shows ES setting it back to all.
[2017-12-14T15:53:07,362][INFO ][org.elasticsearch.gateway.GatewayService] recovered [2] indices into cluster_state
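
To find out when the value changed and what it changed to, you can filter the ClusterSettings INFO lines out of the log. The sketch below makes two assumptions. First, it assumes the RPM default log directory /var/log/elasticsearch, so adjust this for your install. Second, allocation_changes is a hypothetical helper name and not an Elasticsearch tool. The function matches the message text "updating [cluster.routing.allocation.enable]" seen above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every log line in which the node reports a change
# to cluster.routing.allocation.enable.
# -F matches the text literally (the brackets are not a regex),
# -h omits file names when several log files are given.
allocation_changes() {
  grep -hF 'updating [cluster.routing.allocation.enable]' "$@"
}

# Assumption: RPM/yum install on CentOS 7 writes logs to /var/log/elasticsearch.
# allocation_changes /var/log/elasticsearch/*.log
```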

Setting it back to all manually restored normal behavior, but this behavior looks wrong to me.
For reference, there was no problem on 5.6.3.

elasticsearch% curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d' { "transient": { "cluster.routing.allocation.enable": "all" } } '
elasticsearch% curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d' { "transient": { "cluster.routing.rebalance.enable": "all" } } '

I think I have found the cause.
https://www.elastic.co/guide/en/elasticsearch/reference/6.0/rolling-upgrades.html

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "none"
  }
}
'

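I believe this command should use transient, not persistent.
Alternatively, when setting it back to all, it should be done with persistent rather than transient.
Transient settings are discarded on a full cluster restart, so the transient all I had set was lost and the persistent none took effect again.
The 5.6 manual used transient, but in 6.0 it appears to have changed to persistent.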

It occurred to me that I may have run this command while checking the upgrade procedure for 6.x.
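
The two consistent pairings can be sketched as follows. One option is to disable and re-enable allocation within the same scope. The other applies if a persistent none is already stored, as happened here: clear that stored value. This is a minimal sketch. It assumes a node at localhost:9200, and the ES_URL variable is an assumption you can use to point it elsewhere. allocation_body and set_allocation are hypothetical helper names. Assigning null resets the setting in that scope to its default.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not an Elasticsearch tool.
ES_URL="${ES_URL:-localhost:9200}"   # assumption: address of a cluster node

# Print the JSON body for a cluster.routing.allocation.enable update.
# $1 = scope: transient | persistent
# $2 = value: all | primaries | new_primaries | none | null
allocation_body() {
  local scope="$1" value="$2"
  case "$scope" in
    transient|persistent) ;;
    *) echo "unknown scope: $scope" >&2; return 1 ;;
  esac
  if [ "$value" = "null" ]; then
    # JSON null resets the setting in this scope to its default
    printf '{"%s":{"cluster.routing.allocation.enable":null}}\n' "$scope"
  else
    printf '{"%s":{"cluster.routing.allocation.enable":"%s"}}\n' "$scope" "$value"
  fi
}

# Send the update to the cluster (requires a running node).
set_allocation() {
  local body
  body="$(allocation_body "$1" "$2")" || return 1
  curl -s -XPUT "$ES_URL/_cluster/settings?pretty" \
    -H 'Content-Type: application/json' -d "$body"
}
```

For example, running "set_allocation persistent none" before the upgrade and "set_allocation persistent null" afterwards keeps both steps in the persistent scope. In that case nothing is left behind to reappear after a restart.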

Thanks for the question and the report.
I have opened an issue for it.

By the way, you can check the current settings with the following:

GET /_cluster/settings
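
The helper below shows which scope currently holds the allocation setting. It is a minimal sketch with several assumptions. The node is at localhost:9200, and the response is requested with flat_settings=true&pretty, so each setting sits on its own line. allocation_from_settings is a hypothetical name. The awk parsing relies on that pretty-printed layout and is not a real JSON parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a pretty-printed, flat_settings=true response from
# GET /_cluster/settings on stdin and print "scope=value" for each occurrence
# of cluster.routing.allocation.enable. Assumes one setting per line.
allocation_from_settings() {
  awk '
    /"persistent"/ { scope = "persistent" }
    /"transient"/  { scope = "transient" }
    /"cluster\.routing\.allocation\.enable"/ {
      n = split($0, parts, ":")   # the value follows the last colon
      v = parts[n]
      gsub(/[" ,]/, "", v)        # drop quotes, spaces and a trailing comma
      print scope "=" v
    }'
}

# Usage against a live node (assumption: localhost:9200):
#   curl -s 'localhost:9200/_cluster/settings?flat_settings=true&pretty' | allocation_from_settings
```

In the situation described in this thread, the output would include a persistent=none line, and after the restart there would be no transient line overriding it.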

The precedence of settings is described here:
https://www.elastic.co/guide/en/elasticsearch/reference/6.1/cluster-update-settings.html#_precedence_of_settings
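
The documented order is transient cluster settings first, then persistent cluster settings, then elasticsearch.yml, then the built-in default. The sketch below models that order and is not Elasticsearch code. effective_value is a hypothetical name, and an empty argument stands for "not set". With the values from this thread, it reproduces what happened. While transient all was set, the cluster allocated normally. After the restart cleared the transient settings, the stored persistent none took effect.

```shell
#!/usr/bin/env bash
# Hypothetical model of cluster-setting precedence; an empty string means "not set".
# $1 = transient, $2 = persistent, $3 = elasticsearch.yml, $4 = built-in default
effective_value() {
  local v
  for v in "$1" "$2" "$3"; do
    if [ -n "$v" ]; then
      printf '%s\n' "$v"   # first non-empty value in precedence order wins
      return 0
    fi
  done
  printf '%s\n' "$4"       # nothing set anywhere: fall back to the default
}

# Before the restart: transient "all" overrides persistent "none".
effective_value all none "" all     # prints: all
# After a full cluster restart the transient value is gone.
effective_value "" none "" all      # prints: none
```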

Thank you.
