Elasticsearch and Portworx

Hi All,

Has anybody had success with using these two technologies together?

We are running ES in a docker container, orchestrated by Hashicorp Nomad, and have successfully instantiated ES using portworx volumes. The problem arises in that when we cycle our nodes, the shards have to be re-initialized.

Prior to the instantiation of the new container on a new node, we disable shard allocation and do a synced flush. The theory is that the new node just sees the data in the portworx volume, and leads to a minimized yellow state in the cluster. However it seems after we bring the new ES instance online, ES seems to do a full rebalance, instead of seeing the existing shards on the node already.

Not that it's super relevant but we've had great success using this same technology cycling out and replacing galera cluster containers.

Anybody running Portworx and ES on DCOS or Kubernetes have this issue? Anybody out there running ES, Portworx, and Nomad have this figured out?

Thanks!

Which version of Elasticsearch are you running?

Hi Christian,

We are running the elasticsearch-oss:6.7.1 container.

Thanks for looking at this thread!

I'm not sure exactly what you mean by "full rebalance". Can you elaborate, and give details (e.g. logs or API responses) to show what you're seeing?

Hi David,

I've tried my best to study the problem and have some better information for you. First I have to describe portworx. It's a tool that allows for stateful containers. If you restart a container, that leverages the portworx docker driver, the data also returns when the container is restarted, or moved.

I've validated that portworx holds up it's end of the bargain and the data is not lost and returns with the container. So here is what I am seeing.

First I put the cluster into a state where cluster.routing.allocation.enable is 'none', and I do a synced flush. This is the state of cluster.

{
  "cluster_name": "bpcw-dev-cluster",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 513,
  "active_shards": 1026,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}

Next the node is restarted, the cluster state becomes as expected

{
  "cluster_name": "bpcw-dev-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 513,
  "active_shards": 684,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 342,
  "delayed_unassigned_shards": 342,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 66.66666666666666
}

During this period of time here are the logs from the container that was restarted. You can see the initial change cluster state to none.

[2019-10-10T17:56:58,980][INFO ][o.e.c.s.ClusterSettings  ] [bpcw-node-2] updating [cluster.routing.allocation.enable] from [all] to [none]
[2019-10-10T18:00:24,358][INFO ][o.e.n.Node               ] [bpcw-node-2] stopping ...

This part loops a few times 
[2019-10-10T18:00:24,541][ERROR][i.n.u.c.D.rejectedExecution] [bpcw-node-2] Failed to submit a listener notification task. Event loop shut down?
java.util.concurrent.RejectedExecutionException: event executor terminated

Node is restarted and coming back up
[2019-10-10T18:01:36,841][INFO ][o.e.c.s.ClusterSettings  ] [bpcw-node-2] updating [cluster.routing.allocation.enable] from [all] to [none]
[2019-10-10T18:02:03,298][WARN ][o.e.n.Node               ] [bpcw-node-2] timed out while waiting for initial discovery state - timeout: 30s
[2019-10-10T18:02:03,313][INFO ][o.e.h.n.Netty4HttpServerTransport] [bpcw-node-2] publish_address {127.0.0.1:27959}, bound_addresses {127.0.0.1:27959}
[2019-10-10T18:02:03,313][INFO ][o.e.n.Node               ] [bpcw-node-2] started
[2019-10-10T18:02:41,078][WARN ][o.e.c.s.ClusterApplierService] [bpcw-node-2] cluster state applier task [apply cluster state (from master [master {bpcw-node-0}{mbL724baT0Keoq3KgbLo4w}{LN3f8aHRRuiBnCHjv3Ei9w}{10.176.16.115}{10.176.16.115:9300} committed version [16763]])] took [1m] above the warn threshold of 30s

Part 2.

These are the corresponding master node logs during this time frame starting from the intial set to none.

[2019-10-10T17:56:59,070][INFO ][o.e.c.s.ClusterSettings  ] [bpcw-node-0] updating [cluster.routing.allocation.enable] from [all] to [none]
[2019-10-10T17:57:06,536][INFO ][o.e.m.j.JvmGcMonitorService] [bpcw-node-0] [gc][15074] overhead, spent [265ms] collecting in the last [1s]
[2019-10-10T17:57:25,378][WARN ][o.e.i.f.SyncedFlushService] [bpcw-node-0] [logstash-2019.10.10][1] can't to issue sync id [1CCLb_xdRmWoaoeLUDsLUg] for out of sync replica [[logstash-2019.10.10][1], node[JOjvG-OgQB6CbLubR-7CMw], [R], s[STARTED], a[id=ZHYr3o0OQOKdmRpy0TGs1g]] with num docs [120375]; num docs on primary [120355]

[2019-10-10T18:00:24,536][WARN ][o.e.a.b.TransportShardBulkAction] [bpcw-node-0] [[filebeat-6.7.1-2019.10.10][0]] failed to perform indices:data/write/bulk[s] on replica [filebeat-6.7.1-2019.10.10][0], node[ox1nA_KUSdGhAYL0KtJYTA], [R], s[STARTED], a[id=_h20wRHyTlKTxC0UA_H8JQ]
org.elasticsearch.transport.NodeDisconnectedException: [bpcw-node-2][10.176.16.128:9300][indices:data/write/bulk[s][r]] disconnected
[2019-10-10T18:00:24,903][INFO ][o.e.c.s.MasterService    ] [bpcw-node-0] zen-disco-node-left({bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300}), reason(left)[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300} left], reason: removed {{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300},}
[2019-10-10T18:00:24,967][INFO ][o.e.c.s.ClusterApplierService] [bpcw-node-0] removed {{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300},}, reason: apply cluster state (from master [master {bpcw-node-0}{mbL724baT0Keoq3KgbLo4w}{LN3f8aHRRuiBnCHjv3Ei9w}{10.176.16.115}{10.176.16.115:9300} committed version [16759] source [zen-disco-node-left({bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300}), reason(left)[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{4hD5t1ODQiipG5Ynwelf3Q}{10.176.16.128}{10.176.16.128:9300} left]]])
[2019-10-10T18:00:24,988][INFO ][o.e.c.r.DelayedAllocationService] [bpcw-node-0] scheduling reroute for delayed shards in [59.4s] (342 delayed shards)
[2019-10-10T18:00:24,994][WARN ][o.e.c.r.a.AllocationService] [bpcw-node-0] [logstash-2019.10.10][2] marking unavailable shards as stale: [TK-zMYJcQbmW1caUGc1X_Q]
[2019-10-10T18:00:24,995][WARN ][o.e.c.r.a.AllocationService] [bpcw-node-0] [filebeat-6.7.1-2019.10.10][0] marking unavailable shards as stale: [_h20wRHyTlKTxC0UA_H8JQ]
[2019-10-10T18:00:24,995][WARN ][o.e.c.r.a.AllocationService] [bpcw-node-0] [filebeat-6.7.1-2019.10.10][2] marking unavailable shards as stale: [Deu2-x7_R9ul7cXJyeulIg]
[2019-10-10T18:00:25,311][INFO ][o.e.m.j.JvmGcMonitorService] [bpcw-node-0] [gc][15272] overhead, spent [339ms] collecting in the last [1s]
[2019-10-10T18:00:26,353][WARN ][o.e.t.OutboundHandler    ] [bpcw-node-0] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.176.16.128:41960}]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-10-10T18:00:27,359][WARN ][o.e.t.OutboundHandler    ] [bpcw-node-0] send message failed [channel: Netty4TcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.176.16.128:41954}]
java.nio.channels.ClosedChannelException: null
	at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]
[2019-10-10T18:00:28,249][WARN ][o.e.c.r.a.AllocationService] [bpcw-node-0] [logstash-2019.10.10][0] marking unavailable shards as stale: [Yn4JSvWpR_CggQ5GqjsRww]

[2019-10-10T18:01:36,637][INFO ][o.e.c.s.MasterService    ] [bpcw-node-0] zen-disco-node-join[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300}], reason: added {{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300},}

[2019-10-10T18:02:06,677][WARN ][o.e.d.z.PublishClusterStateAction] [bpcw-node-0] timed out waiting for all nodes to process published state [16763] (timeout [30s], pending nodes: [{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300}])
[2019-10-10T18:02:06,677][INFO ][o.e.c.s.ClusterApplierService] [bpcw-node-0] added {{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300},}, reason: apply cluster state (from master [master {bpcw-node-0}{mbL724baT0Keoq3KgbLo4w}{LN3f8aHRRuiBnCHjv3Ei9w}{10.176.16.115}{10.176.16.115:9300} committed version [16763] source [zen-disco-node-join[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300}]]])
[2019-10-10T18:02:06,701][WARN ][o.e.c.s.MasterService    ] [bpcw-node-0] cluster state update task [zen-disco-node-join[{bpcw-node-2}{ox1nA_KUSdGhAYL0KtJYTA}{3AL0nv_STPK-YMLiGmBEQQ}{10.176.16.128}{10.176.16.128:9300}]] took [30s] above the warn threshold of 30s

After this I set cluster.routing.allocation.enable to all. All nodes register this and there are no more errors. The cluster state looks like this.

{
  "cluster_name": "bpcw-dev-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 513,
  "active_shards": 684,
  "relocating_shards": 0,
  "initializing_shards": 2,
  "unassigned_shards": 340,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 1,
  "number_of_in_flight_fetch": 954,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 66.66666666666666
}

The config files are programmatically built, but this is an example of elasticsearch.yml

cluster.name: bpcw-dev-cluster
node.name: bpcw-node-2
network.host: 127.0.0.1
transport.host: 10.176.16.128
http.port: 27959
discovery.type: zen
discovery.zen.ping.unicast.hosts: elasticsearch-cluster.service.consul
discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 121s
path.data: /usr/share/elasticsearch/data  

The expected behavior is there should be no initializing shards, or unassigned shards. The cluster eventually returns to green, but the entire benefit of portworx is lost.

That's not really what you should expect when a node rejoins the cluster. Each replica starts out unassigned (if it used to be assigned to the node that left the cluster) and must be assigned and go through the same basic initialisation (a.k.a. recovery) process before becoming active. Elasticsearch limits the number of concurrent recoveries to avoid overwhelming the cluster. Normally the recovery process should be pretty quick if there have been no changes since Elasticsearch can re-use the data that's already on disk. However I see some messages saying marking unavailable shards as stale indicating that these shards have been changed while the replicas were offline, and therefore take more work to recover as they will no longer be synced-flushed.

This isn't the full story, and improving recovery speed is an ongoing area of work even today, but hopefully this gives you a more accurate picture of what's going on. You may find the index recovery API gives you a more useful picture of how these shards are recovering in your cluster.

1 Like