Failing shards during indexing. Storage issue or misconfiguration? Using Kubernetes

Hi,

I've set up Kubernetes on bare metal and am trying to run a three-node cluster there. In my dev environment I am currently using GlusterFS, which also runs in Kubernetes, for storage.

I am getting these kinds of errors:

"type":"server",
   "timestamp":"2019-07-11T13:03:12,139+0000",
   "level":"WARN",
   "component":"o.e.c.r.a.AllocationService",
   "cluster.name":"poc",
   "node.name":"poc-es-master-1",
   "cluster.uuid":"dXOiKR5_Qsu1ZSSsQ8-8qw",
   "node.id":"rQnFrxrFSsiXgGmshVtGGg",
   "message":"failing shard [failed shard, shard [plx_session-2019.w28][0], node[rQnFrxrFSsiXgGmshVtGGg], [R], s[STARTED], a[id=GbmKjl78SnS7IUiM-G23-Q], message [failed to perform indices:data/write/bulk[s] on replica [plx_session-2019.w28][0], node[rQnFrxrFSsiXgGmshVtGGg], [R], s[STARTED], a[id=GbmKjl78SnS7IUiM-G23-Q]], failure [RemoteTransportException[[poc-es-master-1][192.168.198.184:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[translog is already closed]; ], markAsStale [true]]"
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [poc-es-master-1][192.168.198.184:9300][indices:data/write/bulk[s][r]]",
"Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed",
"at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1778) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.translog.Translog.add(Translog.java:535) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:872) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:789) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:762) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:726) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:416) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:386) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:373) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:79) ~[elasticsearch-7.1.1.jar:7.1.1]",
...

Here at paste.ee I've placed the log file and the Kubernetes StatefulSet configuration I am using.

Is this issue triggered by the underlying storage provider GlusterFS, or is something misconfigured in my ES cluster that has nothing to do with the underlying storage?

Thanks,
Andreas

Hi @asp,

on the surface this looks like a storage problem. It seems the node thinks some of its files got truncated or otherwise corrupted behind its back.

As far as I can see, the configuration starts 3 nodes (k8s is not my strong suit). Do they all have similar issues in their logs?
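One quick way to compare the pods is to dump each pod's log (e.g. with `kubectl logs poc-es-master-<n> > poc-es-master-<n>.log`) and count how often the translog exception appears per node. A minimal sketch, using hypothetical inline excerpts in place of the real log files:

```python
import re

# Hypothetical excerpts standing in for the dumped log files;
# in practice read each file's contents instead.
logs = {
    "poc-es-master-0": "o.e.c.r.a.AllocationService ... failing shard\n"
                       "AlreadyClosedException: translog is already closed",
    "poc-es-master-1": "AlreadyClosedException: translog is already closed",
    "poc-es-master-2": "started",
}

# Count the translog exception per pod; if only one node shows it,
# the problem may be local to that node's volume.
for pod, text in sorted(logs.items()):
    hits = len(re.findall(r"AlreadyClosedException", text))
    print(f"{pod}: {hits} translog error(s)")
```

If all three pods show the same errors, that points more toward a systemic storage issue than a single bad node.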

I wonder if you could be using the same path/mount for all nodes? If GlusterFS does not support, or is not configured for, proper file locking, weird things could certainly happen. It is recommended to have a separate data path per node.
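You can check this from the cluster itself: `GET _nodes/stats/fs` reports each node's data path and mount. A minimal sketch of the check, using a small hypothetical sample of that response (in practice fetch it with `curl -s localhost:9200/_nodes/stats/fs` after port-forwarding a pod):

```python
import json

# Hypothetical, trimmed-down _nodes/stats/fs response for two nodes.
sample = """
{"nodes": {
  "a": {"name": "poc-es-master-0",
        "fs": {"data": [{"path": "/usr/share/elasticsearch/data/nodes/0",
                         "mount": "/usr/share/elasticsearch/data (gluster-vol-0)"}]}},
  "b": {"name": "poc-es-master-1",
        "fs": {"data": [{"path": "/usr/share/elasticsearch/data/nodes/0",
                         "mount": "/usr/share/elasticsearch/data (gluster-vol-1)"}]}}
}}
"""

nodes = json.loads(sample)["nodes"]
mounts = {n["name"]: n["fs"]["data"][0]["mount"] for n in nodes.values()}
for name, mount in sorted(mounts.items()):
    print(f"{name} -> {mount}")

# Two nodes reporting the same mount would mean they write to the same
# volume, which Elasticsearch does not support.
shared = len(set(mounts.values())) < len(mounts)
print("shared volume detected:", shared)
```

The paths can look identical inside the containers; what matters is that the mounts resolve to distinct volumes underneath.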

I will check if all nodes have similar errors.

Each node has its own volume, so the data is in fact logically separated per node.