Failing shards during indexing. Storage issue or misconfiguration? Using Kubernetes

Hi,

I've set up Kubernetes on bare metal and am trying to run a three-node cluster there. In my dev environment I am currently using GlusterFS, which also runs in Kubernetes, for storage.

I am getting these kinds of errors:

"type":"server",
   "timestamp":"2019-07-11T13:03:12,139+0000",
   "level":"WARN",
   "component":"o.e.c.r.a.AllocationService",
   "cluster.name":"poc",
   "node.name":"poc-es-master-1",
   "cluster.uuid":"dXOiKR5_Qsu1ZSSsQ8-8qw",
   "node.id":"rQnFrxrFSsiXgGmshVtGGg",
   "message":"failing shard [failed shard, shard [plx_session-2019.w28][0], node[rQnFrxrFSsiXgGmshVtGGg], [R], s[STARTED], a[id=GbmKjl78SnS7IUiM-G23-Q], message [failed to perform indices:data/write/bulk[s] on replica [plx_session-2019.w28][0], node[rQnFrxrFSsiXgGmshVtGGg], [R], s[STARTED], a[id=GbmKjl78SnS7IUiM-G23-Q]], failure [RemoteTransportException[[poc-es-master-1][192.168.198.184:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[translog is already closed]; ], markAsStale [true]]"
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [poc-es-master-1][192.168.198.184:9300][indices:data/write/bulk[s][r]]",
"Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed",
"at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1778) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.translog.Translog.add(Translog.java:535) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:872) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:789) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:762) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:726) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:416) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:386) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:373) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:79) ~[elasticsearch-7.1.1.jar:7.1.1]",
...

Here at paste.ee I've placed the log file and the Kubernetes StatefulSet configuration I am using.

Is this issue triggered by the underlying storage provider GlusterFS, or is something misconfigured in my ES cluster that has nothing to do with the underlying storage?

Thanks,
Andreas

Hi @asp,

on the surface this looks like a storage problem. It seems the node thinks some of its files got truncated or otherwise corrupted behind its back.

As far as I can see, the configuration starts 3 nodes (k8s is not my strong suit). Do they all have similar issues in their logs?
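One quick way to compare the pods is to dump each pod's log (e.g. with `kubectl logs poc-es-master-<n> > poc-es-master-<n>.log`) and count how often the translog exception appears per node. A minimal sketch, using hypothetical inline excerpts in place of the real log files:

```python
import re

# Hypothetical excerpts standing in for the dumped log files;
# in practice read each file's contents instead.
logs = {
    "poc-es-master-0": "o.e.c.r.a.AllocationService ... failing shard\n"
                       "AlreadyClosedException: translog is already closed",
    "poc-es-master-1": "AlreadyClosedException: translog is already closed",
    "poc-es-master-2": "started",
}

# Count the translog exception per pod; if only one node shows it,
# the problem may be local to that node's volume.
for pod, text in sorted(logs.items()):
    hits = len(re.findall(r"AlreadyClosedException", text))
    print(f"{pod}: {hits} translog error(s)")
```

If all three pods show the same errors, that points more toward a systemic storage issue than a single bad node.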

I wonder if you could be using the same path/mount for all nodes? If GlusterFS does not support, or is not configured for, proper file locking, weird things could certainly happen. It is recommended to have a separate data path per node.
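You can check this from the cluster itself: `GET _nodes/stats/fs` reports each node's data path and mount. A minimal sketch of the check, using a small hypothetical sample of that response (in practice fetch it with `curl -s localhost:9200/_nodes/stats/fs` after port-forwarding a pod):

```python
import json

# Hypothetical, trimmed-down _nodes/stats/fs response for two nodes.
sample = """
{"nodes": {
  "a": {"name": "poc-es-master-0",
        "fs": {"data": [{"path": "/usr/share/elasticsearch/data/nodes/0",
                         "mount": "/usr/share/elasticsearch/data (gluster-vol-0)"}]}},
  "b": {"name": "poc-es-master-1",
        "fs": {"data": [{"path": "/usr/share/elasticsearch/data/nodes/0",
                         "mount": "/usr/share/elasticsearch/data (gluster-vol-1)"}]}}
}}
"""

nodes = json.loads(sample)["nodes"]
mounts = {n["name"]: n["fs"]["data"][0]["mount"] for n in nodes.values()}
for name, mount in sorted(mounts.items()):
    print(f"{name} -> {mount}")

# Two nodes reporting the same mount would mean they write to the same
# volume, which Elasticsearch does not support.
shared = len(set(mounts.values())) < len(mounts)
print("shared volume detected:", shared)
```

The paths can look identical inside the containers; what matters is that the mounts resolve to distinct volumes underneath.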

I will check if all nodes have similar errors.

Each node has its own volume, so the data is in fact logically separated per node.