Hi everyone!
I have an ECK-operator-managed Elasticsearch cluster of 3 nodes that I am trying to bring up with new persistent volume claims bound to the old persistent volumes from a previous ES cluster.
To do this I deployed an Elasticsearch manifest with a volumeClaimTemplates section requesting 1 TiB PVCs for each node. Once the cluster was running, I deleted the newly created PVCs and their PVs and created replacement PVCs pointing at the old PVs. Once those PVCs were bound to the old PVs, I restarted each ES node to pick up the data on the old volumes, but node-2, the last node, fails to start.
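For reference, this is roughly the shape of the replacement PVCs I created (the volumeName below is a placeholder for the old PV's name; as I understand it, the PV has to be Available, i.e. claimRef cleared and reclaim policy Retain, for the pre-binding to work):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # name follows the StatefulSet convention <claim template>-<statefulset>-<ordinal>
  name: elasticsearch-data-elasticsearch-prod-backup-es-default-2
  namespace: eck-prod-backup
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: ssd
  volumeMode: Filesystem
  # pre-bind to the old PV (placeholder name); the PV must be Available
  # and should have persistentVolumeReclaimPolicy: Retain
  volumeName: <old-pv-name-for-node-2>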
Node-2 has the following error:
java.io.UncheckedIOException: Failed to load persistent cache
Likely root cause: java.io.IOException: No space left on device
at java.base/sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at java.base/sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62)
at java.base/sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113)
at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:79)
at java.base/sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:285)
at java.base/java.nio.channels.Channels.writeFullyImpl(Channels.java:74)
at java.base/java.nio.channels.Channels.writeFully(Channels.java:97)
at java.base/java.nio.channels.Channels$1.write(Channels.java:172)
at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:416)
at java.base/java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at org.apache.lucene.store.OutputStreamIndexOutput.getChecksum(OutputStreamIndexOutput.java:80)
at org.apache.lucene.codecs.CodecUtil.writeCRC(CodecUtil.java:569)
at org.apache.lucene.codecs.CodecUtil.writeFooter(CodecUtil.java:393)
at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:582)
at org.apache.lucene.index.SegmentInfos.write(SegmentInfos.java:485)
at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:803)
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5088)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3461)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3771)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3729)
at org.elasticsearch.xpack.searchablesnapshots.cache.full.PersistentCache$CacheIndexWriter.commit(PersistentCache.java:524)
at org.elasticsearch.xpack.searchablesnapshots.cache.full.PersistentCache.repopulateCache(PersistentCache.java:261)
at org.elasticsearch.xpack.searchablesnapshots.cache.full.CacheService.doStart(CacheService.java:201)
at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:48)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1087)
at org.elasticsearch.node.Node.start(Node.java:801)
at org.elasticsearch.bootstrap.Bootstrap.start(Bootstrap.java:311)
<<<truncated>>>
I'm not sure what "No space left on device" means here exactly, since the persistent disk allocated to this node is 1 TB, roughly 500x more space than it originally had in the old ES cluster. Additionally, the persistent volume in k8s is set to 1 TiB:
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: ssd
  volumeMode: Filesystem
  volumeName: pvc-9dd48c27-4713-4453-a5e9-9334b0fd0e28
status:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 1Ti
  phase: Bound
and the corresponding PVC is set to 1 TiB too:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    common.k8s.elastic.co/type: elasticsearch
    elasticsearch.k8s.elastic.co/cluster-name: elasticsearch-prod-backup
    elasticsearch.k8s.elastic.co/statefulset-name: elasticsearch-prod-backup-es-default
  name: elasticsearch-data-elasticsearch-prod-backup-es-default-1
  namespace: eck-prod-backup
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Ti
  storageClassName: ssd
  volumeMode: Filesystem
  volumeName: pvc-3fc7c42b-6d1a-4f2b-aa69-eb6d67aee425
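To rule out the data filesystem actually being full (the stack trace is a plain "No space left on device" while repopulating the searchable snapshot persistent cache), one thing I can still check is what the mounted volume reports from inside the failing pod, something like:

  kubectl exec -n eck-prod-backup elasticsearch-prod-backup-es-default-2 -- df -h /usr/share/elasticsearch/data

(pod name inferred from the StatefulSet name above; /usr/share/elasticsearch/data is the default ECK data path).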
I'm thinking it's possible I missed some ECK-operator-related metadata, such as a label or annotation on the above PVC, which could cause node-2 to not find it and therefore report that no space is available. But that's just a thought.
The other 2 ES pods (node-0 and node-1) picked up their PVCs fine and are running, though they log the following error:
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-kibana-7-2021.07.05][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-kibana-7-2021.07.05][0]] containing [2] requests]]"
which I'm assuming is because node-2 is failing to start.
I would greatly appreciate any ideas on this!
Thank you for reading all of this, it was a lengthy post!