ES 8.4.3: node disconnected after disk is full

I followed the Helm chart example and created a 3-node Elasticsearch 8.4.3 cluster. It was running OK until the data volume ran out of space.

The first issue I encountered was that I could not open Kibana. I deleted one of the biggest indices with -X DELETE through the API, and the problem was fixed after a short period.
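For reference, this was just the standard delete index API call, something like the following (the index name here is only a placeholder, not my real one):

/ $ curl -k -u elastic:xxxxxxxx -X DELETE "https://elasticsearch-master:9200/my-big-index-000001"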

This morning, I noticed that one non-master node's status is still Running but shown in blue, with the message "Readiness probe failed." It appears the node is offline and no longer accessible by the master. Then I found the disk is full again. I believe that's because a number of RELOCATING shards moved lots of data to this node.
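For what it's worth, checks along these lines should show per-node disk usage and the disk watermark settings that drive relocation (same host and credentials as in the outputs below; just a sketch, the grep is only to trim the output):

/ $ curl -k -u elastic:xxxxxxxx -X GET "https://elasticsearch-master:9200/_cat/allocation?v=true&pretty"
/ $ curl -sk -u elastic:xxxxxxxx "https://elasticsearch-master:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep watermark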

I feel this is a bug, but I'd like to confirm before submitting it on GitHub. Please let me know if more info should be provided.

In addition, is there a graceful way to purge/empty the data volume in this situation? I looked at the bin directory and couldn't figure out which tool might help.

By the way, I realized that, since the default settings were used, the default ILM policy rolls over at 50gb, but the Helm chart example only provisions a 30gb data volume, so the lifecycle policy was never triggered.
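If my reading is right, lowering the rollover threshold below the volume size would be one way to avoid this. A sketch, assuming the data stream uses the built-in metrics policy (the 10gb value is only an example):

curl -k -u elastic:xxxxxxxx -X PUT "https://elasticsearch-master:9200/_ilm/policy/metrics" -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "10gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}'

In practice I would GET the existing policy first and only change the rollover values, since PUT replaces the whole policy.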

/ $ curl -k -u elastic:elasticsearch -X GET "https://elasticsearch-master:9200/_cat/nodes?v=true&pretty"
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
10.42.15.185           55          63   2    0.03    0.07     0.08 cdfhilmrstw -      elasticsearch-master-0
10.42.5.129            50          64   4    0.03    0.07     0.10 cdfhilmrstw *      elasticsearch-master-2
elasticsearch@elasticsearch-master-1:~$ df
Filesystem                                             1K-blocks     Used Available Use% Mounted on
overlay                                                108241468 49320008  53400052  49% /
tmpfs                                                      65536        0     65536   0% /dev
tmpfs                                                    8164656        0   8164656   0% /sys/fs/cgroup
/dev/nvme0n1p3                                         108241468 49320008  53400052  49% /etc/hosts
shm                                                        65536        0     65536   0% /dev/shm
/dev/longhorn/pvc-8e3c241c-092f-456f-9e8b-255157fd3d25  30832548 30816164         0 100% /usr/share/elasticsearch/data
tmpfs                                                   16329316       12  16329304   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                                                   16329316       16  16329300   1% /usr/share/elasticsearch/config/certs
tmpfs                                                    8164656        0   8164656   0% /proc/acpi
tmpfs                                                    8164656        0   8164656   0% /proc/scsi
tmpfs                                                    8164656        0   8164656   0% /sys/firmware
Wed Dec 14 06:30:35 UTC 2022,.ds-metrics-xxxxxx-default-2022.12.11-000002,0,p,RELOCATING,167670950,24.2gb,10.42.15.185,elasticsearch-master-0,->,10.42.16.243,dnlqpEz8SCCV3znZ63Vj4A,elasticsearch-master-1

Welcome to our community! :smiley:

You need to delete data via the APIs, not the disk, so try removing more indices.
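For example, something like this lists indices by size so you can pick what to remove, then use the delete index API on the chosen ones as you did before:

/ $ curl -k -u elastic:xxxxxxxx -X GET "https://elasticsearch-master:9200/_cat/indices?v=true&s=store.size:desc&h=index,pri,rep,store.size"

Note that a data stream's current write index can't be deleted directly; you would delete or roll over the data stream instead.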

Since the node is not accessible by the master, I cannot delete the data via the API. As you can see, the node is not even listed by the API (the first curl command result).

In case I misunderstood your comments, could you please clarify what the command would be to delete the data via the APIs?

I was monitoring the size by running the following command every minute: curl -k -u elastic:xxxxxxxx -X GET "https://elasticsearch-master:9200/_cat/shards?v=true&pretty"

The last response that still included the node elasticsearch-master-1 was a list of small indices plus the following two big ones. After that, the -1 node was no longer listed by this API command. To my understanding, there is no way to delete the index on the -1 node anymore in this scenario. Please confirm.

.ds-metrics-apm.app.xxxxx-default-2022.12.11-000002        0     r      RELOCATING 191355419   23.8gb 10.42.5.129  elasticsearch-master-2 -> 10.42.15.185 h5vBT_YSRHqKthPnyngu3Q elasticsearch-master-0
.ds-metrics-apm.app.xxxxx-default-2022.12.11-000002        0     p      STARTED    191472994   28.6gb 10.42.16.243 elasticsearch-master-1

1 minute later, it became,

.ds-metrics-apm.app.xxxxx-default-2022.12.11-000002        0     p      STARTED    191573442  23.8gb 10.42.5.129  elasticsearch-master-2
.ds-metrics-apm.app.xxxxx-default-2022.12.11-000002        0     r      UNASSIGNED

The next two commands (still at a 1-minute interval) each took about 30 seconds and returned an error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Then, the result became,

.ds-metrics-apm.app.xxxxx-default-2022.12.11-000002        0     p      STARTED      191598153   23.9gb 10.42.5.129  elasticsearch-master-2
.ds-metrics-apm.app.xxxxx-default-2022.12.11-000002        0     r      INITIALIZING                    10.42.15.185 elasticsearch-master-0

After about another 9 minutes, the initialization finished, and it became,

.ds-metrics-apm.app.xxxxx-default-2022.12.11-000002        0     r      STARTED 192204992   24.1gb 10.42.15.185 elasticsearch-master-0
.ds-metrics-apm.app.xxxxx-default-2022.12.11-000002        0     p      STARTED 192113711   26.1gb 10.42.5.129  elasticsearch-master-2
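In case it helps anyone reading, I believe the allocation explain API would show why a shard is UNASSIGNED or where it is being moved. A sketch based on the output above (I haven't dug further with it):

curl -k -u elastic:xxxxxxxx -X GET "https://elasticsearch-master:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": ".ds-metrics-apm.app.xxxxx-default-2022.12.11-000002",
  "shard": 0,
  "primary": false
}'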

I would start here by removing your replicas and stabilising the cluster. It's a bit of a risk if you totally lose a node, but if you remove them and can add the other node back in, you can then remove some other indices to bring the load down and go from there.
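Roughly, removing replicas would look like this (a sketch; the wildcard covers regular indices, and expand_wildcards is there so hidden data stream backing indices are included too):

curl -k -u elastic:xxxxxxxx -X PUT "https://elasticsearch-master:9200/*/_settings?expand_wildcards=open,hidden" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 0
  }
}'

Replicas can be added back the same way once the cluster is stable again.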

Do you mean there is no way to bring the "elasticsearch-master-1" node back?

This is a 3-node cluster from Elasticsearch's point of view, and the other two nodes were still good when I noticed the issues (hours after node 1 got lost). I did delete the big indices on node 0 and node 2, which did not bring node 1 back (which is not a surprise).

Since I'm still learning and getting familiar with Elasticsearch, this is only a test setup and nothing is critical yet. I just hope this won't happen in production.

Again, should I report it as a bug?
