Node does not delete data from disk


One of my data nodes does not remove data from disk after I delete an index.

Another node which holds 100% of the same data does so without a problem.

The folders in the data directory do not get removed and don't change their size after deleting an index.

Restarting the node or the server did not help at all.

EDIT1: For now I can delete the data manually after deleting an index, since the node does not recognize that it still holds it. The data is also no longer accessible.
EDIT2: The logs do not say anything at all...



Exactly the same problem. After a restart, none of the replica data gets removed, which prevents new replicas from being allocated: the cluster is in yellow state, no relocation/recovery is happening, and there is nothing in the logs even at DEBUG level.

ES 1.6, and I don't want to move to the next version before the current cluster reaches green state with all data safely replicated and kept in sync.

I'm on 1.7.
My cluster is green. It's just that deletion is not possible.

Mine would be green too, but it hit the low disk watermark because of this problem :slight_smile:
So I think it's pretty crucial to find out the reason.

The only reason mine is green is that it started to ignore every watermark there is... :smiley:

Sorry to bother you @spinscale but this does look like an interesting bug.


Does lsof show any open deleted files from that index/shard?
What kind of storage is this? Local disk? Any weird filesystems in action?
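For example, something like this would show it (paths are just a guess at a default install; adjust to your data directory):

```shell
# Files that were deleted while a process still holds them open
# show up in lsof output with "(deleted)" appended to the path.
sudo lsof -p "$(pgrep -f org.elasticsearch)" | grep -i '(deleted)'

# Or scan the whole data directory of the node in question:
sudo lsof +D /var/lib/elasticsearch/nodes/0/indices | grep -i '(deleted)'
```

If either command prints lines for the deleted index, something is still holding the segment files open and the kernel cannot free the space.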


There are no open files from the supposedly deleted indices.

The storage is a little strange.
Local storage.
2 HDDs, RAID0 with mdadm -> physical volume1
1 HDD -> physical volume2
pv1+pv2 -> logical volume1 (lvm)

We had to go this way because we added one disk just recently and had problems expanding the existing RAID.
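Roughly, the layout was created like this (device names are from memory, treat them as approximate):

```shell
# Two HDDs striped into a RAID0 array with mdadm -> physical volume 1
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
pvcreate /dev/md0

# One bare HDD -> physical volume 2
pvcreate /dev/sdc

# Both PVs joined into one volume group, one logical volume spanning them
vgcreate vg_data /dev/md0 /dev/sdc
lvcreate -l 100%FREE -n lv_data vg_data
mkfs.ext4 /dev/vg_data/lv_data
```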

We have 8 TB of data and 2k shards, so I physically cannot check all such shards, but several probes showed that there are no open files from those directories.

Local storage, HDD, ext4

BTW, this just caused one of the indices to stop serving requests, since the replica quorum wasn't met =__=

We also run on ext4.

Just so I can compare it to our setup:
How much RAM does your server have?
How many HDDs does your server have, and of what size?

8 nodes, 64 GB RAM, 10 GB JVM Heap
1 HDD on each node, 2TB each

Right now, all nodes have reached the low disk watermark (100 GB).


Did you check the dmesg output and syslog to ensure there is no hardware issue? I assume writing to the disks works as expected with something like bonnie or dd?

I am not aware of any issues in that regard that were recently fixed or open on 1.x, but maybe I am missing something.


I built a script yesterday which matches the indices in the data directories against the output of the cat indices API and deletes (rm -rf on the files) those that should have been deleted.
For some reason the node is 100% working right now. I can delete data with the usual delete REST call.

The disk does not appear to be broken in any way, as everything is fine now... this is so strange.

FYI, we just started with our 2.x cluster. We hope to switch soon, so I don't mind if we do not find the reason for this problem. I just pinged you since this looked like a weird bug which may need to be fixed, if it really is one.
For now it's just very hard to find the reason, as there is no evidence of any kind which could point us in the right direction. And there are so many parameters playing a big role here; I could think of 20 small things which could have been responsible.


It'd be useful if you gave us the output of the delete command, including _cat/indices before and after, showing the index in question.
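Something along these lines (index name and data path are made up, substitute your own):

```shell
# Before: the index should be listed
curl -s 'localhost:9200/_cat/indices/my_index?v'

# The delete itself; expect {"acknowledged":true}
curl -s -XDELETE 'localhost:9200/my_index'

# After: the index should be gone from the cat output
curl -s 'localhost:9200/_cat/indices/my_index?v'

# And on the affected node, the directory should shrink/disappear
du -sh /var/lib/elasticsearch/nodes/0/indices/my_index
```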

At the moment you haven't given us anything concrete to work from.

I know.
The reason is that everything behaved normally. The only difference was that the data on one node did not get deleted.
The output of the delete call was the usual acknowledged true.
The logs did not say anything unusual.

But since this problem is solved now for some reason, I am not able to give you more input. Sorry!