We are experiencing an issue where Elasticsearch does not free disk space after relocating shards. Specifically, even though the shards have been relocated away from a node, the disk space they occupied is not released. The issue persists for several days and is only resolved by restarting Elasticsearch (when I restart a node, the space problem is resolved for that node). I also didn't find any WARN/ERROR logs for the corresponding period.
ES version: 6.8.13
Number of nodes: 30
Number of primary shards: 20
Number of replicas: 2
Index size: 498GB
Filesystem: network FS
Example for node2:
curl -s 'localhost:9200/_cat/shards?v' | grep node2
index_v1 14 p STARTED 3893062 7.7gb node2
index_v1 6 p STARTED 3889496 8.9gb node2
We can see 2 active shards for node2, which matches the expected count (20 primaries × 3 copies = 60 shard copies across 30 nodes, i.e. about 2 per node).
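To cross-check what the cluster thinks this node's shards occupy against the real on-disk usage, something like the following should work (a sketch; localhost:9200 is a placeholder for your actual endpoint). _cat/allocation reports disk.indices, i.e. the space the allocated shards account for, and _cat/indices can confirm that the oH_DBsbyTLW5w6W-MnvoDg directory really belongs to index_v1:
# disk space the allocated shards should account for vs. total disk used on node2
curl -s 'localhost:9200/_cat/allocation/node2?v&h=node,shards,disk.indices,disk.used,disk.avail'
# map the on-disk directory UUID back to an index name
curl -s 'localhost:9200/_cat/indices?v&h=index,uuid,pri,rep,store.size'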
du -xhd5 /path/to/elasticsearch/data
8.1G ./0/indices/oH_DBsbyTLW5w6W-MnvoDg/2/index
8.4G ./0/indices/oH_DBsbyTLW5w6W-MnvoDg/0/index
7.5G ./0/indices/oH_DBsbyTLW5w6W-MnvoDg/16/index
7.8G ./0/indices/oH_DBsbyTLW5w6W-MnvoDg/14/index - existing (allocated here per the API)
9.0G ./0/indices/oH_DBsbyTLW5w6W-MnvoDg/6/index - existing (allocated here per the API)
As you can see, the shards API reports 2 shards on this node, but on disk there are 5 shard directories. To make sure the extra files are not being held open by Elasticsearch or another process, I used lsof:
lsof +D /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 855621 elasticsearch mem REG 252,16 113790489 2097355 /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg/6/index/_4buc7_Lucene70_0.dvd
java 855621 elasticsearch mem REG 252,16 573036165 2097320 /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg/6/index/_47t1i_Lucene50_0.doc
java 855621 elasticsearch mem REG 252,16 256570581 1310905 /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg/14/index/_4bb4g.cfs
java 855621 elasticsearch mem REG 252,16 86633429 1311070 /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg/14/index/_4bmie_Lucene70_0.dvd
...
Only files belonging to shards 6 and 14 appear, i.e. only the two shards the API reports as allocated here.
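To enumerate the stale copies in one go, the shard directories on disk can be diffed against the shards the API reports for this node. A rough sketch (the endpoint and data path are placeholders; adjust the path to your actual nodes/0/indices location):
# list shard directories that exist on disk but are not allocated to node2 (sketch)
INDEX_DIR=/path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg
ALLOCATED=$(curl -s 'localhost:9200/_cat/shards/index_v1?h=shard,node' | awk '$2 == "node2" {print $1}')
for dir in "$INDEX_DIR"/[0-9]*; do
  shard=$(basename "$dir")
  echo "$ALLOCATED" | grep -qx "$shard" || echo "stale on disk: shard $shard ($(du -sh "$dir" | cut -f1))"
done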
Example of one of the problem shards (shard 0, which the API does not report for node2):
├── oH_DBsbyTLW5w6W-MnvoDg
│ ├── 0
│ │ ├── index
...
│ │ │ ├── _4cgsc.si
│ │ │ ├── _4cgsd.cfe
│ │ │ ├── _4cgsd.cfs
│ │ │ ├── _4cgsd.si
│ │ │ ├── _4cgse.cfe
│ │ │ ├── _4cgse.cfs
│ │ │ ├── _4cgse.si
│ │ │ ├── _4cgsf.cfe
│ │ │ ├── _4cgsf.cfs
│ │ │ ├── _4cgsf.si
│ │ │ ├── segments_7b
│ │ │ └── write.lock
│ │ ├── _state
│ │ │ ├── retention-leases-0.st
│ │ │ └── state-2.st
│ │ └── translog
...
│ │ ├── translog-2.ckp
│ │ ├── translog-2.tlog
│ │ └── translog.ckp
As far as I can tell, all the required files are present (segments_N, the .si/.cfe/.cfs segment files, _state and translog), so it looks like a complete, valid shard copy rather than leftover temporary files.
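It might also be worth asking the cluster which on-disk shard copies it still knows about. If I understand the shard stores API correctly, copies that are present on a node but no longer allocated there should be listed with allocation "unused" (a sketch; localhost:9200 is a placeholder):
# list the shard copies the cluster knows about on disk, per shard, including unallocated ones
curl -s 'localhost:9200/index_v1/_shard_stores?status=all&pretty'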
Summary:
- The shards API doesn't return 3 of the 5 shard directories present on node2's disk.
- Those files are not held open by Elasticsearch or any other process (checked with lsof).
- Restarting Elasticsearch on the node frees the space.
- It looks like Elasticsearch does some cleanup around startup that resolves this. Do you know what that is, and can it be triggered manually without restarting the node?
- Any ideas on how to debug this further are welcome (a sketch of what I plan to check is at the end of this post).
PS: we occasionally hit the same problem during force merges. Old ticket: Elasticsearch don't remove old shards - #2 by warkolm
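In case it helps with debugging, this is roughly what I plan to try next: compare the node's filesystem stats as reported by Elasticsearch with the OS view, and temporarily raise logging for the code paths that (as far as I understand) delete unallocated shard data after relocation. The logger package names below are my assumption about where that cleanup lives, not something I've confirmed:
# cluster's view of this node's filesystem usage, to compare with df/du
curl -s 'localhost:9200/_nodes/node2/stats/fs?pretty'
# temporarily raise logging for the shard-store cleanup code paths
# (package names are an assumption; revert by setting them back to null afterwards)
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "logger.org.elasticsearch.indices.store": "TRACE",
    "logger.org.elasticsearch.indices.cluster": "DEBUG"
  }
}'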