Elasticsearch Not Freeing Disk Space After Shard Relocation

We are experiencing an issue where Elasticsearch is not freeing up disk space after relocating shards. Even though the shards have been relocated away, the disk space is not released. The issue persists for several days and is only resolved by restarting Elasticsearch (restarting a node frees the space on that node). I also didn't find any WARN/ERROR logs for the corresponding period.

ES version: 6.8.13
number of nodes: 30
number of primary shards: 20
number of replicas: 2
index size: 498GB
filesystem: network FS


Example for node2:

curl /_cat/shards?v | grep node2
index_v1           14    p      STARTED 3893062   7.7gb  node2
index_v1           6     p      STARTED 3889496   8.9gb  node2

We can see two active shards for node2.

du -xhd5 /path/to/elasticsearch/data
8.1G	./0/indices/oH_DBsbyTLW5w6W-MnvoDg/2/index
8.4G	./0/indices/oH_DBsbyTLW5w6W-MnvoDg/0/index
7.5G	./0/indices/oH_DBsbyTLW5w6W-MnvoDg/16/index

7.8G	./0/indices/oH_DBsbyTLW5w6W-MnvoDg/14/index  <- reported by the API
9.0G	./0/indices/oH_DBsbyTLW5w6W-MnvoDg/6/index   <- reported by the API

As you can see, the shards API returns info about two shards, but on disk we actually have five. To make sure these files are not being used by ES or another process, I used lsof:

lsof +D /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg
COMMAND    PID          USER   FD   TYPE DEVICE   SIZE/OFF    NODE NAME
java    855621 elasticsearch  mem    REG 252,16  113790489 2097355 /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg/6/index/_4buc7_Lucene70_0.dvd
java    855621 elasticsearch  mem    REG 252,16  573036165 2097320 /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg/6/index/_47t1i_Lucene50_0.doc
java    855621 elasticsearch  mem    REG 252,16  256570581 1310905 /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg/14/index/_4bb4g.cfs
java    855621 elasticsearch  mem    REG 252,16   86633429 1311070 /path/to/elasticsearch/data/indices/oH_DBsbyTLW5w6W-MnvoDg/14/index/_4bmie_Lucene70_0.dvd
...

Only files from shards 6 and 14 are open.
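
For reference, the two views can be put side by side like this (the data path and index UUID are the ones from the du output above; adjust for your layout):

# shard directories that exist on disk for this index
ls -d /path/to/elasticsearch/data/0/indices/oH_DBsbyTLW5w6W-MnvoDg/*/

# shard copies the API reports for this node
curl -s '/_cat/shards?v' | grep node2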

Example of a problem shard:

├── oH_DBsbyTLW5w6W-MnvoDg
│   ├── 0
│   │   ├── index
            ...
│   │   │   ├── _4cgsc.si
│   │   │   ├── _4cgsd.cfe
│   │   │   ├── _4cgsd.cfs
│   │   │   ├── _4cgsd.si
│   │   │   ├── _4cgse.cfe
│   │   │   ├── _4cgse.cfs
│   │   │   ├── _4cgse.si
│   │   │   ├── _4cgsf.cfe
│   │   │   ├── _4cgsf.cfs
│   │   │   ├── _4cgsf.si
│   │   │   ├── segments_7b
│   │   │   └── write.lock
│   │   ├── _state
│   │   │   ├── retention-leases-0.st
│   │   │   └── state-2.st
│   │   └── translog
            ...
│   │       ├── translog-2.ckp
│   │       ├── translog-2.tlog
│   │       └── translog.ckp

As far as I can tell, all the required files are present, so it looks like a valid shard.


Summary:

  • we have 3 shards on node2 that the ES API doesn't report
  • these files are not held open by ES or any other process
  • restarting ES resolves the problem

  1. It seems like ES does something on startup that resolves the problem. Do you know what that is, and can I trigger it manually without restarting a node?
  2. Any ideas on how to debug this?

PS: during forced merges we had the same problem from time to time. Old ticket: Elasticsearch don't remove old shards - #2 by warkolm

That is a very old version. I would recommend you upgrade to a more recent and supported version, as this one has been EOL for a long time.

How many of your nodes are master eligible?

How are these nodes configured? Do you have the correct minimum_master_nodes set?

How are you relocating shards? Do you have a tiered architecture? If so, how is this set up and configured?

Thanks for the response :heart:

  1. How many of your nodes are master eligible? - all of them (mdi)
  2. Do you have the correct minimum_master_nodes set? - we don't have a split brain, if that's what you mean
  3. How are you relocating shards? - We had disk space problems on some of the nodes and ES started relocating shards on its own to fix that; we didn't trigger the relocation manually.

Given that you are on a very old version where this is a common issue that can have odd side effects, I always verify that it is correct before proceeding with troubleshooting. I have seen it incorrectly set far too many times, causing numerous issues. Can you please verify that you have it set to 16 (or greater)?

How many indices and shards do you have in the cluster? Are these reasonably uniform in size?

Are all nodes configured with the same amount of available disk space?
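
Both can be read from the cat APIs, for example:

# index-level shard counts and store sizes
curl -s '/_cat/indices?v&h=index,pri,rep,docs.count,store.size'

# per-node shard counts and disk usage
curl -s '/_cat/allocation?v'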

Got it:

curl -X GET "/_cluster/settings?include_defaults=true&pretty" -s | grep minimum_master_nodes

        "minimum_master_nodes" : "16",
index      min       max       avg
index_v12  458 GiB   851 GiB   562 GiB
index_v1   16.0 GiB  16.6 GiB  16.5 GiB
index_v3   674 MiB   678 MiB   678 MiB
index_v4   434 MiB   520 MiB   491 MiB
index_v5   3.44 MiB  3.55 MiB  3.51 MiB
.tasks     13.0 KiB  13.0 KiB  13.0 KiB

Is that the min, max and average index/shard sizes? If that is the case they are very uneven, which can cause problems in older versions of Elasticsearch as only shard count is considered when balancing. Given that one index is so much larger than the others, you may want to use the index.routing.allocation.total_shards_per_node setting to force a more even distribution of that index's shards, as sketched below.
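
Something along these lines, as a sketch (the index name and the limit are illustrative; with 20 primaries and 2 replicas spread over 30 nodes, a limit of 3 leaves some headroom above the 2 copies per node a perfectly even spread would give):

curl -X PUT '/index_v12/_settings' -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.total_shards_per_node": 3
}'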

This is just the min/max/avg size of each index over some period, in our case 3 days.

All nodes have roughly the same disk.indices size:

shards disk.indices disk.used disk.avail disk.total disk.percent node
    11       16.7gb    33.9gb     24.8gb     58.8gb           57 node27
    11       16.4gb    42.1gb     16.6gb     58.8gb           71 node2
    12       17.2gb    25.2gb     33.5gb     58.8gb           42 node24
    11       16.8gb    32.4gb     26.3gb     58.8gb           55 node25
    11       16.5gb    17.2gb     41.5gb     58.8gb           29 node18
    12       16.8gb    26.2gb     32.5gb     58.8gb           44 node10
    11       16.9gb    33.7gb       25gb     58.8gb           57 node29
    11       16.5gb    25.2gb     33.5gb     58.8gb           42 node12
    11         16gb    25.4gb     33.3gb     58.8gb           43 node26
    11       16.1gb    41.3gb     17.5gb     58.8gb           70 node8
    11       19.9gb    36.3gb     22.4gb     58.8gb           61 node5
    11       16.9gb    25.3gb     33.4gb     58.8gb           43 node23
    11       15.5gb    24.8gb     33.9gb     58.8gb           42 node17
    11         16gb    24.7gb       34gb     58.8gb           42 node6
    11         17gb    24.3gb     34.5gb     58.8gb           41 node3
    11       16.5gb    44.3gb     14.5gb     58.8gb           75 node22
    11       16.8gb    35.7gb       23gb     58.8gb           60 node1
    11         15gb      41gb     17.8gb     58.8gb           69 node21
    11       17.4gb    24.8gb     33.9gb     58.8gb           42 node13
    11       16.5gb      26gb     32.7gb     58.8gb           44 node14
    11       16.1gb    33.3gb     25.4gb     58.8gb           56 node28
    11       16.7gb    43.3gb     15.4gb     58.8gb           73 node15
    11       17.5gb    28.2gb     30.5gb     58.8gb           48 node9
    11       16.9gb    32.9gb     25.8gb     58.8gb           56 node7
    11       16.3gb    25.1gb     33.6gb     58.8gb           42 node0
    11       16.4gb    39.5gb     19.2gb     58.8gb           67 node4
    11       17.6gb    18.1gb     40.6gb     58.8gb           30 node20
    11       16.1gb    16.5gb     42.2gb     58.8gb           28 node16
    11       16.1gb    32.7gb       26gb     58.8gb           55 node19
    11       16.1gb    16.6gb     42.1gb     58.8gb           28 node11

The problem stems from the "invisible" shards:

shards disk.indices disk.used disk.avail disk.total disk.percent node
    11       16.7gb    33.9gb     24.8gb     58.8gb           57 node27

disk.indices 16.7gb
disk.used 33.9gb

The difference comes from shards that are present on the filesystem but not reported by the ES API.

I can't manage them because ES doesn't see these shards, so I can't use the API to move or delete them.
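
A minimal sketch of how these orphaned shard directories can be spotted from the node itself, assuming the data path layout shown earlier (the index name, node name and paths are the ones from the output above):

# shard numbers the API reports for this index on this node
curl -s '/_cat/shards?h=index,shard,node' \
  | awk '$1 == "index_v1" && $3 == "node2" {print $2}' | sort -u > /tmp/api_shards

# shard directories present on disk for this index UUID (_state holds index metadata, not a shard)
ls -d /path/to/elasticsearch/data/0/indices/oH_DBsbyTLW5w6W-MnvoDg/*/ \
  | awk -F/ '{print $(NF-1)}' | grep -vx '_state' | sort -u > /tmp/fs_shards

# shard directories on disk that the API does not report
comm -13 /tmp/api_shards /tmp/fs_shards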


I don't know the reason, but I do know that after a reboot the problem goes away. I guess ES runs something on startup that removes the old shards, and I'd like to trigger that via the API when the problem occurs, so I can avoid a restart.

Do you have any idea what else I can check to debug this?

PS: I thought ES was keeping those files open, so the filesystem only deleted them after the restart closed them. But lsof shows that nobody is holding them open.
PPS: we are on a network FS.
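
For completeness, lsof can also be asked specifically for deleted-but-still-open files (a standard lsof option, nothing ES-specific):

# open files under the data path whose link count is 0, i.e. unlinked but still held open by a process
lsof +L1 +D /path/to/elasticsearch/data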

Do you have any ongoing recoveries in the cluster?

Only with done status:

curl -s '/_cat/recovery?v&h=i,s,t,ty,st,shost,thost,f,fp,b,bp' | grep -v "done"

i                   s  t     ty          st   shost         thost         f   fp     b          bp

I got additional info from the state-*.st files.

0/indices/
├── oH_DBsbyTLW5w6W-MnvoDg
│   ├── 6 ****
│   │   ├── _state
│   │   │   └── state-1.st
│   │        {
│   │             'allocation_id': {
                    'id': 'TI7iyKXHSTqbdmeOp3N9zw' <- node2
                  },
│   │             'index_uuid': 'oH_DBsbyTLW5w6W-MnvoDg',
│   │             'primary': True
│   │         }
│   ├── 0
│   │   ├── _state
│   │       └── state-2.st
│   │        {
│   │             'allocation_id': {
                    'id': 'M0bfmY8yQtWOl9GU9e-Dzw',
                    'relocation_id': 'jRaf-9jqQ_KNRsgnk9Q74Q'
                  },
│   │             'index_uuid': 'oH_DBsbyTLW5w6W-MnvoDg',
│   │             'primary': False
│   │         }
│   ├── 16
│   │   ├── _state
│   │       └── state-2.st
│   │        {
│   │             'allocation_id': {
                    'id': 'eaNCWJAjR2afh00xPiaUDw',
                    'relocation_id': '_pa13y3HQMeTjbIRXtPw7A' <- node13
                  },
│   │             'index_uuid': 'oH_DBsbyTLW5w6W-MnvoDg',
│   │             'primary': True,
│   │         } 
│   ├── 2
│   │   ├── _state
│   │      └── state-2.st
│   │        {
│   │             'allocation_id': {
                    'id': 'ao6BbIu4QxCB7wDCrC6IiA',
                    'relocation_id': '9fWWJXRrQZOj-E4-fpNQKQ' <- node15
                },
│   │             'index_uuid': 'oH_DBsbyTLW5w6W-MnvoDg',
│   │             'primary': False,
│   │         } 
│   ├── 14 ****
│   │   ├── _state
│   │       └── state-1.st
│   │         {
│   │             'allocation_id': {
                    'id': 'iV_bwpv8QjybLJaQoVihZw' <- node2
                   }, 
│   │             'index_uuid': 'oH_DBsbyTLW5w6W-MnvoDg',
│   │             'primary': True
│   │         }
│   └── _state
│       └── state-144.st
│           {'index_v12': { 'in_sync_allocations': 
│                                 {'0': ['DuSqGNsyQUeHb5zxcfPs6w',
│                                       'WT2AsE18TWWj8BX3oPk4Fw',
│                                       'aD5sDc0MRKqJ2H7bbQS7jQ'],
│                                  '1': ['4ph_DUQFSXOJJPbdp7fNqA',
│                                       '2pFuW_qrSMySbXImamV8DA',
│                                       'z6uTgcpeSYy8QIyUGdkrNA'],
│                                  '10': ['2qweHEnmQKa7YsrzKZcM4A',
│                                        'RS9O7ZBGSG63LgSkiMWXew',
│                                        'er79CYi9Rl2cjR_EP6Dojg'],
│                                  '11': ['y-rVMUFzQEqXrCU_Xzt2zw',
│                                        'nL6s8eQdQ_GtE21eA6qSag',
│                                        'U3NCokXlQtSzyocknAOnsg'],
│                                  '12': ['x-c6E6cdTm-BmBtI4v6QFQ',
│                                        'R5fcQZmiQd2xYgrYJYBWHQ',
│                                        'jAbfrc6_QcO6BBab-XTA2Q'],
│                                  '13': ['qv98O1coTwu9qE0_zP3q6Q',
│                                        '3vwbfzJ2T1uoXe2_vbZcWw',
│                                        'F1MxMCSbRH2gaY6uUP0w5w'],
│                                  '14': ['cnSg2qV4S8CF28Z3GhW4yg',
│                                        'bDhBFCJvTF6Yx_Rse64FcA',
│                                        'iV_bwpv8QjybLJaQoVihZw'],  <--- node2
│                                  '15': ['PKyM8ngQT-SGSsBO5YDxAw',
│                                        'XMU_4kNeQc-oWfteK8HSiw',
│                                        'sbt5hDJJRtu3WKX3bI7mAw'],
│                                  '16': ['wYmXj7GIQXCyNtz7GBqukQ',
│                                        'Fo_k1V-JQoady9p8rRk9bQ',
│                                        '_pa13y3HQMeTjbIRXtPw7A'],
│                                  '17': ['3Vuj3KisRsmi9YUZZvGLdg',
│                                        'Pfj67Y95ReW9_vJfbAJ6LA',
│                                        'pXILaYiaR7KMDJdOgPxjyg'],
│                                  '18': ['xpfEwa3vRiSMIv8Hx1Mkhw',
│                                        'Geu2SlHJRf2hcH8X1dlM2Q',
│                                        'sKuj6vYWRw6K01YUw1_5Tg'],
│                                  '19': ['HMdBlYBBTCq5XjYrKTnbhA',
│                                        '-obg-9YvQ9azzxmTyPWqCg',
│                                        'UOO_AIAQSI2ZNvBLS2D5Dw'],
│                                  '2': ['lCboPkgySDefYDpVQbMZ6g',
│                                       'oh75j9pgSYyCrvN6J8Kvew',
│                                       '9fWWJXRrQZOj-E4-fpNQKQ'],
│                                  '3': ['AS6rytZITWSlBQ_Sx7JbGQ',
│                                       '1a6tQT2ASJWJn0hVYLmATA',
│                                       'RyOk5hfvSHSWQUrVmWeqBQ'],
│                                  '4': ['Ks6AyfGFQzeQUaRQdJHeOQ',
│                                       'A9Pb-zdiQEyZKsvn1sWBKA',
│                                       'QkwFg0vuTOK5FjT7YvkKog'],
│                                  '5': ['yOUcaN0XRJCfLEsF3VmOVg',
│                                       'MQJ8ONSXQeSNC03769u4rA',
│                                       '9iif8YcoRNKJZd4Gb0r7iA'],
│                                  '6': ['TI7iyKXHSTqbdmeOp3N9zw',  <--- node2
│                                       'i4M_r04hSqSOVpCtydmeew',
│                                       'V_R8wnbQQU-7kyfgWJaF2Q'],
│                                  '7': ['XLT3X9Z_TNKW8X46xp5Igg',
│                                       'TwM_qXb7SwixHBlFs30YjA',
│                                       'gCE5AoxvREemkgmBvmm8Lg'],
│                                  '8': ['YOTyP2mhSvKLNowtloz0ow',
│                                       'rBDG64s1Qkq_18r4ZE57oQ',
│                                       'a6Vu4eUiSl6YiPkBz7m6GA'],
│                                  '9': ['9r2V9H3AS3CPlE8Aexl4tA',
│                                       '0dlAqUHxS_qNULlyPX9ubg',
│                                       'x9Ykie0ZR96CL8aUhMHLaA']},
│            ...
│            'state': 'open',
│            'version': 154}}

As we can see, the invisible/outdated shard copies have a relocation_id field. If I understand correctly, this information should be visible here:

curl -X GET /_cluster/allocation/explain

But we don't see it there, which is expected since we don't have any unassigned shards.
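
The explain API can also be pointed at a specific shard copy with an explicit request body; a sketch, using the index name and a shard number from the output above:

curl -X GET '/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
{
  "index": "index_v1",
  "shard": 0,
  "primary": false
}'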


I used the cluster state API to compare it with the info I extracted from the *.st files.

curl '/_cluster/state?pretty'

The in_sync_allocations info and the states for shards 6 and 14 match.
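
To narrow the comparison down, the cluster state response can be filtered to just the relevant sections (standard metric/index filtering of the same API; the index name is the one from above):

curl -s '/_cluster/state/metadata,routing_table/index_v1?pretty'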

Shards:

  • 0 - no match for either allocation_id or relocation_id
  • 16 - successful relocation to node13 (matched by relocation_id)
  • 2 - successful relocation to node15 (matched by relocation_id)

I think shard 0 went through one more relocation afterwards, which is why there is no successful match by relocation_id.

The main question is: why are the shards still stuck on the filesystem after a successful move?

Do you have any idea why?

PS: each shard's state file starts with the same header bytes ?\xd7l\x17\x05state\x00\x00\x00\x01\x00\x00\x00\x01 :) which looks like state 1 1

I haven't found any errors or warnings over a long period of time, only:

  • TransportSearchAction
  • SearchPhaseExecutionException
  • RemoteTransportException

I tried to find the place where ES removes the shards and found the method IndicesClusterStateService.applyClusterState:

@Override
    public synchronized void applyClusterState(final ClusterChangedEvent event) {
        ...

        updateFailedShardsCache(state);

        deleteIndices(event); // also deletes shards of deleted indices

        removeIndices(event); // also removes shards of removed indices

        failMissingShards(state);

        removeShards(state);   // removes any local shards that doesn't match what the master expects

        updateIndices(event); // can also fail shards, but these are then guaranteed to be in failedShardsCache

        createIndices(state);

        createOrUpdateShards(state);
    }

To me that looks like exactly what I need, so I enabled debug logs and rebooted one of the nodes to check in the logs why this method was not cleaning up the old shards.
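
As a side note, that debug logger can be enabled at runtime through the cluster settings API, so a restart isn't needed just for the logging. A sketch (the logger name targets the class above):

curl -X PUT '/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.indices.cluster.IndicesClusterStateService": "DEBUG"
  }
}'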

And guess what:

If I understand correctly, after the reboot ES starts rebalancing the cluster and changes the cluster state; as a result IndicesClusterStateService.applyClusterState is called and the space is cleaned up (this is my guess).


When I returned the node to the cluster (after the reboot) and the cluster finished rebalancing, I found one more stuck shard again :sweat_smile: this time on node6.

  1. the first drop (17:15) is the moment when I restarted the node
  2. usage grew because one of the shards was moved to this node
  3. but it got stuck again after the node returned to the cluster

I've tried to trigger IndicesClusterStateService.applyClusterState by simply creating and deleting an index:

curl -X PUT temp_index
curl -X DELETE temp_index

I didn't see any effect for node6, but at that moment disk space was cleaned up on another node.

Summary:

  1. I didn't find out why the shards got stuck - any ideas?
  2. I may have found how to fix it without a node reboot - could you confirm that IndicesClusterStateService.applyClusterState runs after each index creation or deletion, and that this is what fixed my problem? Or do you think there is another explanation?
  3. I don't understand why I got one more stuck shard and why, after the state update, ES didn't free up its disk space. Perhaps ES doesn't remove shards immediately and needs some time. If so, could you tell me how ES decides when it is time to remove an old shard and how to monitor that? (I've been watching disk usage with the loop sketched below.)
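
A minimal way to keep watching whether the space is eventually released after such a state change, using only the cat API from earlier (polled in a loop; the interval is arbitrary):

# poll per-node disk figures every 30 seconds and watch whether disk.used converges towards disk.indices
while true; do
  date
  curl -s '/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.percent'
  sleep 30
done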

I do not know. My memory is sketchy going back that far as I have not used this version in many years.

As far as I can see, this part of the code is still pretty much the same.

Anyway, thanks for your time, sir.