We have a single-node ES cluster with roughly 600 GB assigned to it. Every so often it gets above the high watermark and I run Curator, usually deleting any indices older than 60 days. In the past this would free up several hundred gigs of space and we'd be good to go for a while. Recently, though, each Curator run has been giving back less and less free space. Today, when I got alerted that the disk was full, I ran Curator and got almost no free space back.
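For reference, the kind of delete run I mean looks roughly like this (a sketch assuming Curator 5.x and curator_cli; not my exact command or filters):

# Delete logstash-* indices whose date in the index name is older than 60 days
curator_cli --host localhost delete_indices --filter_list '[
  {"filtertype": "pattern", "kind": "prefix", "value": "logstash-"},
  {"filtertype": "age", "source": "name", "direction": "older",
   "timestring": "%Y.%m.%d", "unit": "days", "unit_count": 60}
]'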
I'm not exactly sure what changed, and I'd love some help figuring out what could be causing this. Curator still says it's deleting indices, but the space isn't being reclaimed.
I thought maybe ES was too busy trying to keep up with writing data AND purging space, so I let it run for a while today with Logstash disabled so that no new logs were coming in. This didn't seem to have any impact.
Here's some info from the server:
root@ps-prod-elk:/var/log/elasticsearch# cd /var/lib/elasticsearch/nodes/0/
root@ps-prod-elk:/var/lib/elasticsearch/nodes/0# du -shx .
571G .
root@ps-prod-elk:/var/lib/elasticsearch/nodes/0# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/prod--elk--vg00-lv--root 618G 576G 11G 99% /
root@ps-prod-elk:/var/lib/elasticsearch/nodes/0# curl 'localhost:9200/_cat/indices?v'
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open logstash-2017.12.28 ELoaXq5kQ-6EdWCbeU2bFg 5 0 105281 0 40.7mb 40.7mb
green open logstash-2017.12.21 lq5Y4phGQDS6p4AxDS_Cbg 5 0 111098 0 42.5mb 42.5mb
green open logstash-2017.10.30 u8_kuBoSRyOxPw17VKWtUQ 5 0 21254606 0 17.8gb 17.8gb
green open logstash-2017.12.07 N1O-mjJ7QkKYsSejrkdB_Q 5 0 109788 0 42.7mb 42.7mb
green open logstash-2017.11.10 Cet5LarcSdSCbShEDR7oZQ 5 0 21299070 0 18.6gb 18.6gb
green open logstash-2017.12.03 ZBVXQ1wNSnmBAe50KBZrdA 5 0 113258 0 43.6mb 43.6mb
green open logstash-2017.11.11 nx-9J50yTsmUVFzrLCXr7Q 5 0 10336451 0 7.6gb 7.6gb
green open logstash-2017.12.13 k6rtVbPsSqe2W2jWm9h8bw 5 0 109179 0 41.7mb 41.7mb
green open logstash-2017.11.13 nFzSN1JzSgWsL4Kf1Sze-Q 5 0 24398517 0 21.6gb 21.6gb
green open logstash-2017.12.18 Cy5ekdXeTw-ts7HUuO5pyQ 5 0 104397 0 40.5mb 40.5mb
green open logstash-2017.12.11 8h8054B6QK-KkULc5-IA1Q 5 0 111067 0 43mb 43mb
green open logstash-2017.12.26 11QiJeEzTDCUk7GYKgnvBA 5 0 114170 0 43.1mb 43.1mb
green open logstash-2017.10.29 Jt2L8swQS1K5LrU1NUPKwA 5 0 9087462 0 6.1gb 6.1gb
green open logstash-2017.11.15 TNECzWEMR32pg7m7Loh0VQ 5 0 26008448 0 23.2gb 23.2gb
green open logstash-2017.11.27 6-QE7cPgT_CPyz-gP94NpA 5 0 107056 0 41.4mb 41.4mb
green open logstash-2017.12.19 9sGxQ6urT-W3ZfckMe58g 5 0 106945 0 40.7mb 40.7mb
green open logstash-2017.12.14 4IzkQngSQW-mlpEsk4jRhg 5 0 108428 0 41.7mb 41.7mb
green open logstash-2017.12.10 P-DwseVeRv2w6dyKkKzqyw 5 0 132400 0 51.1mb 51.1mb
green open logstash-2017.11.12 W-8zBAQsQI-LXpn7olHQhw 5 0 9726476 0 7gb 7gb
green open logstash-2017.10.31 ISMDie9qSu6dFm-VaGe6Hg 5 0 20832737 0 17.5gb 17.5gb
green open logstash-2017.10.22 BBpBZ2HrT5O5_9ABSUjnpA 5 0 7474561 0 4.4gb 4.4gb
green open logstash-2017.12.09 ZCt5ef6BSOWJTiYG8vOnJQ 5 0 113875 0 43.8mb 43.8mb
green open logstash-2017.10.26 aofj3vvaR_yF7qfvfC64zw 5 0 17958135 0 14.7gb 14.7gb
green open logstash-2017.12.06 oNPaiKgJQrG8Qe2np_9nqQ 5 0 110867 0 42.9mb 42.9mb
green open logstash-2017.11.26 iLhzuu_DQSy7hgqivC5dTQ 5 0 110007 0 42.9mb 42.9mb
green open logstash-2017.10.21 mOR9b8h9To6k9wmWQ20THg 5 0 7569658 0 4.5gb 4.5gb
green open logstash-2017.11.20 tO98PDi-TZmp44pEi7QlDw 5 0 21880906 0 19.2gb 19.2gb
green open logstash-2017.12.30 RnS2jBOLTam7T7JKd_IpNA 5 0 106561 0 40.9mb 40.9mb
green open logstash-2017.10.25 w103dmdrTWC4fuYB258lnQ 5 0 18550266 0 15.1gb 15.1gb
green open logstash-2017.12.08 vT1ntHdPQ6iaW8tUiCBsEw 5 0 114595 0 44.2mb 44.2mb
green open logstash-2017.11.17 gH0tvwIuQ5KiYjU7T6HWvA 5 0 25816269 0 22.8gb 22.8gb
green open logstash-2017.11.01 IJXZxIsRS92YlZ_2v-pJfQ 5 0 21190946 0 17.8gb 17.8gb
green open logstash-2017.10.10 gD78cz4wQcKmo79-keIHKA 5 0 12539609 0 9.3gb 9.3gb
green open logstash-2017.11.07 N0zDo_IPQh-MWjhedJXYDA 5 0 26225074 0 22.9gb 22.9gb
green open logstash-2017.11.22 uyZfssiZSNuJMb5PqnFJiw 5 0 53483 0 23.1mb 23.1mb
green open logstash-2017.10.12 cCL2fNIGS-2X7eAeOL09bg 5 0 12760386 0 9.5gb 9.5gb
green open logstash-2017.11.24 60PALkOsQV6iSWQeRV2bvA 5 0 86154 0 34.1mb 34.1mb
green open logstash-2017.11.19 CyAIgJIRTOiRGezGm92D-w 5 0 10343504 0 7.5gb 7.5gb
green open logstash-2017.11.14 Voc65JIUTdaPTUFLTXzhYA 5 0 24914240 0 22.1gb 22.1gb
green open logstash-2017.10.27 CcxBCyZsSuOPByF0JUWkLA 5 0 17309126 0 14gb 14gb
green open logstash-2017.10.24 7w90aptpQqm7SvkUI4o46g 5 0 20333119 0 16.7gb 16.7gb
green open logstash-2017.11.02 vmhM-VdARri2-xH692KkVA 5 0 21351775 0 17.8gb 17.8gb
green open .kibana rf17ELguQjq-I5ds66xnrw 1 0 80 2 131.8kb 131.8kb
green open logstash-2017.11.05 XTArEU7xQASiDW34cZhL-A 5 0 9759635 0 6.8gb 6.8gb
green open logstash-2017.10.16 bRjuZjyKS8eRNIKkDl8kQQ 5 0 13891992 0 10.5gb 10.5gb
green open logstash-2017.11.06 mtd3A-PvTKe31ZpapgGaFA 5 0 25731549 0 22.4gb 22.4gb
green open logstash-2017.10.15 TpVc_fZdTo-KpnfjtcngvQ 5 0 7623641 0 4.4gb 4.4gb
green open logstash-2017.10.11 eWZPJyzST8SDATZpbuMNBg 5 0 12831294 0 9.6gb 9.6gb
green open logstash-2017.12.17 MmJl9P2BRQaJHmqC-V88eQ 5 0 102134 0 40.1mb 40.1mb
green open logstash-2017.12.29 Mo68OS-TTMCvMdi1F3zuxw 5 0 102029 0 39.2mb 39.2mb
green open logstash-2017.11.16 hYAkxO_CQP69WphPepjs9A 5 0 25347653 0 22.5gb 22.5gb
green open logstash-2017.10.20 nElW2tKDTWWTHzuIINRhtA 5 0 12520805 0 9.3gb 9.3gb
green open logstash-2017.10.28 AH2coafTQ3itzWQkfm802w 5 0 8973639 0 6gb 6gb
This is likely a bad thing. How many shards are there on this single node? How big is the heap? Based on the index sizes I'm seeing here, you likely have a very high number of shards on this single node. If you have a 30G heap and more than 600 shards on that single node, you will begin to see memory pressure in your cluster. This affects everything from indexing to other cluster update actions, like trying to normalize the cluster state after deletes. That's because each open shard has an overhead cost in heap memory, regardless of how much data it contains.
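If you don't know those numbers off-hand, the _cat APIs will show them; something like this (a sketch, adjust the host if needed):

# Total number of shards open on the node (one line per shard)
curl -s 'localhost:9200/_cat/shards' | wc -l
# Configured and current heap per node
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.current,heap.max,heap.percent'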
You would do well to switch from daily indices to rollover indices, and not roll over until each shard in the index is over 10G. With a single node, you also probably shouldn't have more than 2 shards per index.
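Rollover works by writing through an alias and cutting over to a new index only when the current one crosses a threshold. A rough sketch (the index and alias names here are made up, and the max_size condition requires Elasticsearch 6.1 or later):

# One-time setup: first rollover index with a write alias
curl -XPUT 'localhost:9200/logstash-000001' -H 'Content-Type: application/json' -d '{
  "settings": { "number_of_shards": 2, "number_of_replicas": 0 },
  "aliases": { "logstash-write": {} }
}'
# Run periodically (e.g. from cron). max_size counts total primary storage,
# so 20gb across 2 shards is roughly 10G per shard.
curl -XPOST 'localhost:9200/logstash-write/_rollover' -H 'Content-Type: application/json' -d '{
  "conditions": { "max_size": "20gb" }
}'

Logstash would then index into the write alias (logstash-write here) instead of a daily index name.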
Oh, my. That's definitely outside our recommended best practices. There are many reasons for that guidance, one of which is extremely long garbage collection pauses, which could cause the cluster to stall. That could also be what you're encountering.
The following command will tell you the total number of primary shards:
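For example, the cluster health API reports the total as active_primary_shards (one way to get the number):

# active_primary_shards in the response is the total primary shard count
curl -s 'localhost:9200/_cluster/health?pretty'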
Thanks. The million dollar question is, "What was that number before you ran Curator?"
What is in the Elasticsearch logs on this single node? There are almost certain to be errors if space isn't being freed. Do you have any kind of monitoring in place, so we can see what is happening with the heap and garbage collection?
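Even without a monitoring stack, the node stats API will show current heap usage and garbage collection counts and times (a quick sketch):

# The jvm.mem and jvm.gc sections cover heap usage and collection counts/times
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'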
Unfortunately without a time machine I don't think I have any way to tell what that number was before curator ran.
As for errors in the logs, I've been watching them all afternoon while looking into this and haven't really seen any. There are informational messages like these:
[2017-11-20T16:04:09,230][WARN ][o.e.c.r.a.DiskThresholdMonitor] [lGLn5hx] high disk watermark [90%] exceeded on [lGLn5hx4TyKKRTRMqT1VDQ][lGLn5hx][/var/lib/elasticsearch/nodes/0] free: 6.9gb[1.1%], shards will be relocated away from this node
[2017-11-20T16:04:09,230][INFO ][o.e.c.r.a.DiskThresholdMonitor] [lGLn5hx] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2017-11-20T16:04:16,531][INFO ][o.e.m.j.JvmGcMonitorService] [lGLn5hx] [gc][3164] overhead, spent [259ms] collecting in the last [1s]
If I stop Logstash, the garbage collection messages subside. I'm guessing this means that without Logstash running, Elasticsearch is able to "catch up", but it still doesn't seem to be freeing up any space. I left it like this for a bit with Logstash disabled; nothing seemed to change, and nothing different was written to the logs.
I just roughly totaled up the sizes of our indices and the total does seem to match what is being used on disk. Maybe the volume of logs we're ingesting has simply grown and we can no longer keep as much history as we used to. I'll have to pay closer attention to this going forward.
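A quicker way to double-check that total (a sketch using the _cat/allocation API, which sums shard disk usage per node):

# disk.indices = space taken by shard data; disk.used/disk.avail = filesystem totals
curl -s 'localhost:9200/_cat/allocation?v'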
Again, I recommend rollover indices rather than dailies. That will help shrink the cluster state by cutting down the number of open shards. With only a single node, there's no reason to have 5 shards per index.
I will definitely look into that. Where might I be able to find more info about making the switch to that?
Also, say you suspected that something was "artificially" bloating an index, either by submitting duplicated log entries or by sending data that doesn't belong in ES. How might you find that when you're dealing with a 20 gig index? I don't have any reason to think that's happening yet, but it does worry me that it could be.
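One way I could imagine probing for that (a sketch, assuming the events have a keyword-mapped field such as message.keyword): a terms aggregation with min_doc_count set to 2 only returns values that occur more than once, so heavily duplicated entries float to the top.

# This can be heavy on a ~20G index; run it against one index at a time
curl -XGET 'localhost:9200/logstash-2017.11.15/_search?size=0&pretty' -H 'Content-Type: application/json' -d '{
  "aggs": {
    "possible_dupes": {
      "terms": {
        "field": "message.keyword",
        "min_doc_count": 2,
        "size": 20,
        "order": { "_count": "desc" }
      }
    }
  }
}'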