High disk usage compared to before

Hi everyone,
We are trying out Elasticsearch as one of the potential solutions for our search/insights needs. We are all set up with a small cluster of two nodes. For our initial test we used a single node in order to estimate the amount of resources required (mainly disk space). In that initial run we found that after indexing 290m entries (roughly 1/4 of our daily load), the disk usage was 102GB (out of 300GB total disk capacity). Based on those estimates we figured we would need two machines in our cluster, since we expect 4x that many records daily.

So after setting up the two-node cluster, we ran another test for a much longer duration. After running the test for a few days we found that it's currently using 400GB of disk with only 315m entries. Each of our documents is roughly the same size (the largest and smallest vary by only 5%), so it's really odd that the disk usage is so high. Could it be something to do with records not being evicted properly from disk? The test has been running for 3.5 days, with roughly 300m entries being indexed every 24 hours. Each entry has a TTL of 1 day, so overall the number of entries should remain roughly constant. Does anyone know why this would be the case?

Using TTL to manage retention period is quite inefficient, and should generally be avoided if possible. Expiry due to TTL results in an explicit delete of each document following a periodic scan, which is in reality an update to indicate that the document has been deleted. As Lucene segments are immutable, the space on disk is not reclaimed until the segment is merged. This is probably why you are seeing increased disk usage.
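
If you want to confirm this, the _cat APIs show how many deleted documents are still sitting in the segments waiting to be merged away. A quick check could look something like this (the host and port are just placeholders):

```
# Per-index document counts; docs.deleted counts documents that are
# marked as deleted (e.g. by TTL expiry) but not yet merged away
curl 'localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size'
```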

The recommended approach for managing retention period is to use time-based indices. This avoids explicit deletes as the index is deleted as a whole through the APIs once all documents are no longer needed, which puts less load on your cluster compared to TTL. It is common to use daily indices, but if your retention period is just 1 day you may instead use hourly indices.
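
A rough sketch of that workflow, assuming hourly indices; the events-* naming scheme and the event type below are purely examples:

```
# Index each document into an index named after the current hour,
# e.g. events-2016.01.15.13 (naming scheme is just an example)
curl -XPOST 'localhost:9200/events-2016.01.15.13/event' -d '{"field": "value"}'

# Once everything in an old index is past the retention period, drop the
# whole index in one cheap operation instead of deleting documents
curl -XDELETE 'localhost:9200/events-2016.01.14.13'
```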

Thanks for the suggestion. We will make the necessary changes to use time-based indices.

So we tried what you suggested, but the index usage is still quite high. Before, 24 hours of data corresponded to 100GB (~285 million records), but now it's taking 40GB for every hour (~40 million records). That's almost 3x what we saw before with a single node. Also, can we change the replication to 0? We are storing the data temporarily, and it's not a problem if the data gets lost in case of a server failure. Would that reduce the usage?

Obviously yes! If you have more than one node.

If replicas is now 1 (the default) then dropping it to 0 would halve the usage. It'd also speed up indexing quite a bit.
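
A sketch of how that could look; the events-* pattern and the template name are assumptions, so adjust them to your own naming:

```
# Turn off replicas for existing indices matching the pattern
curl -XPUT 'localhost:9200/events-*/_settings' -d '{
  "index": { "number_of_replicas": 0 }
}'

# For new time-based indices, set it once in an index template instead
curl -XPUT 'localhost:9200/_template/events_replicas' -d '{
  "template": "events-*",
  "settings": { "number_of_replicas": 0 }
}'
```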

It's worth checking the files on disk and having a look at what is taking up space. The file types are listed here.
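
For example, something along these lines could show where the space is going (the data path is an assumption based on a typical package install):

```
# Per-segment view from the cluster: segment sizes and deleted doc counts
curl 'localhost:9200/_cat/segments?v'

# On the node itself, tally disk usage by Lucene file extension
find /var/lib/elasticsearch -type f -name '*.*' -printf '%s %f\n' \
  | awk '{ ext = $2; sub(/.*\./, "", ext); bytes[ext] += $1 }
         END { for (e in bytes) printf "%12d  .%s\n", bytes[e], e }' \
  | sort -rn
```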

Sometimes doc values blow up. They use least common denominator and range tricks to encode themselves per segment. That means that sometimes it'll compress very well, but if you have data that varies quite a bit, it won't.

It's hard to tell from here though.

Thanks guys. I will switch the replica count to 0 in order to reduce the disk usage. That should help with the space issue. It's just really weird that we see two different results since the document size is consistent every time. Either way, reducing the disk usage by half by changing the replication factor to 0 should put it in line with our estimates.

Hi! You can try 2.0/2.1 and "index.codec": "best_compression" for time-based indexes. I faced something similar: the index size grows 2-4x while indexing data, but after all merges and a forced optimize it goes down to the expected size.
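
For example, if you create your hourly indices from a template, the codec can be set there; the template and index names below are assumptions:

```
# ES 2.x: use DEFLATE instead of LZ4 for stored fields on new indices
# matching the pattern (index.codec can only be set at index creation)
curl -XPUT 'localhost:9200/_template/events_compression' -d '{
  "template": "events-*",
  "settings": { "index.codec": "best_compression" }
}'
```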

Is it possible to automatically optimize indexes? We have hourly indexes which are not updated after the hour is over, so it would be safe to merge the segments.

Try to do it manually and see if it helps. You can use Curator or a custom nightly cron job (curl with _optimize/_forcemerge for the last day, with a wildcard index name). Merging/optimizing is not an immediate process, and an acknowledged status only means the cluster accepted (queued) your task, so you have to wait some time before seeing the result (say 5-10 minutes to see any effect).
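
A minimal sketch of such a nightly cron job, assuming hourly indices named like events-YYYY.MM.DD.HH; the naming, schedule, and path are assumptions:

```
# /etc/cron.d/es-forcemerge: once a night, merge yesterday's hourly
# indices down to a single segment each. On ES 2.1+ the endpoint is
# _forcemerge; on older versions use _optimize instead.
0 3 * * * root curl -XPOST "localhost:9200/events-$(date -d yesterday +\%Y.\%m.\%d).*/_forcemerge?max_num_segments=1"
```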