This metric double-counts disk space for hard-linked files, such as those created when shrinking, splitting, or cloning an index.
I ran GET _tasks?detailed=true and don't see any reindex or other tasks. The only tasks listed seem to be cluster:monitor/tasks/.
disk.indices is the total of the store sizes of each index (i.e. the sum of the sizes of the individual files), whereas disk.used is whatever the OS reports as the used space on the underlying filesystem. In particular that means that disk.indices will double-count any hard-linked files whereas disk.used won't.
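To make the double-counting concrete, here is a minimal sketch that is not taken from Elasticsearch itself. It creates one file plus a hard link to it in a temporary directory, then compares a naive sum of file sizes (analogous to disk.indices) with a sum that counts each inode once (closer to what the filesystem reports as used). The directory and file names are hypothetical, and the sketch assumes a filesystem that supports hard links.

```python
import os
import tempfile


def naive_and_deduplicated_sizes(root):
    """Return (sum of all file sizes, sum counting each inode only once)."""
    naive = 0
    deduplicated = 0
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name))
            naive += st.st_size  # every path counts, even if it shares data
            key = (st.st_dev, st.st_ino)  # identifies the underlying file
            if key not in seen:
                seen.add(key)
                deduplicated += st.st_size
    return naive, deduplicated


def demo():
    """Build a hypothetical source index and a hard-linked clone, then measure."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "source_index"))
        os.mkdir(os.path.join(root, "cloned_index"))
        segment = os.path.join(root, "source_index", "segment.dat")
        with open(segment, "wb") as f:
            f.write(b"x" * 4096)
        # A hard link: a second name for the same data, no extra data blocks.
        os.link(segment, os.path.join(root, "cloned_index", "segment.dat"))
        return naive_and_deduplicated_sizes(root)


if __name__ == "__main__":
    naive, deduplicated = demo()
    print("naive sum:", naive, "deduplicated sum:", deduplicated)
```

The naive sum comes out at twice the deduplicated one, which is the same kind of gap that can make disk.indices exceed disk.used.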
Thanks @DavidTurner. Could you shed more light on hard-linked files here? How would they get created? AFAIK this is a managed cluster, so it is unlikely that anyone created hard links manually.
Is it possible that, say, a shrink was triggered and then aborted midway, leaving hard-linked files behind?
@DavidTurner do you still consider this a bug from a user's perspective, given how confusing these numbers look, even though the behaviour has a logical explanation?
Not really, these numbers measure different things and are both important as they are defined today. disk.indices is a good measure of the size of the overall dataset and is insensitive to things like hard-linking, whereas disk.used tells us how much real disk space we're actually using right now, accounting for hard-linked files, filesystem overhead (e.g. rounding up small files to a whole block), non-file space usage (e.g. directory entries), and other data also stored on the same filesystem.
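For anyone who wants to see the gap per node without shell access, the sketch below may help. It assumes the rows come from GET _cat/allocation?format=json&bytes=b, which reports disk.indices and disk.used for each node; the helper name and the sample values are hypothetical and are not taken from this cluster.

```python
def indices_minus_used(rows):
    """Map node name -> disk.indices - disk.used, in bytes.

    Rows that lack either value (for example the UNASSIGNED row) are skipped.
    """
    result = {}
    for row in rows:
        indices = row.get("disk.indices")
        used = row.get("disk.used")
        if indices is None or used is None:
            continue
        result[row["node"]] = int(indices) - int(used)
    return result


# Hypothetical rows shaped like the JSON output of _cat/allocation.
sample_rows = [
    {"node": "node-a", "disk.indices": "1200", "disk.used": "1000"},
    {"node": "node-b", "disk.indices": "800", "disk.used": "950"},
    {"node": "UNASSIGNED", "disk.indices": None, "disk.used": None},
]

if __name__ == "__main__":
    for node, diff in indices_minus_used(sample_rows).items():
        print(node, diff)
```

A positive difference on a node suggests that hard-linked files are being counted more than once there; a negative one is the ordinary case of filesystem overhead or other data on the same filesystem.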
Thanks for the clarification. Makes sense. But I've a few questions:
Is that why it double counts the hard-linked files?
I wouldn't be surprised if disk.used > disk.indices, but in this case disk.indices > disk.used has held for the last 6 consecutive days (and probably longer). If a reindex, clone, or shrink task were running, it would be listed under GET _tasks?detailed=true. Correct?
But the output of GET _tasks?detailed=true shows only cluster:monitor/tasks/ tasks.
How do we find the root cause of disk.indices > disk.used in a managed ES cluster where we cannot SSH? Any thoughts?
Reindex yes, but that's not relevant. Cloning and shrinking are pretty quick, but will keep using hard-linked files for arbitrarily long, so you probably wouldn't see anything about them in the tasks.
Maybe this is the fundamental point: is there actually a problem here? Do you need to investigate?
Thanks David. I thought it was a problem until you clarified in a previous post. I was looking into the huge amount of disk space occupied by this cluster, and while disk.used showed 30TB, disk.indices reported 37.2TB. I was puzzled about which one to rely on, since a 7.2TB difference is huge.
So I suppose I can rest assured that this cluster uses 30TB and not 37.2TB. Correct?