Nesting high disk usage

(Benjamin Gathmann) #1

Hi there,

I have complex documents where I need a lot of nesting on more than 100 different nodes.
I first indexed my documents without nesting, which gave me a disk size ratio of around 1 : 0.4 (i.e. the ES index uses less than half the disk space of the original documents).
Next, I indexed the documents with nesting added, and quite shockingly, the ratio is now around 1:2.5 (so now the index is gobbling up extreme amounts of disk space).
Can somebody guess and explain what is going on?

(Benjamin Gathmann) #2

Naturally, the suspicion is that the large number of nested objects leads to the increase in disk usage. My original 562 documents result in 10,321,880,766 (that's 10 billion, yes!) documents on my node.
To get a better picture, it would be helpful to know how many of each nested type I have. The Indices Stats API only gives me the total number of docs.
Is there a way to get a break-down of these numbers by nested document path?
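
For anyone wanting a similar break-down: each nested object is indexed as its own hidden Lucene document, so one workaround is to count them on the client side before indexing. The following is only a rough sketch, not an Elasticsearch feature. The function name `count_nested`, the sample document and the set of nested paths are all made up for illustration, and the paths would have to be copied by hand from the real mapping.

```python
from collections import Counter


def count_nested(doc, nested_paths):
    """Count the objects found under each nested path of one source document.

    nested_paths is a set of dotted field paths that are mapped as
    type "nested" (assumed to be taken from the real mapping by hand).
    """
    counts = Counter()

    def walk(value, path):
        if isinstance(value, dict):
            # An object sitting at a nested path becomes one hidden document.
            if path in nested_paths:
                counts[path] += 1
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # Arrays do not change the field path; visit each element.
            for item in value:
                walk(item, path)

    walk(doc, "")
    return counts


# Hypothetical document and mapping paths, for illustration only.
sample_doc = {
    "title": "x",
    "authors": [
        {"name": "a", "affiliations": [{"org": "u1"}, {"org": "u2"}]},
        {"name": "b", "affiliations": []},
    ],
}
sample_paths = {"authors", "authors.affiliations"}

per_path = count_nested(sample_doc, sample_paths)
# One root document plus one hidden document per nested object.
total_lucene_docs = 1 + sum(per_path.values())
print(dict(per_path), total_lucene_docs)
```

Summing the per-path counters over all source documents should show which nested paths contribute most to the total.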

(Benjamin Gathmann) #3

Sorry, the 10 billion figure was wrong; that was size_in_bytes. :wink:
The docs count is actually 14,183,108 (so still very high for 562 source documents).
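
To put that number in perspective, here is a quick back-of-the-envelope calculation. It assumes the reported docs count includes the hidden nested documents, which is what the figures above suggest; the variable names are just for illustration.

```python
source_docs = 562          # original documents, from post #2
lucene_docs = 14_183_108   # docs count reported by the stats API

# Average number of Lucene documents created per source document.
per_source = lucene_docs / source_docs

# Subtract the root document itself to estimate nested objects per source doc.
nested_per_source = per_source - 1

print(f"{per_source:,.0f} Lucene docs per source document")
print(f"{nested_per_source:,.0f} of them are nested objects (on average)")
```

So each source document expands into roughly 25,000 Lucene documents on average.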

(Benjamin Gathmann) #4

Hello again,
I have played around with setting index.codec to best_compression, but this did not have any effect.
Other approaches I have picked up from "The true story behind Elasticsearch storage requirements" (part 1+2):

  • disable doc_values (I have loads of not_analyzed strings)
  • set index:no for any field that I probably will not search on

What would others recommend?
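
To make the options above concrete, here is a minimal sketch of an index-creation body that combines them. It assumes the older string-field mapping syntax (since the fields are described as not_analyzed). The type name `my_type` and the field names `status_code` and `raw_payload` are hypothetical placeholders, not taken from the real mapping.

```python
import json

# Sketch of a create-index request body; all names below are placeholders.
index_body = {
    "settings": {
        "index": {
            # Trades some stored-field read speed for smaller stored fields.
            "codec": "best_compression",
        }
    },
    "mappings": {
        "my_type": {  # hypothetical type name
            "properties": {
                # Exact-match string that is searched but never sorted or
                # aggregated on, so doc_values are switched off.
                "status_code": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": False,
                },
                # Field that is only returned in _source, never searched.
                "raw_payload": {
                    "type": "string",
                    "index": "no",
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

As far as I understand, the codec setting only affects segments written after it is applied, so an already-built index would need to be reindexed or have its segments rewritten before any difference in size shows up. That might explain why changing it appeared to have no effect.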

(Benjamin Gathmann) #5

After further testing, there is just one thing I really don't get:
Why do nested documents use so much disk space?

Please, somebody give me a clue.

(Benjamin Gathmann) #6

I have received a helpful reply on Github, see here:

Of course, further comments and details are welcome.
