Index Size: Elasticsearch 1.4 -> 2.1

We recently switched from ES 1.4 to ES 2.1. After recreating our indices with ES 2.1, they are more than twice as large as they were with ES 1.4. Any hints about this effect, or has anybody made the same observation?

Probably because of doc_values which are activated by default?

And in ES 1.4 doc_values weren't activated by default?

Have a look at Elasticsearch 2.0 2.5X Disk Space; it seems to be the same problem.

Thanks. I will check this.

No they were not.

Is it possible to disable doc_values by default, as it was in 1.x? Or do we need to disable them on every field in our mapping now?

So you have not_analyzed fields but you don't use them for sorting or aggregations?

Yes, you have to define this for every field, but you can use Dynamic Templates.
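As a sketch, a dynamic template that disables doc_values on all not_analyzed string fields might look like this in a 2.x mapping (the type name `my_type` and template name are placeholders):

```json
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "strings_without_doc_values": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": false
            }
          }
        }
      ]
    }
  }
}
```

Note that doc_values only apply to not_analyzed fields, so a template like this covers exactly the fields where 2.x changed the default.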

Now I changed all our fields to doc_values: false, but it seems that it doesn't work... Index size with ES 1.4 = 69GB, index size with ES 2.1 = 172GB. Maybe you can have a look at our mapping? Is it possible to upload it here? The upload function only allows jpg, jpeg, png, gif :frowning:

You can pretty-format it and copy and paste it here if it is small.

If not, paste it on gist.github.com

Please have a look at https://gist.github.com/anonymous/c0ab1a97d655322cde55

Is it the same mapping you have for your 1.4 version?
The mapping looks good.

Did you index exactly the same data?

The mapping is the same except for "doc_values": false, and under

```
"_source": {
  "enabled": true,
  "compressed": true
}
```

I removed "compressed": true because it isn't supported anymore and compression is enabled by default. Isn't it?

The data is exactly the same, and this is what makes me wonder. It seems that the doc_values: false configuration doesn't work...

@jpountz Any idea about this?

Alexandre, can you break down the size of your data directory by file extension for both versions? (e.g. how much disk space the .fdt files, .dvm files, .tim files, etc. are using)
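One quick way to produce such a breakdown is to sum file sizes per extension with find and awk. This is just a sketch; the `DATA_DIR` path is an assumption and should point at your node's actual data directory (it also relies on GNU find's `-printf`):

```shell
# Sum the size of all index files per extension.
# DATA_DIR is a placeholder; point it at your Elasticsearch data directory.
DATA_DIR=${DATA_DIR:-/var/lib/elasticsearch}
find "$DATA_DIR" -type f -name '*.*' -printf '%s %f\n' 2>/dev/null |
  awk '{ext = $2; sub(/.*\./, "", ext); sizes[ext] += $1}
       END {for (e in sizes) printf "%12d  .%s\n", sizes[e], e}' |
  sort -rn
```

Running this on both the 1.4 and 2.1 data directories should make it obvious which file type (stored fields, doc values, terms dictionary, norms, ...) accounts for the growth.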

At https://gist.github.com/anonymous/120f63fbad5939febd92 you can find 4 files.

The first two files list the files and sizes inside the data directory of each ES 1.4 node.

The last two files list the files and sizes inside the data directory of each ES 2.1 node.

The main problem seems to be due to the fact that you have sparse analyzed string fields, i.e. fields that are only present in a minority of documents. Norms were entirely stored in memory up to and including 2.0, which could occasionally take a lot of memory. In 2.1, norms have been moved to disk in order to reduce the memory requirements of Elasticsearch. However, while the new encoding requires much less memory (no memory at all, actually), it also requires more disk space when fields are sparse.

We can look into better compressing norms on sparse fields, but this is not something that would come for free, in particular performance would be affected, and this would take quite some time to be released.

However, on your end, you could look into modeling your documents in such a way that you have fewer sparse analyzed string fields. Another option is to disable norms on your string fields if you don't need scoring.
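If scoring on those fields is indeed not needed, disabling norms per field might look like this in a 2.x mapping (the field name `my_field` is a placeholder):

```json
{
  "properties": {
    "my_field": {
      "type": "string",
      "norms": { "enabled": false }
    }
  }
}
```

With norms disabled, no per-document length normalization factor is stored for the field, so sparse fields no longer pay the per-document disk cost described above.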