ES5 vs ES2 index size increase

The only changes I made to my indices when migrating from ES 2.4.4 to ES 5.3.0 are the following mapping upgrades (the rest of the fields are compatible with 5.x):

{ x: string, index: not_analyzed } => { x: keyword }
{ x: string, term_vector: yes } => { x: text, term_vector: yes }
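For readers following along, the two rewrites above can be sketched as a small Python helper. This is only an illustration under assumptions: `upgrade_field` is a hypothetical name, it covers just the two cases listed in this post, and it refuses anything else rather than guessing.

```python
def upgrade_field(field):
    """Translate a 2.x `string` field mapping into the 5.x form used in this post.

    Hypothetical helper: it only handles analyzed and not_analyzed strings.
    """
    if field.get("type") != "string":
        # Non-string fields are already compatible with 5.x in this scenario.
        return dict(field)

    index = field.get("index")
    if index not in (None, "analyzed", "not_analyzed"):
        # Other `index` values are outside the two cases shown above.
        raise ValueError("unsupported index option: %r" % (index,))

    # Keep every other option (term_vector, doc_values, ...) untouched.
    upgraded = {k: v for k, v in field.items() if k not in ("type", "index")}
    upgraded["type"] = "keyword" if index == "not_analyzed" else "text"
    return upgraded


print(upgrade_field({"type": "string", "index": "not_analyzed"}))
print(upgrade_field({"type": "string", "term_vector": "yes"}))
```

Options such as an explicit `doc_values` setting are carried over unchanged, so the helper does not alter which fields store doc values.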

After reindexing from scratch, I got these stats:

dataset1=560.392 docs.
ES 2.4.4 index size=99.7M
ES 5.3.0 index size=104M

dataset2=2.583.604 docs
ES 2.4.4 index size=623M
ES 5.3.0 index size=662M

Is it a general rule that a 5.x index is larger than a 2.x index for the same docs?

Maybe it matters: 2.4.4 comes from the official deb repository and 5.3.0 comes from the official Docker image (I mention this in case their default configurations differ).

My first guess is doc values.
They are generated when you use the keyword type, which is not actually identical to not_analyzed.

I recall that more data was moved to doc_values in 5.x compared to 2.x, which means that an index in 5.x may, depending on your mappings, take up a bit more space. When I tested it on a sample data set a while back, I think the increase was in the range of 3-5%, but your mileage may vary.

All relevant fields have doc_values explicitly disabled or enabled; no changes were made to this during the migration.

Thanks, I guess that's the reason. I have some fields with doc_values enabled indeed.

Have you run a force merge on these indices to ensure they have the same number of segments?

I've just figured out that indexing on 2.4.4 was done into 1 shard, while on 5.3.0 it was done into 6 shards.
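For anyone wanting to line up the two setups, here is a minimal Python sketch of the three REST calls involved: creating an index with a single shard, force merging it down to one segment, and reading back the primary store size. The node address `http://localhost:9200` and the index name `dataset1` are assumptions for illustration only; the functions just build the requests, and nothing is sent until `send` is called.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local node on the default port


def create_index_request(index, shards=1):
    """PUT /<index> with an explicit primary shard count."""
    body = {"settings": {"index": {"number_of_shards": shards}}}
    return urllib.request.Request(
        "%s/%s" % (ES_URL, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def forcemerge_request(index, max_num_segments=1):
    """POST /<index>/_forcemerge to merge each shard down to N segments."""
    url = "%s/%s/_forcemerge?max_num_segments=%d" % (ES_URL, index, max_num_segments)
    return urllib.request.Request(url, method="POST")


def cat_indices_request(index):
    """GET /_cat/indices/<index> with shard count, doc count and primary size."""
    url = "%s/_cat/indices/%s?v&h=index,pri,docs.count,pri.store.size" % (ES_URL, index)
    return urllib.request.Request(url, method="GET")


def send(request):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Example against a running node (hypothetical index name):
# send(create_index_request("dataset1", shards=1))
# ... reindex the documents ...
# send(forcemerge_request("dataset1"))
# print(send(cat_indices_request("dataset1")))
```

Comparing sizes only after both indices have the same shard count and have been merged to the same number of segments removes per-shard and per-segment overhead from the comparison.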

After reindexing on 5.3.0 into 1 shard and running _forcemerge, here are the updated stats:

dataset1=560.392 docs.
ES 2.4.4 index size=99.7M
ES 5.3.0 index size=100M

dataset2=2.583.604 docs
ES 2.4.4 index size=623M
ES 5.3.0 index size=630M

5.3.0 is still a bit larger, but the difference is really small.
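To put numbers on it, here is a quick Python calculation of the relative size increase from the figures posted in this thread, before and after the single-shard reindex and force merge. It assumes the reported sizes all use the same unit (the "M" suffix as printed); the dictionary keys are just illustrative labels.

```python
def pct_increase(old_size, new_size):
    """Relative growth of new_size over old_size, in percent."""
    return (new_size - old_size) / old_size * 100.0


# Sizes as reported in this thread (same unit assumed throughout).
sizes = {
    "dataset1": {"es2": 99.7, "es5_6_shards": 104.0, "es5_1_shard_merged": 100.0},
    "dataset2": {"es2": 623.0, "es5_6_shards": 662.0, "es5_1_shard_merged": 630.0},
}

for name, s in sizes.items():
    before = pct_increase(s["es2"], s["es5_6_shards"])
    after = pct_increase(s["es2"], s["es5_1_shard_merged"])
    print("%s: +%.1f%% with 6 shards, +%.1f%% with 1 shard after force merge"
          % (name, before, after))

# dataset1: +4.3% with 6 shards, +0.3% with 1 shard after force merge
# dataset2: +6.3% with 6 shards, +1.1% with 1 shard after force merge
```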

Thanks for your suggestions!
