Index significantly larger after reindexing

I've had to reindex some data in order to change the shard count on the index they were in: I went from having one shard on the old index to four shards on the new index. The old index (1 shard) was 38.1GB; the new index (4 shards) is 46.7GB; both indices have the same number of documents. Is there any reason why just going from one shard to four would cause such a large increase in size (I did expect some per-shard overhead)? Both indices use the same template and field mappings.

This is on ES 2.0.

Thanks,
John Ouellette

Probably doc values, and the re-sharding causing an increase in relative cardinality, which means compression won't be as good.

Hi Mark -- I understand the words, but not when they are put together like that :slight_smile: Could you explain that a bit more? Is a ~20% increase in the size of the index something to expect in this case?

Doc values are a column-wise way of storing bits of the document for quick aggregations. They are compressed using greatest-common-divisor tricks across whole Lucene segments. Lucene segments are the immutable chunks that make up the index. All operations in Elasticsearch look like `foreach shard { foreach segment { doSomeStuff } aggregate } aggregate`.
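
To make the compression point concrete, here's a toy sketch of the greatest-common-divisor idea -- it is not Lucene's actual codec and the numbers are made up, but it shows why a column of values that share structure can be stored in very few bits per value:

```python
# Toy illustration of GCD-style doc-values compression (not Lucene's real codec).
# If every value in a segment shares a large common divisor, each value can be
# stored as a tiny offset instead of a full 64-bit number.
from functools import reduce
from math import gcd

values = [4000, 12000, 6000, 20000, 8000]   # made-up numeric field values in one segment

common = reduce(gcd, values)                 # 2000
base = min(values)                           # 4000
offsets = [(v - base) // common for v in values]   # [0, 4, 1, 8, 2]

bits_per_value = max(o.bit_length() for o in offsets)   # 4 bits instead of 64
print(common, base, offsets, bits_per_value)
```

Re-sharding shuffles which documents land in which segment, so the common divisors and value ranges per segment change too, and the encoding can come out noticeably less compact.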

Depends on how you did the reindex. Were there few segments before and many now? If so, merging the segments will probably help. You can learn about the file types here and use that knowledge to count the segments.
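
As a rough way to count segments, something like the sketch below (Python, assuming a node reachable at localhost:9200; the index names are placeholders) hits the _cat/segments API, which returns one line per segment per shard copy, so you can compare the old and new index directly:

```python
# Count Lucene segment entries per index via the _cat/segments API.
# localhost:9200 and the index names are assumptions -- adjust for your cluster.
import requests

for index in ("old_index", "new_index"):
    resp = requests.get("http://localhost:9200/_cat/segments/%s" % index)
    resp.raise_for_status()
    lines = [l for l in resp.text.splitlines() if l.strip()]
    print("%s: %d segment entries (one per segment per shard copy)" % (index, len(lines)))
```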

If there are fewer segments now, then you are probably hitting a regression caused by worse compression on the doc values. You can figure that out by looking at the sizes of the files.
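
One way to do that comparison is to total the on-disk file sizes by Lucene extension for a shard directory, roughly like this sketch (the data path is an assumption and will differ per install; on the Lucene 5.x that ES 2.0 ships, .dvd/.dvm are doc values, .fdt/.fdx stored fields, .tvd/.tvx term vectors):

```python
# Sum index file sizes by extension for one shard's Lucene directory.
# The path is an assumption -- adjust the data path, index name and shard number.
import os
from collections import defaultdict

shard_dir = "/var/lib/elasticsearch/my_cluster/nodes/0/indices/new_index/0/index"

totals = defaultdict(int)
for name in os.listdir(shard_dir):
    path = os.path.join(shard_dir, name)
    if os.path.isfile(path):
        ext = os.path.splitext(name)[1] or name
        totals[ext] += os.path.getsize(path)

for ext, size in sorted(totals.items(), key=lambda kv: -kv[1]):
    print("%-8s %10.1f MB" % (ext, size / 1024.0 / 1024.0))
```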

It's kind of hard to track down size changes, but looking at those files is the way I start. You might find something crazy like "I accidentally turned on term vectors in the new index", and that'll take up 40% more room, easy.
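
For the term-vectors kind of surprise specifically, a quick sanity check is to pull the new index's mapping and look for any term_vector settings; a rough sketch (host and index name are assumptions):

```python
# Quick check: does the new index's mapping enable term vectors anywhere?
# Host and index name are assumptions.
import json
import requests

resp = requests.get("http://localhost:9200/new_index/_mapping")
resp.raise_for_status()
print("term_vector mentioned in mapping:", "term_vector" in json.dumps(resp.json()))
```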

Thanks Nik! (Sorry I didn't reply, was travelling :))

Thanks Mark and Nik -- I'll be doing a bit more reading, but that helps.