Larger index size after Elasticsearch reindex

After performing a reindex on a 75GB index, the new index came out at 79GB.

Both indexes have the same doc count (54,123,676) and both have the exact same mapping. The original index has 6×2 shards and the new one has 3×2 shards.

The original index also has 75,857 deleted documents which were not carried across, so we are pretty stumped as to how it could be smaller than the new index at all, let alone by a whole 4GB.
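
For context, the new index was created with 3 primaries and 1 replica and then populated via the reindex API. The requests looked roughly like this (index names are placeholders and the mapping is elided; this is just the shape of what we ran, not the exact requests):

PUT /NEWINDEX
{
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": { ... }
}

POST _reindex
{
    "source": { "index": "OLDINDEX" },
    "dest": { "index": "NEWINDEX" }
}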

Original Index

{
    "_shards": {
        "total": 12,
        "successful": 12,
        "failed": 0
    },
    "_all": {
        "primaries": {
            "docs": {
                "count": 54123676,
                "deleted": 75857
            },
            "store": {
                "size_in_bytes": 75357819717,
                "throttle_time_in_millis": 0
            },
            ...
            "segments": {
                "count": 6,
                "memory_in_bytes": 173650124,
                "terms_memory_in_bytes": 152493380,
                "stored_fields_memory_in_bytes": 17914688,
                "term_vectors_memory_in_bytes": 0,
                "norms_memory_in_bytes": 79424,
                "points_memory_in_bytes": 2728328,
                "doc_values_memory_in_bytes": 434304,
                "index_writer_memory_in_bytes": 0,
                "version_map_memory_in_bytes": 0,
                "fixed_bit_set_memory_in_bytes": 0,
                "max_unsafe_auto_id_timestamp": -1,
                "file_sizes": {}
            }
            ...

New Index

{
    "_shards": {
        "total": 6,
        "successful": 6,
        "failed": 0
    },
    "_all": {
        "primaries": {
            "docs": {
                "count": 54123676,
                "deleted": 0
            },
            "store": {
                "size_in_bytes": 79484557149,
                "throttle_time_in_millis": 0
            },
            ...
            "segments": {
                "count": 3,
                "memory_in_bytes": 166728713,
                "terms_memory_in_bytes": 145815659,
                "stored_fields_memory_in_bytes": 17870464,
                "term_vectors_memory_in_bytes": 0,
                "norms_memory_in_bytes": 37696,
                "points_memory_in_bytes": 2683802,
                "doc_values_memory_in_bytes": 321092,
                "index_writer_memory_in_bytes": 0,
                "version_map_memory_in_bytes": 0,
                "fixed_bit_set_memory_in_bytes": 0,
                "max_unsafe_auto_id_timestamp": -1,
                "file_sizes": {}
            }
            ...

Any ideas?

I don't think these numbers are giving the full picture. I can't believe that 6 shards have only 3 segments between them, because that would mean at least 3 of the shards are completely empty.

Each copy of a shard will be a different size: Elasticsearch doesn't coordinate how segments are created or how they are merged. It's possible that the numbers we are looking at for the new index include some segments that are currently being merged, which takes up extra space. I think we will need to look at the individual copies to better understand what's going on:

GET /INDEXNAME/_stats?level=shards

or possibly simpler to eyeball:

GET _cat/shards/INDEXNAME
GET _cat/segments/INDEXNAME
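
If the bare _cat output gets hard to read, the _cat APIs also accept ?v to print column headers and h= to pick specific columns, for example:

GET _cat/segments/INDEXNAME?v&h=index,shard,prirep,segment,docs.count,docs.deleted,size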

Thanks. Here are the results. It's unlikely that there's a merge going on, as this is a fairly old index.

GET _cat/shards/INDEXNAME
INDXNAME 1 r STARTED 9019394 11.7gb x.x.x.x Xo66i6f
INDXNAME 1 p STARTED 9019394 11.7gb x.x.x.x ZNhGnBb
INDXNAME 4 r STARTED 9021157 11.6gb x.x.x.x QadY73s
INDXNAME 4 p STARTED 9021157 11.6gb x.x.x.x RT1VsHj
INDXNAME 5 p STARTED 9018442 11.7gb x.x.x.x iDqmtDR
INDXNAME 5 r STARTED 9018442 11.7gb x.x.x.x 6J5KbTY
INDXNAME 2 r STARTED 9018402 11.6gb x.x.x.x xfg9Tzs
INDXNAME 2 p STARTED 9018402 11.6gb x.x.x.x lAQOiOZ
INDXNAME 3 r STARTED 9022920 11.7gb x.x.x.x yi-VARZ
INDXNAME 3 p STARTED 9022920 11.7gb x.x.x.x BOQd_CF
INDXNAME 0 p STARTED 9023361 11.6gb x.x.x.x lms6M6I
INDXNAME 0 r STARTED 9023361 11.6gb x.x.x.x bsMplAj

GET _cat/segments/INDEXNAME
INDXNAME 0 p x.x.x.x _46b 5411 9023361 12633 11.6gb 28936424 true true 6.6.0 false
INDXNAME 0 r x.x.x.x _46b 5411 9023361 12633 11.6gb 28936424 true true 6.6.0 false
INDXNAME 1 r x.x.x.x _467 5407 9019394 12743 11.7gb 28946279 true true 6.6.0 false
INDXNAME 1 p x.x.x.x _467 5407 9019394 12743 11.7gb 28946279 true true 6.6.0 false
INDXNAME 2 r x.x.x.x _45c 5376 9018402 12783 11.6gb 28930970 true true 6.6.0 false
INDXNAME 2 p x.x.x.x _45c 5376 9018402 12783 11.6gb 28930970 true true 6.6.0 false
INDXNAME 3 r x.x.x.x _47c 5448 9022920 12536 11.7gb 28970827 true true 6.6.0 false
INDXNAME 3 p x.x.x.x _47c 5448 9022920 12536 11.7gb 28970827 true true 6.6.0 false
INDXNAME 4 r x.x.x.x _49m 5530 9021157 12566 11.6gb 28948098 true true 6.6.0 false
INDXNAME 4 p x.x.x.x _49m 5530 9021157 12566 11.6gb 28948098 true true 6.6.0 false
INDXNAME 5 p x.x.x.x _418 5228 9018442 12596 11.7gb 28917526 true true 6.6.0 false
INDXNAME 5 r x.x.x.x _418 5228 9018442 12596 11.7gb 28917526 true true 6.6.0 false

This looks like your old index (6 shards) and it looks like it's been force-merged down to a single segment per shard. It'd be good to see the same information for the new index too.

You said in the OP that you were looking at the result of a reindex, which creates a new index that may still be subject to merges. Have you force-merged the new index too, or does it contain multiple segments?

Were the indices created with different versions of Elasticsearch? Do they have identical mappings? It's certainly possible that the on-disk representation has changed by a few %, particularly if the mappings are different.
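
If it's easier than digging through history, both of those can be checked directly: index.version.created in the settings shows the version that created each index, and the two mappings can be diffed side by side (OLDINDEX/NEWINDEX are placeholders here):

GET /OLDINDEX/_settings
GET /NEWINDEX/_settings
GET /OLDINDEX/_mapping
GET /NEWINDEX/_mapping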

New index:

GET _cat/shards/INDEXNAME
INDEXNAME 2 p STARTED 18036844 24.6gb x.x.x.x xfg9Tzs
INDEXNAME 2 r STARTED 18036844 24.6gb x.x.x.x opps4O9
INDEXNAME 1 r STARTED 18040551 24.7gb x.x.x.x _8678PW
INDEXNAME 1 p STARTED 18040551 24.7gb x.x.x.x ZNhGnBb
INDEXNAME 0 r STARTED 18046281 24.6gb x.x.x.x yi-VARZ
INDEXNAME 0 p STARTED 18046281 24.6gb x.x.x.x 8IOt5Hb

GET _cat/segments/NEWINDEX
audit_201801_1 0 r x.x.x.x _hoj 22915 18046281 0 24.6gb 55592211 true true 6.6.0 false
audit_201801_1 0 p x.x.x.x _hoj 22915 18046281 0 24.6gb 55592211 true true 6.6.0 false
audit_201801_1 1 r x.x.x.x _hok 22916 18040551 0 24.7gb 55578324 true true 6.6.0 false
audit_201801_1 1 p x.x.x.x _hok 22916 18040551 0 24.7gb 55578324 true true 6.6.0 false
audit_201801_1 2 p x.x.x.x _hoa 22906 18036844 0 24.6gb 55558178 true true 6.6.0 false
audit_201801_1 2 r x.x.x.x _hoa 22906 18036844 0 24.6gb 55558178 true true 6.6.0 false

We did a POST /_forcemerge?max_num_segments=1 on the new index, and the old one had the same thing done months before.

Hmm, OK, it's the same version and it's also been force-merged, so it looks like we're making a fair comparison. You didn't say yet, but I just thought I'd check: is it exactly the same mapping too?

Each segment in the new index contains twice as many documents as in the old one. I'm speculating, but I wonder if it's something like the document IDs taking up more space. IIRC Lucene numbers the documents sequentially within each segment, and in the old index those numbers would fit into 3 bytes because they're all less than 2^24, whereas in the new one the document IDs can't all fit into 3 bytes any more.
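
To put rough numbers on that, using the per-shard doc counts from the _cat output above (each shard is a single segment in both indices):

2^24 = 16,777,216
old index: ~9.02 million docs per shard, below 2^24
new index: ~18.04 million docs per shard, above 2^24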

Can you clarify how exactly this is causing you an issue?

The mapping is identical. The document ID idea is interesting, though I struggle to see how that would cause a difference of 4GB.

I checked with some folks who know a lot more about Lucene's internals than I do, and this seemed to them to be a plausible explanation. It depends on the details of the mapping, but increasing the size of each document ID by a byte (33%) could indeed explain this.
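
As a very rough back-of-the-envelope check on the observed sizes (not a claim about exactly how Lucene encodes things):

79,484,557,149 - 75,357,819,717 = 4,126,737,432 bytes of extra space
4,126,737,432 / 54,123,676 docs ≈ 76 extra bytes per document

So for a one-byte-per-doc-ID increase to account for it, each document's ID would need to be stored on the order of tens of times (e.g. once in the postings of each indexed term), which is plausible for documents with a few dozen indexed terms.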
