After reindex, new index has same documents but data size is much less

mikewillis · July 29, 2020, 2:02pm

As part of preparation for moving from Elasticsearch 6 to 7, I've reindexed an index which the Kibana Upgrade Assistant flagged as "Index has more than one mapping type". The reindex completed without apparent error and the new index has the same number of documents, but the amount of data in terms of GB is much less. Kibana monitoring shows the original index as 27.0GB and the new one as 19.6GB, which is unnerving. I was thinking the size might be less due to there only being one document type, but a reduction in data size of about 27% seems large. Should I be concerned by this?
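
For reference, the reduction implied by the two Kibana figures can be worked out as below. This is only a small arithmetic sketch; the function name is made up for illustration.

```python
def size_reduction_percent(old_gb: float, new_gb: float) -> float:
    """Return how much smaller new_gb is than old_gb, as a percentage of old_gb."""
    if old_gb <= 0:
        raise ValueError("old_gb must be positive")
    return (old_gb - new_gb) / old_gb * 100.0


# Sizes reported by Kibana monitoring in the post above.
print(f"{size_reduction_percent(27.0, 19.6):.1f}% smaller")  # prints "27.4% smaller"
```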

Both indices have the same number of shards and replicas. I reindexed with:

POST _reindex
{
  "source": {
    "index": "linux-hosting-httpd-access-2019-41",
    "size": 2000
  },
  "dest": {
    "index": "linux-hosting-httpd-access-reindex-for-types-2019-41",
    "type": "_doc"
  }
}

The original index was created with Elasticsearch 5.6.14. The new index has been created with Elasticsearch 6.8.9.
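
If the same reindex has to be repeated for several indices, the request body above could be generated rather than typed by hand. The following is a minimal sketch under that assumption; `build_reindex_body` is a hypothetical helper, and the resulting JSON would still have to be sent to the `_reindex` endpoint by whatever client you use.

```python
import json


def build_reindex_body(source_index: str, dest_index: str, batch_size: int = 2000) -> dict:
    """Build a _reindex request body shaped like the one in the post."""
    return {
        # "size" inside "source" is the scroll batch size, not a document limit.
        "source": {"index": source_index, "size": batch_size},
        # A single "_doc" type in the destination, as used in the post.
        "dest": {"index": dest_index, "type": "_doc"},
    }


body = build_reindex_body(
    "linux-hosting-httpd-access-2019-41",
    "linux-hosting-httpd-access-reindex-for-types-2019-41",
)
print(json.dumps(body, indent=2))
```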

Bernt_Rostad · July 29, 2020, 2:28pm

No.

If you have the same document count in both the old and the new index, they contain the same documents. A re-index operation will either succeed or fail for each document, so you're not going to find partial documents in the new index. So the size difference on disk should have nothing to do with missing data.
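
One way to double-check this is to inspect the JSON that the `_reindex` call returns. The sketch below assumes the destination index was empty beforehand, so every source document should show up as `created`. The helper name and the sample response values are invented for illustration.

```python
def reindex_looks_complete(response: dict, expected_docs: int) -> bool:
    """True if the response reports no failures, timeouts or version conflicts
    and the number of created documents equals expected_docs."""
    return (
        not response.get("failures")
        and not response.get("timed_out", False)
        and response.get("version_conflicts", 0) == 0
        and response.get("created", 0) == expected_docs
    )


# Hypothetical response, trimmed to the fields the check uses.
sample = {
    "timed_out": False,
    "total": 1000,
    "created": 1000,
    "updated": 0,
    "version_conflicts": 0,
    "failures": [],
}
print(reindex_looks_complete(sample, expected_docs=1000))  # prints True
```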

There are a couple of common reasons why the disk usage may differ:

Mapping differences. The mapping determines how each field is stored, so if the mappings differ between the old and the new index, they will naturally use a different amount of disk.
Deleted documents. In the old index you may have many deleted documents, and they still take up disk space. When you re-index to a new index, the deleted documents are ignored. So the new index will have zero deleted documents and thus take up less disk.

There is also a third reason, which probably doesn't come into play in your case, and that is segments and merges. Each shard in an index is made up of immutable segments containing one or more documents. Over time Elasticsearch will merge the smaller segments into bigger ones as this makes searching more efficient. This will usually save disk space too because of a relatively large block size overhead for the small segments.
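
To check the first reason, the mappings of the two indices can be fetched and compared. Below is a rough sketch of such a comparison on two mapping dictionaries that have already been retrieved; `diff_mappings` and the `message` field are made up for the example.

```python
def diff_mappings(old: dict, new: dict, path: str = "") -> list:
    """Return dotted paths that were added (+), removed (-) or changed (~)."""
    differences = []
    for key in sorted(set(old) | set(new)):
        here = f"{path}.{key}" if path else key
        if key not in old:
            differences.append(f"+ {here}")
        elif key not in new:
            differences.append(f"- {here}")
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            # Both sides are objects: descend and compare their contents.
            differences.extend(diff_mappings(old[key], new[key], here))
        elif old[key] != new[key]:
            differences.append(f"~ {here}")
    return differences


# Hypothetical mappings that differ only in how one field is indexed.
old_mapping = {"properties": {"message": {"type": "text"}}}
new_mapping = {"properties": {"message": {"type": "keyword"}}}
print(diff_mappings(old_mapping, new_mapping))  # prints ['~ properties.message.type']
```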

I hope this answers your question.

mikewillis · July 30, 2020, 3:07pm

We never delete documents, so size reduction isn't due to deleted documents being ignored.

Looking at mappings for both indices, I noticed the original index has

     "_doc" : {
       "_all" : {
         "enabled" : true
       },

but the new index doesn't. Which I assume is because the _all field cannot be enabled in indices created with 6.0+ (see "_all field" in the Elasticsearch Guide [6.8]).
So I guess that accounts for the new index using so much less disk space.
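
The same check can be scripted against the output of a get-mapping request. This sketch assumes the 6.x response layout (index name, then `mappings`, then one entry per type) and only reports types where `_all` is explicitly enabled. The helper name is invented, and the sample dictionary is reduced to the fragment quoted above.

```python
def types_with_all_enabled(mapping_response: dict, index: str) -> list:
    """Return the mapping types of `index` that explicitly enable _all."""
    mappings = mapping_response[index]["mappings"]
    return sorted(
        type_name
        for type_name, type_mapping in mappings.items()
        if isinstance(type_mapping, dict)
        and type_mapping.get("_all", {}).get("enabled") is True
    )


old_index = "linux-hosting-httpd-access-2019-41"
# Trimmed to the part of the original index's mapping shown in the post.
old_response = {old_index: {"mappings": {"_doc": {"_all": {"enabled": True}}}}}
print(types_with_all_enabled(old_response, old_index))  # prints ['_doc']
```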

Our current index templates don't enable _all and I don't think we ever explicitly enabled it. It seems like it was enabled by default. The documentation for 5.6 says

The _all field is not free: it requires extra CPU cycles and uses more disk space. If not needed, it can be completely disabled or customised on a per-field basis.

whereas the equivalent bit of the 6.8 documentation says

The _all field is not free: it requires extra CPU cycles and uses more disk space. For this reason, it is disabled by default. If needed, it can be enabled.

Bernt_Rostad · July 31, 2020, 4:45am

Beware that an index may still contain deleted documents even if you don't do explicit deletes, because of the way the underlying Lucene engine works. If you update an existing document, or index a full document that reuses an existing id, the old version of that document is flagged as deleted but remains in the index until a segment merge takes place (or a re-index).

You can view the number of deleted documents by calling the index _stats endpoint. E.g.

curl -XGET 'http://localhost:9200/my-index/_stats' -s | jq '._all.primaries.docs'
{
  "count": 42352121,
  "deleted": 188788
}
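
To judge whether that number matters, it helps to express it as a share of all documents still held in the segments. A small sketch, assuming the `_stats` response has already been parsed into a dictionary; `deleted_fraction` is a made-up helper, and the figures are the ones from the example output above.

```python
def deleted_fraction(stats_response: dict) -> float:
    """Share of primary-shard documents that are flagged deleted but not yet merged away."""
    docs = stats_response["_all"]["primaries"]["docs"]
    total = docs["count"] + docs["deleted"]
    return docs["deleted"] / total if total else 0.0


stats = {"_all": {"primaries": {"docs": {"count": 42352121, "deleted": 188788}}}}
print(f"{deleted_fraction(stats):.2%} of stored documents are marked deleted")
```

With a share this small, deleted documents alone would not explain a size drop of more than a quarter.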

You're probably right though, that the removal of the "_all" mechanism in ES6 is the culprit. I remember having to rewrite a few mappings to restore that "search-in-all-fields" functionality in my post-ES5 indices.
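
The usual replacement for `_all` is the `copy_to` mapping parameter, which copies field values into a catch-all field that you define yourself. The sketch below is not the rewrite described in the post, just one possible way to do it. The helper name, the `all_text` field name and the sample fields are assumptions, and nested object fields are not handled.

```python
import copy


def add_catch_all(properties: dict, catch_all: str = "all_text") -> dict:
    """Return a copy of `properties` where each top-level text/keyword field
    copies its value into a single searchable `catch_all` text field."""
    result = copy.deepcopy(properties)
    for field in result.values():
        if field.get("type") in ("text", "keyword"):
            field["copy_to"] = catch_all
    # The catch-all field itself has to be declared in the mapping.
    result[catch_all] = {"type": "text"}
    return result


# Hypothetical fields for an access-log index.
props = {"request": {"type": "text"}, "status": {"type": "integer"}}
print(add_catch_all(props))
```

Searches that used to target `_all` would then query `all_text` instead. Unlike the old default, only the fields you choose to copy contribute to its disk usage.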

system · August 28, 2020, 4:45am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
