As part of preparing to move from Elasticsearch 6 to 7, I've reindexed an index which the Kibana Upgrade Assistant flagged as "Index has more than one mapping type". The reindex completed without apparent error and the new index has the same number of documents, but it takes up considerably less disk space. Kibana monitoring shows the original index as 27.0GB and the new one as 19.6GB, which is unnerving. I was thinking the size might be smaller because there is now only one document type, but a reduction of around 25% seems large. Should I be concerned by this?
Both indices have the same number of shards and replicas. I reindexed with:
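Roughly along these lines (index names here are placeholders):

```
POST _reindex
{
  "source": {
    "index": "old-index"
  },
  "dest": {
    "index": "new-index"
  }
}
```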
If the old and the new index have the same document count, no documents are missing. A re-index operation either succeeds or fails for each document, so you're not going to find partial documents in the new index. The size difference on disk has nothing to do with missing data.
There are a couple of common reasons why the disk usage may differ:
Mapping differences. The mapping determines how each field is stored, so if the mappings differ between the old and the new index, they will naturally use different amounts of disk.
Deleted documents. In the old index you may have many deleted documents, and they still take up disk space. When you re-index to a new index, the deleted documents are ignored. So the new index will have zero deleted documents and thus take up less disk.
There is also a third reason, which probably doesn't come into play in your case, and that is segments and merges. Each shard in an index is made up of immutable segments containing one or more documents. Over time Elasticsearch merges the smaller segments into bigger ones, as this makes searching more efficient. This usually saves disk space too, because of the relatively large per-segment overhead carried by small segments.
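If you want to see how that plays out for your indices, the _cat segments API lists each shard's segments with their document counts and sizes, for example (index name is a placeholder):

```
GET _cat/segments/my-index?v&h=shard,segment,docs.count,docs.deleted,size
```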
We never delete documents, so size reduction isn't due to deleted documents being ignored.
Looking at mappings for both indices, I noticed the original index has
"_doc" : {
"_all" : {
"enabled" : true
},
but the new index doesn't, which I assume is because the _all field cannot be enabled in indices created with 6.0+ ("_all field", Elasticsearch Guide [6.8]).
So I guess that accounts for the new index using so much less disk space.
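For anyone checking the same thing on their own indices, the mappings for both can be fetched and compared (index names are placeholders):

```
GET old-index/_mapping
GET new-index/_mapping
```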
Our current index templates don't enable _all and I don't think we ever explicitly enabled it, so it seems it was enabled by default. The documentation for 5.6 says:
The _all field is not free: it requires extra CPU cycles and uses more disk space. If not needed, it can be completely disabled or customised on a per-field basis.
whereas the equivalent part of the 6.8 documentation says:
The _all field is not free: it requires extra CPU cycles and uses more disk space. For this reason, it is disabled by default. If needed, it can be enabled.
Beware that an index may still contain deleted documents even if you never do explicit deletes, because of the way the underlying Lucene engine works. If you update an existing document, or index a full document reusing an existing id, the old version of that document is flagged as deleted but remains in the index until a segment merge takes place (or a re-index).
You can view the number of deleted documents by calling the index _stats end-point. E.g.
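With a placeholder index name:

```
GET my-index/_stats/docs
```

The docs section of the response shows both count and deleted per index, so a large deleted count there would point at this explanation.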
You're probably right, though, that _all no longer being available in 6.0+ indices is the culprit. I remember having to rewrite a few mappings to restore that "search-in-all-fields" functionality in my post-ES5 indices.
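In case it helps anyone doing the same migration, the usual replacement is a copy_to field that gathers the fields you want searchable into one place. A minimal sketch with placeholder index and field names, using a 6.x-style mapping with the _doc type (in 7.x the same properties go directly under "mappings"):

```
PUT my-index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "copy_to": "all_fields"
        },
        "body": {
          "type": "text",
          "copy_to": "all_fields"
        },
        "all_fields": {
          "type": "text"
        }
      }
    }
  }
}
```

Queries that used to rely on _all can then target all_fields instead. Note that, like _all, the extra field costs additional disk space.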