I've been trying to find ways to optimize my index size in Elasticsearch. In order to understand the size impact of the _source field, I dumped an index and re-indexed into another index with _source {enabled:false}. The original index was 466.9mb. I was somewhat shocked when the index with _source disabled actually resulted in an index size of 467.1mb... a slight increase in size. How is this possible?
I verified that the index/_mappings show _source disabled, my search results don't include _source and when I try to ask for _source=somefield, I get an error message. So, I'm fairly certain that the _source is actually removed in my test index.
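For anyone who wants to repeat this comparison, here is a minimal sketch in Python that only builds the request bodies and reads the size out of a stats response. It assumes the copy is done with `POST /_reindex` (the post above only says the index was dumped and re-indexed, so that part is a guess), and the helper names are hypothetical, not part of any client library.

```python
def create_index_body(source_enabled: bool) -> dict:
    """Body for PUT /<index> that toggles the _source metadata field."""
    return {"mappings": {"_source": {"enabled": source_enabled}}}


def reindex_body(source_index: str, dest_index: str) -> dict:
    """Body for POST /_reindex copying documents from one index to another."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def primary_store_bytes(stats_response: dict) -> int:
    """Read the primary-shard store size from a GET /<index>/_stats/store response."""
    return stats_response["_all"]["primaries"]["store"]["size_in_bytes"]
```

Send `create_index_body(False)` when creating the test index, run the reindex, then compare `primary_store_bytes(...)` for the two indices. Forcing a merge on both before comparing may make the numbers steadier, since segment layout also affects size.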
I'm not sure that I actually want to remove _source in my production environment, but I'd really like to understand the storage optimizations that make it possible to store _source without adding additional size to the index.
I wouldn't remove _source. I think disabling it is going to be deprecated in the future, as it severely limits other functionality, so you're likely to be on the back foot if you do this. (See the related warning here: _source field | Elasticsearch Guide [7.14] | Elastic.)
It'd help if you shared the mappings, and sample docs, so we can take a closer look.
Thanks, I wasn't planning on removing _source, but wanted to understand the impact that it had on index size. I found my answer in this issue: https://github.com/elastic/elasticsearch/issues/41628#issuecomment-488155381. If _source is disabled and soft deletes are enabled, Elasticsearch automatically stores the source in a stored field called _recovery_source. Because soft deletes are enabled by default, disabling _source therefore appears not to reduce the size of the index.
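As a rough illustration of that rule, the sketch below encodes the condition described in the linked issue and puts the size difference from my test into perspective. The function names are hypothetical, and the rule is only my reading of that issue comment, not an official API.

```python
def keeps_recovery_source(source_enabled: bool, soft_deletes_enabled: bool = True) -> bool:
    """True when the source would still be stored as _recovery_source.

    Assumption (from the linked issue): this happens when _source is
    disabled in the mapping while soft deletes are enabled on the index.
    """
    return (not source_enabled) and soft_deletes_enabled


def relative_change(before_mb: float, after_mb: float) -> float:
    """Fractional size change between two index sizes given in the same unit."""
    return (after_mb - before_mb) / before_mb


# Sizes reported earlier in this thread: 466.9mb with _source, 467.1mb without.
change = relative_change(466.9, 467.1)
print(f"size change: {change:.4%}")
print("source still stored:", keeps_recovery_source(source_enabled=False))
```

With the numbers from this thread the difference is a small fraction of a percent, which fits the idea that the same source bytes are still being stored under a different field.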