To test how index size grows relative to ingested data, we made 1 million calls inserting the same document into an index. We observed linear growth in index size.
curl -X POST "http://elkelastic01.myhost.com:<my_port>/tests/_doc" -H 'Content-Type: application/json' -d'
{ "field1" : "This index has only one sentence, nothing more than that" }
'
Since Elasticsearch uses an inverted index, we expected non-linear growth (a plateau, to be specific). What are we missing? Please explain.
Each document gets assigned a unique ID, which takes up space on disk. You also need to keep track of per-document data, which is likely to grow roughly linearly. Given that your actual data should be very compact, it is possible that this per-document overhead is what drives disk usage.
Adding that every document is also stored in the _source field. You also did not share the mapping. If you are using the default mapping, a field1.keyword sub-field is most likely generated as well. It uses the doc_values data structure, which is a column-oriented structure stored on disk.
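To make the mapping point concrete, here is a minimal sketch of an explicit mapping that indexes field1 as text only, so no field1.keyword sub-field (and no doc_values for it) is created. It assumes Elasticsearch 7 or later (typeless mappings) and a fresh index; the index name tests_v2 and the ES_URL variable are placeholders, not something from this thread. With ES_URL unset, the script only prints the request body.

```shell
# ES_URL is a placeholder for the base URL of the cluster; unset means dry run.
ES_URL="${ES_URL:-}"

# Explicit mapping: field1 is analyzed text only, with no keyword sub-field.
MAPPING='{
  "mappings": {
    "properties": {
      "field1": { "type": "text" }
    }
  }
}'

if [ -n "$ES_URL" ]; then
  # Create the (hypothetical) index tests_v2 with the explicit mapping.
  curl -X PUT "$ES_URL/tests_v2" -H 'Content-Type: application/json' -d "$MAPPING"
else
  # Dry run: just show the request body.
  printf '%s\n' "$MAPPING"
fi
```

Dropping the keyword sub-field only makes sense if you never aggregate, sort, or do exact-match filtering on field1.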
Is there any alternative to it, or any way to reduce this space? Can we use an auto-increment ID, or something else that is an integer and takes less space?
The size with _source field is 39287618 bytes and the size when _source is removed is 37506437 bytes. So, I don't think _source is the culprit here. Am I missing something?
Yeah, that's roughly 4.7% more.
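For reference, the two sizes quoted above can be compared with a quick calculation. This is only arithmetic on the reported numbers; the variable names are made up for the sketch.

```shell
WITH_SOURCE=39287618      # bytes, index with _source
WITHOUT_SOURCE=37506437   # bytes, index with _source removed

# Extra bytes attributable to _source.
DIFF=$((WITH_SOURCE - WITHOUT_SOURCE))

# Increase relative to the index without _source, as a percentage.
PCT=$(awk -v d="$DIFF" -v t="$WITHOUT_SOURCE" 'BEGIN { printf "%.1f", 100 * d / t }')

echo "_source adds $DIFF bytes, about $PCT% more"
```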
As _source is compressed and your documents are all very similar, that is probably why you don't see a big difference. That may not be the case with your final system, though.
You can assign your own document IDs, which may be shorter and take up less space. There is always going to be data stored per document, so even though Elasticsearch uses a terms dictionary and compresses the source, index size should still grow linearly with the number of documents.
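As a rough sketch of what assigning your own IDs looks like: Elasticsearch does not generate auto-increment IDs itself, so the client has to supply them, here via PUT /tests/_doc/<id> with a simple counter. ES_URL is a placeholder, and the three-document loop is only for illustration. With ES_URL unset, the script prints the requests instead of sending them.

```shell
# ES_URL is a placeholder for the base URL of the cluster; unset means dry run.
ES_URL="${ES_URL:-}"
DOC='{ "field1" : "This index has only one sentence, nothing more than that" }'

SENT=0
for ID in 1 2 3; do
  if [ -n "$ES_URL" ]; then
    # PUT with an explicit ID instead of POST with an auto-generated one.
    curl -s -X PUT "$ES_URL/tests/_doc/$ID" -H 'Content-Type: application/json' -d "$DOC"
  else
    # Dry run: show the request that would be sent.
    echo "PUT /tests/_doc/$ID $DOC"
  fi
  SENT=$((SENT + 1))
done
```

How much space shorter IDs actually save is something to measure on your own data rather than assume.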
The index we observed was 8.5 GB. We reindexed the same data with _source removed, and the size dropped to 4.7 GB. This restricted our search capabilities, so we enabled store on the fields, which brought the size back up to 6.7 GB.
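For anyone trying the same thing, a setup like the one described might look roughly like this: _source disabled, the field marked as stored, and a search that asks for it via stored_fields. This is a sketch under assumptions, not the exact mapping used above: it assumes Elasticsearch 7 or later, and the index name tests_stored, the single field field1, and the ES_URL variable are placeholders. With ES_URL unset, the script only prints the request bodies.

```shell
# ES_URL is a placeholder for the base URL of the cluster; unset means dry run.
ES_URL="${ES_URL:-}"

# Mapping with _source disabled and field1 stored separately.
MAPPING='{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "field1": { "type": "text", "store": true }
    }
  }
}'

# Search body that returns the stored field instead of _source.
SEARCH='{ "query": { "match_all": {} }, "stored_fields": ["field1"] }'

if [ -n "$ES_URL" ]; then
  curl -X PUT "$ES_URL/tests_stored" -H 'Content-Type: application/json' -d "$MAPPING"
  curl -X GET "$ES_URL/tests_stored/_search" -H 'Content-Type: application/json' -d "$SEARCH"
else
  # Dry run: show both request bodies.
  printf '%s\n%s\n' "$MAPPING" "$SEARCH"
fi
```

Keep in mind that with _source disabled, features that depend on the original document, such as update and reindex from this index, are no longer available.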