Linear growth of index size

amitavmohanty01 · March 4, 2019, 12:39pm

To test how index size grows relative to ingested data, we made 1 million calls inserting the same document into an index. We observed linear growth in index size.

The query to load the docs was:

curl -X POST "http://elkelastic01.myhost.com:<my_port>/tests/_doc" -H 'Content-Type: application/json' -d'
{ "field1" : "This index has only one sentence, nothing more than that" }
'

As Elasticsearch uses an inverted index, we anticipated non-linear growth (a plateau, to be specific). What are we missing? Please explain.
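For anyone repeating this test, here is a minimal sketch of how the same load could be prepared as `_bulk` payloads instead of one HTTP call per document. It assumes a 6.x cluster (the `_type` entry in the action line matches the `_doc` type used above); the `send_bulk` helper and its `base_url` argument are hypothetical names for illustration and are not called here.

```python
import json
import urllib.request


def build_bulk_body(index, doc_type, doc, count):
    """Return an NDJSON _bulk payload that indexes `doc` `count` times."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    source = json.dumps(doc)
    lines = []
    for _ in range(count):
        lines.append(action)   # action/metadata line
        lines.append(source)   # document line
    # The _bulk API expects a newline after the last line as well.
    return "\n".join(lines) + "\n"


def send_bulk(base_url, body):
    """Hypothetical helper: POST the payload to <base_url>/_bulk."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


doc = {"field1": "This index has only one sentence, nothing more than that"}
body = build_bulk_body("tests", "_doc", doc, 3)
print(body)
```

Batching a few thousand documents per request this way should leave the resulting documents unchanged, it only cuts down on round trips.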

Christian_Dahlqvist · March 4, 2019, 1:13pm

Each document gets assigned a unique id, which takes up space on disk. You also need to keep track of the data, which is likely to be reasonably linear. Given that your actual data should be very compact, it is possible that this per-document overhead is driving disk usage.
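A toy model may help show why the per-document part stays linear. The sketch below builds a naive in-memory inverted index over N copies of the test sentence. It is not Lucene's on-disk format (real postings are heavily compressed), and the whitespace tokenizer is an assumption made for illustration only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the list of document numbers that contain it."""
    postings = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Naive tokenizer: lowercase and split on whitespace.
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    return postings


sentence = "This index has only one sentence, nothing more than that"
for n in (1, 10, 100, 1000):
    index = build_inverted_index([sentence] * n)
    unique_terms = len(index)                              # size of the terms dictionary
    posting_entries = sum(len(p) for p in index.values())  # per-document bookkeeping
    print(n, unique_terms, posting_entries)
```

The number of unique terms stops growing after the first document, but every additional document still adds one entry to each posting list, so the total keeps growing with N.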

dadoonet · March 4, 2019, 1:43pm

Adding to that, every document is also stored in the _source field. You also did not share the mapping. If you are using the default mapping, a field1.keyword sub-field is most likely generated. It uses the doc_values data structure, which is a column-oriented data structure.

amitavmohanty01 · March 4, 2019, 7:26pm

Is there any alternative to it, or any way to reduce this space? Can we use an auto-increment id, or something else that is an integer and takes less space?

amitavmohanty01 · March 4, 2019, 7:36pm

The size with _source field is 39287618 bytes and the size when _source is removed is 37506437 bytes. So, I don't think _source is the culprit here. Am I missing something?

dadoonet · March 4, 2019, 7:50pm

Yeah, that's about 4.7% more.
As _source is compressed and your documents are very similar, that's probably why you don't see a big difference. That may not be the case with your final system, though.
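The arithmetic behind that comparison, as a small sketch. The two sizes are the ones reported in the previous post; the document count assumes that all 1 million inserts from the original test succeeded.

```python
with_source = 39_287_618     # bytes, _source enabled (reported above)
without_source = 37_506_437  # bytes, _source disabled (reported above)
doc_count = 1_000_000        # assumption: every insert in the test succeeded

saved = with_source - without_source
pct_more = 100 * saved / without_source

print(f"_source adds {saved} bytes ({pct_more:.2f}% more)")
print(f"bytes per document with _source:    {with_source / doc_count:.1f}")
print(f"bytes per document without _source: {without_source / doc_count:.1f}")
```

Either way the index costs a few dozen bytes per document, which fits the picture of per-document bookkeeping rather than _source dominating the size in this test.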

dadoonet · March 4, 2019, 7:51pm

I don't think so. This is not possible for now.

amitavmohanty01 · March 4, 2019, 8:06pm

Is there value in considering this for a future release?

Christian_Dahlqvist · March 4, 2019, 8:26pm

You can assign your own document IDs, which may be shorter and take up less space. There is always going to be data stored per document, so even though Elasticsearch uses a terms dictionary and compresses the source, the index size should grow linearly with the number of documents.
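A minimal sketch of what assigning your own compact IDs could look like when preparing `_bulk` action lines. The base-36 counter is just one possible scheme chosen for illustration, not something Elasticsearch prescribes, and whether it saves a meaningful amount of disk would have to be measured on your own data.

```python
import json

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(n):
    """Encode a non-negative integer as a short base-36 string."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, rem = divmod(n, 36)
        out.append(DIGITS[rem])
    return "".join(reversed(out))


def bulk_lines_with_ids(index, doc_type, doc, count):
    """Yield _bulk action/source line pairs using sequential base-36 IDs."""
    source = json.dumps(doc)
    for i in range(count):
        action = {"index": {"_index": index, "_type": doc_type, "_id": to_base36(i)}}
        yield json.dumps(action)
        yield source


doc = {"field1": "This index has only one sentence, nothing more than that"}
for line in bulk_lines_with_ids("tests", "_doc", doc, 2):
    print(line)
```

With this scheme, one million documents need IDs of at most four characters. Note that the caller becomes responsible for uniqueness: indexing with an ID that already exists overwrites the earlier document.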

amitavmohanty01 · March 5, 2019, 1:01pm

The mapping that was used is as follows:

{
 "tests" : {
   "mappings" : {
     "_doc" : {
       "_size" : {
         "enabled" : true
       },
       "_source" : {
         "enabled" : false
       },
       "properties" : {
         "field1" : {
           "type" : "text"
         }
       }
     }
   }
 }
}

amitavmohanty01 · March 5, 2019, 1:30pm

We tried a different source for documents. We pushed 32.3m documents into an index. The mapping used is as follows:

"mappings" : {
     "doc" : {
       "_size" : {
         "enabled" : true
       },
       "properties" : {
         "@timestamp" : {
           "type" : "date"
         },
         "@version" : {
           "type" : "text",
           "fields" : {
             "keyword" : {
               "type" : "keyword",
               "ignore_above" : 256
             }
           }
         },
         "hostname" : {
           "type" : "text",
           "analyzer" : "pattern"
         },
         "log_level" : {
           "type" : "keyword"
         },
         "message" : {
           "type" : "text",
           "analyzer" : "pattern"
         },
         "package" : {
           "type" : "text",
           "analyzer" : "pattern"
         },
         "source" : {
           "type" : "text",
           "analyzer" : "pattern"
         },
         "tags" : {
           "type" : "text",
           "fields" : {
             "keyword" : {
               "type" : "keyword",
               "ignore_above" : 256
             }
           }
         },
         "thread" : {
           "type" : "text",
           "analyzer" : "pattern"
         }
       }
     }
   }
 }

The index size we observed was 8.5 GB. We re-indexed the same data with _source disabled, and the size dropped to 4.7 GB. That restricted our search capabilities, so we enabled store on the fields, which brought the size up to 6.7 GB.
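For reference, a sketch of the kind of mapping change described here: disabling _source and setting the `store` parameter on each top-level field. The helper name and the three-field sample mapping are illustrative only; the post does not say exactly which fields had store enabled.

```python
import copy
import json


def with_stored_fields(type_mapping):
    """Return a copy of a type mapping with _source off and store on for every field."""
    result = copy.deepcopy(type_mapping)
    result["_source"] = {"enabled": False}
    for field in result.get("properties", {}).values():
        field["store"] = True  # keep the original field value retrievable
    return result


# Shortened sample based on the mapping above (assumption: top-level fields only).
doc_mapping = {
    "properties": {
        "@timestamp": {"type": "date"},
        "log_level": {"type": "keyword"},
        "message": {"type": "text", "analyzer": "pattern"},
    }
}

stored = with_stored_fields(doc_mapping)
print(json.dumps(stored, indent=2))
```

Storing only the fields that actually need to be returned in search results, rather than all of them, would be one way to land somewhere between the 4.7 GB and 6.7 GB figures.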

system · April 2, 2019, 1:30pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
