Linear growth of index size

To test how index size grows relative to the amount of ingested data, we made 1 million calls inserting the same document into an index. We observed linear growth in index size.

The request used to load the documents was:

curl -X POST "http://elkelastic01.myhost.com:<my_port>/tests/_doc" -H 'Content-Type: application/json' -d'
{ "field1" : "This index has only one sentence, nothing more than that" }
'
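
To observe the growth, the index size can be polled as documents are added, for example with the _cat/indices API (a minimal sketch, assuming the same host and index as above):

curl "http://elkelastic01.myhost.com:<my_port>/_cat/indices/tests?v&h=index,docs.count,store.size"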

Since Elasticsearch uses an inverted index, we anticipated non-linear growth (a plateau, to be specific). What are we missing? Please explain.


Each document is assigned a unique ID, which takes up space on disk. Elasticsearch also needs to keep track of each document, and that overhead is roughly linear in the number of documents. Given that your actual data is very compact, it is possible that this per-document overhead is what is driving disk usage.

Adding to that: every document is also stored in the _source field. You also did not share the mapping; if you are using the default dynamic mapping, a field1.keyword sub-field is most likely generated. Keyword fields use the doc_values data structure, which is a column-oriented data structure.
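
If the keyword sub-field is not needed, the dynamic mapping can be overridden by mapping field1 explicitly as text only. A minimal sketch, assuming Elasticsearch 6.x (where mapping types such as _doc still exist); the index name tests_explicit is hypothetical:

curl -X PUT "http://elkelastic01.myhost.com:<my_port>/tests_explicit" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "field1": { "type": "text" }
      }
    }
  }
}
'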

Is there any alternative, or any way to reduce this space? Can we use an auto-increment ID, or something integer-based that takes less space?

The size with the _source field is 39287618 bytes, and the size with _source removed is 37506437 bytes. So I don't think _source is the culprit here. Am I missing something?

Yeah, that's only some 4% more.
As _source is compressed, and because your documents are all very similar, that's probably why you don't see a big difference. That may not be the case with your final system, though.

I don't think so. This is not possible for now.

Is there value in considering this for a future release?

You can assign your own document IDs, which may be shorter and take up less space. There will always be some data stored per document, so even though Elasticsearch uses a terms dictionary and compresses the source, index size growth should be linear in the number of documents.
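
For example, a short custom ID can be passed in the URL with PUT instead of letting Elasticsearch generate one (a sketch using the same document as above; the ID 1 is illustrative):

curl -X PUT "http://elkelastic01.myhost.com:<my_port>/tests/_doc/1" -H 'Content-Type: application/json' -d'
{ "field1" : "This index has only one sentence, nothing more than that" }
'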

The mapping that was used is as follows:

{
  "tests" : {
    "mappings" : {
      "_doc" : {
        "_size" : {
          "enabled" : true
        },
        "_source" : {
          "enabled" : false
        },
        "properties" : {
          "field1" : {
            "type" : "text"
          }
        }
      }
    }
  }
}
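
For reference, an index with this mapping could be created as follows (a sketch; note that the _size field is provided by the mapper-size plugin, which has to be installed separately):

curl -X PUT "http://elkelastic01.myhost.com:<my_port>/tests" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "_size": { "enabled": true },
      "_source": { "enabled": false },
      "properties": {
        "field1": { "type": "text" }
      }
    }
  }
}
'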

We tried a different source of documents and pushed 32.3 million documents into an index. The mapping used is as follows:

{
  "mappings" : {
    "doc" : {
      "_size" : {
        "enabled" : true
      },
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "hostname" : {
          "type" : "text",
          "analyzer" : "pattern"
        },
        "log_level" : {
          "type" : "keyword"
        },
        "message" : {
          "type" : "text",
          "analyzer" : "pattern"
        },
        "package" : {
          "type" : "text",
          "analyzer" : "pattern"
        },
        "source" : {
          "type" : "text",
          "analyzer" : "pattern"
        },
        "tags" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "thread" : {
          "type" : "text",
          "analyzer" : "pattern"
        }
      }
    }
  }
}

The observed index size was 8.5 GB. We re-indexed the same data with _source removed, and the size dropped to 4.7 GB; however, this restricted our search capabilities, since _source is needed to return the original documents. So we enabled store on the individual fields, which brought the size back up to 6.7 GB.
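
Enabling store is done per field in the mapping, and stored fields are then requested explicitly at search time. A minimal sketch of both, using the message field from the mapping above (the index name logs and the query term are illustrative):

curl -X PUT "http://elkelastic01.myhost.com:<my_port>/logs" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "doc": {
      "properties": {
        "message": { "type": "text", "analyzer": "pattern", "store": true }
      }
    }
  }
}
'

curl -X GET "http://elkelastic01.myhost.com:<my_port>/logs/_search" -H 'Content-Type: application/json' -d'
{
  "stored_fields": [ "message" ],
  "query": { "match": { "message": "error" } }
}
'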
