Optimize for minimal storage space with many tiny documents


(André Hänsel) #1

I want to store a lot of data points to run aggregations on them. Here's my index definition:

mappings: {
  datapoint: {
    properties: {
      timestamp: { type: 'date', format: 'date_time'},
      value: { type: 'float', index: 'no', store: false, doc_values: true },
      metric_name: { type: 'string', index: 'not_analyzed', store: false, doc_values: true }
    },
    _all: { enabled: false},
    _source: { enabled: false}
  }
}

The two queries that I'd like to run on this data are: "For timestamp in range a-b and metric_name c, what is the sum of value?" and "Which metric_names do we have?"
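
For illustration, the two queries could be expressed as request bodies roughly like the following. This is only a sketch, not tested against a cluster: it assumes the `filtered` query syntax of the 1.x/2.x era and the field names from the mapping above. The helper names `sum_query` and `metric_names_query` and the aggregation names `value_sum` and `names` are made up for this example.

```python
import json


def sum_query(start, end, metric_name):
    """Body for: sum of `value` for one metric within a timestamp range."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [
                            {"range": {"timestamp": {"gte": start, "lt": end}}},
                            {"term": {"metric_name": metric_name}},
                        ]
                    }
                }
            }
        },
        "aggs": {"value_sum": {"sum": {"field": "value"}}},
    }


def metric_names_query(max_names=1000):
    """Body for: which metric_names exist (terms aggregation, no hits)."""
    return {
        "size": 0,
        "aggs": {"names": {"terms": {"field": "metric_name", "size": max_names}}},
    }


# Example values are placeholders in the 'date_time' format of the mapping.
body = sum_query("2015-11-01T00:00:00.000Z", "2015-11-02T00:00:00.000Z", "c")
print(json.dumps(body, indent=2))
```

Both bodies rely only on the indexed `timestamp` and `metric_name` fields for filtering and on doc values for the aggregations, which is why `_source` can stay disabled.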

There are 800 different metric names, each one about 80 bytes long. Since the cardinality is so low, I was hoping for a good compression ratio. There are 8 bytes per document of real (incompressible) data, plus the ID which is auto-generated by Elasticsearch.

Right now I am looking at about 3.5 GB for 100M documents of this type, i.e. about 37 bytes per document. Since I am creating a few million new datapoints per day, I'd like to know if I can get the storage requirements further down. Is there anything I can optimize further?
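
As a quick check of the arithmetic, here is a small sketch. It assumes the index size is reported in binary gigabytes (GiB), which is what makes the per-document figure come out near 37; with decimal gigabytes it would be 35. The figure of 3 million documents per day is a placeholder for "a few million", not a measured number.

```python
GIB = 1024 ** 3  # binary gigabyte, assumed unit of the reported index size


def bytes_per_doc(index_bytes, doc_count):
    """Average on-disk bytes per document."""
    return index_bytes / doc_count


def daily_growth_mb(docs_per_day, per_doc_bytes):
    """Projected index growth per day in decimal megabytes."""
    return docs_per_day * per_doc_bytes / 1e6


per_doc = bytes_per_doc(3.5 * GIB, 100_000_000)
print(round(per_doc, 1))                                # 37.6
print(round(bytes_per_doc(3.5e9, 100_000_000), 1))      # 35.0
print(round(daily_growth_mb(3_000_000, per_doc), 1))    # 112.7 (placeholder volume)
```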


(Mark Walkom) #2

What version are you on?


(André Hänsel) #3

The machine I was testing with has 1.7.3, because I wanted the Sense plugin for easier testing. I could upgrade to 2.0.


(André Hänsel) #4

I upgraded to 2.1 and set codec: best_compression and now I am at 25 bytes per document. That's already better but of course I'd still take suggestions to go down even further.
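
For reference, a sketch of how the index creation body might look with that setting. It assumes `index.codec` is given at index creation time and reuses the mapping from the first post. The helper name `create_index_body` is only for this example, and the body has not been verified against a live 2.1 node.

```python
import json


def create_index_body():
    """Index creation body: best_compression codec plus the datapoint mapping."""
    return {
        "settings": {"index": {"codec": "best_compression"}},
        "mappings": {
            "datapoint": {
                "_all": {"enabled": False},
                "_source": {"enabled": False},
                "properties": {
                    "timestamp": {"type": "date", "format": "date_time"},
                    "value": {
                        "type": "float",
                        "index": "no",
                        "store": False,
                        "doc_values": True,
                    },
                    "metric_name": {
                        "type": "string",
                        "index": "not_analyzed",
                        "store": False,
                        "doc_values": True,
                    },
                },
            }
        },
    }


# json.dumps turns the Python booleans into JSON true/false.
print(json.dumps(create_index_body(), indent=2))
```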


(ddorian43) #5

Can you store the metric name in a database and then reference metric.id, which would be an 8-byte integer?
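
A minimal sketch of the idea, with an in-memory table standing in for the database. The class name `MetricRegistry` and its methods are hypothetical; a real setup would persist the mapping so that IDs stay stable across restarts.

```python
class MetricRegistry:
    """Maps metric names to small integer IDs and back (in-memory stand-in)."""

    def __init__(self):
        self._ids = {}     # name -> id
        self._names = []   # id -> name

    def id_for(self, name):
        """Return the ID for a name, assigning the next free one if it is new."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_for(self, metric_id):
        """Resolve an ID back to the original metric name."""
        return self._names[metric_id]


registry = MetricRegistry()
doc = {"timestamp": "2015-11-01T00:00:00.000Z",
       "value": 1.5,
       "metric_id": registry.id_for("some.long.metric.name")}
print(doc["metric_id"], registry.name_for(doc["metric_id"]))
```

Query results would then need one extra lookup step to turn the IDs back into names.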


(André Hänsel) #6

I have only about 1000 unique metric names. If replacing them with integer values helps, there must be something seriously wrong with Elasticsearch's index format.


(Christian Dahlqvist) #7

Are you letting Elasticsearch assign the IDs of the documents? If you know your data very well, you might be able to generate a unique key at the application level that is more compact and save space that way. When designing a key, it is worth reading the advice in this blog post.
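
One possible shape for such a key, purely as an illustration: pack a numeric metric ID and the timestamp into 10 bytes and encode them as URL-safe base64 without padding, which gives 14 characters. This assumes at most one datapoint per metric per millisecond and that metrics have numeric IDs below 65536; neither is stated in the thread. The function names are hypothetical, and whether this actually saves space would have to be measured.

```python
import base64
import struct


def compact_id(metric_id, timestamp_ms):
    """Build a 14-character document ID from (metric_id, timestamp in ms)."""
    raw = struct.pack(">HQ", metric_id, timestamp_ms)  # 2 + 8 = 10 bytes
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_id(key):
    """Inverse of compact_id: recover (metric_id, timestamp_ms)."""
    padded = key + "=" * (-len(key) % 4)  # restore the stripped padding
    return struct.unpack(">HQ", base64.urlsafe_b64decode(padded))


key = compact_id(42, 1446336000000)
print(key, len(key), parse_id(key))
```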


(ddorian43) #8

Make sure it's a short integer then. Also shorten the field names to 1 character and report back whether the size shrank.
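
Something like the following mapping, as a sketch. The one-character field names `t`, `v` and `m` are arbitrary choices for this example, and `m` is assumed to hold a numeric metric ID that fits the `short` type (up to 32767, enough for about 1000 metrics).

```python
def short_mapping():
    """Mapping variant with 1-character field names and a short metric ID."""
    return {
        "datapoint": {
            "_all": {"enabled": False},
            "_source": {"enabled": False},
            "properties": {
                # t = timestamp, v = value, m = numeric metric ID
                "t": {"type": "date", "format": "date_time"},
                "v": {"type": "float", "index": "no", "doc_values": True},
                "m": {"type": "short", "doc_values": True},
            },
        }
    }


SHORT_MAX = 2 ** 15 - 1  # largest value a signed 16-bit short can hold
print(sorted(short_mapping()["datapoint"]["properties"]), SHORT_MAX)
```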

