Optimize for minimal storage space with many tiny documents

Andre_Hansel · November 29, 2015, 12:44am

I want to store a lot of data points to run aggregations on them. Here's my index definition:

mappings: {
  datapoint: {
    properties: {
      timestamp: { type: 'date', format: 'date_time'},
      value: { type: 'float', index: 'no', store: false, doc_values: true },
      metric_name: { type: 'string', index: 'not_analyzed', store: false, doc_values: true }
    },
    _all: { enabled: false},
    _source: { enabled: false}
  }
}

The two queries that I'd like to run on this data are: "For timestamp in range a-b and metric_name c, what is the sum of value?" and "Which metric_names do we have?"
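For reference, the two questions above can be sketched as search request bodies. This is a minimal sketch that assumes Elasticsearch 2.x query DSL (a `bool` query with a `filter` clause; on 1.x a `filtered` query would be needed instead). The function names, the aggregation names `value_sum` and `names`, the example metric name, and the 1000-bucket limit are illustrative assumptions, not part of the original post.

```python
import json


def sum_query(start, end, metric_name):
    """Request body: sum of `value` for one metric with start <= timestamp < end."""
    return {
        "size": 0,  # only the aggregation result is wanted, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {"timestamp": {"gte": start, "lt": end}}},
                    {"term": {"metric_name": metric_name}},
                ]
            }
        },
        # `value` is not indexed, but aggregations read it from doc values
        "aggs": {"value_sum": {"sum": {"field": "value"}}},
    }


def metric_names_query(max_names=1000):
    """Request body: list the distinct metric names via a terms aggregation."""
    return {
        "size": 0,
        "aggs": {
            "names": {"terms": {"field": "metric_name", "size": max_names}}
        },
    }


# Hypothetical example values; "cpu.load" is not a metric from the thread.
example = sum_query(
    "2015-11-01T00:00:00.000Z", "2015-11-02T00:00:00.000Z", "cpu.load"
)
example_json = json.dumps(example, indent=2)
```

Both bodies set `size` to 0 because only the aggregation output matters for these questions.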

There are 800 different metric names, each one about 80 bytes long. Since the cardinality is so low, I was hoping for a good compression ratio. There are 8 bytes per document of real (incompressible) data, plus the ID which is auto-generated by Elasticsearch.

Right now I am looking at about 3.5 GB for 100M documents of this type, i.e. about 37 bytes per document. Since I am creating a few million new datapoints per day, I'd like to know if I can get the storage requirements further down. Is there anything I can optimize further?
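The per-document figure can be checked with a few lines of arithmetic. This sketch assumes the index size is exactly 3.5 of whichever unit is meant; the figure of about 37 bytes matches when "GB" is read as binary gigabytes (2^30 bytes), while decimal gigabytes would give 35. The daily rate of 3 million documents is a hypothetical stand-in for "a few million".

```python
def bytes_per_doc(total_bytes, doc_count):
    """Average on-disk bytes per document."""
    return total_bytes / doc_count


DOCS = 100_000_000

per_doc_decimal = bytes_per_doc(3.5 * 10**9, DOCS)  # GB as 10^9 bytes
per_doc_binary = bytes_per_doc(3.5 * 2**30, DOCS)   # GB as 2^30 bytes

# Hypothetical growth estimate: 3 million new documents per day.
DOCS_PER_DAY = 3_000_000
daily_growth_mb = per_doc_binary * DOCS_PER_DAY / 10**6

print(f"decimal GB: {per_doc_decimal:.1f} bytes/doc")
print(f"binary GB:  {per_doc_binary:.1f} bytes/doc")
print(f"growth:     {daily_growth_mb:.0f} MB/day at {DOCS_PER_DAY} docs/day")
```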

warkolm · November 29, 2015, 1:24am

What version are you on?

AndreKR · November 29, 2015, 10:57am

The machine I was testing with has 1.7.3, because I wanted the Sense plugin for easier testing. I could upgrade to 2.0.

AndreKR · November 30, 2015, 7:48pm

I upgraded to 2.1 and set codec: best_compression and now I am at 25 bytes per document. That's already better but of course I'd still take suggestions to go down even further.
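The index creation body for this setup might look roughly like the sketch below. It assumes Elasticsearch 2.x, where `index.codec` is a static setting and is therefore given when the index is created; the mapping is the one from the first post, and the variable name `index_body` is illustrative.

```python
import json

# Sketch of an index creation body (assumes Elasticsearch 2.x syntax).
index_body = {
    "settings": {
        "index": {
            "codec": "best_compression",  # static setting, set at creation
        }
    },
    "mappings": {
        "datapoint": {
            "_all": {"enabled": False},
            "_source": {"enabled": False},
            "properties": {
                "timestamp": {"type": "date", "format": "date_time"},
                "value": {
                    "type": "float",
                    "index": "no",
                    "store": False,
                    "doc_values": True,
                },
                "metric_name": {
                    "type": "string",
                    "index": "not_analyzed",
                    "store": False,
                    "doc_values": True,
                },
            },
        }
    },
}

index_body_json = json.dumps(index_body, indent=2)
```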

ddorian43 · March 14, 2016, 11:45am

Can you store the metric name in a database and then reference metric.id, which will be an 8-byte integer?
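A minimal sketch of that idea, with an in-memory class standing in for the database table. The class name `MetricRegistry`, its methods, and the example metric names are hypothetical; a real setup would persist the mapping so that IDs stay stable across restarts.

```python
class MetricRegistry:
    """Maps metric names to small integer IDs and back (in-memory stand-in)."""

    def __init__(self):
        self._ids = {}    # name -> id
        self._names = []  # id -> name

    def id_for(self, name):
        """Return the ID for `name`, assigning the next free one if it is new."""
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def name_for(self, metric_id):
        """Resolve an ID back to its metric name, e.g. for query results."""
        return self._names[metric_id]


registry = MetricRegistry()
first = registry.id_for("servers.web01.cpu.load")   # hypothetical name
second = registry.id_for("servers.web01.mem.used")  # hypothetical name
again = registry.id_for("servers.web01.cpu.load")   # same ID as `first`
```

Documents would then carry the integer instead of the roughly 80-byte name, and the application would translate IDs back to names when presenting results.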

AndreKR · March 18, 2016, 2:19am

I have only about 1000 unique metric names. If replacing them with an integer value helps, there must be something seriously wrong with Elasticsearch's index format.

Christian_Dahlqvist · March 18, 2016, 5:43am

Are you letting Elasticsearch assign the IDs of the documents? If you know your data very well, you might be able to generate a unique key at the application level that is more compact and save space that way. When designing a key, it is worth reading the advice in this blog post.
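One way such an application-level key could look is sketched below. It assumes that every metric has a small integer ID that fits in 16 bits and that a metric never has two data points in the same millisecond, so the pair (metric ID, timestamp) is unique; neither assumption is confirmed in this thread, and the function name `compact_id` is hypothetical.

```python
import base64
import struct


def compact_id(metric_id, timestamp_ms):
    """Build a 14-character document ID from a metric ID and a timestamp.

    Packs the metric ID as an unsigned 16-bit integer and the timestamp
    (milliseconds since the epoch) as an unsigned 64-bit integer, big-endian,
    then encodes the 10 bytes as URL-safe base64 without padding.
    """
    raw = struct.pack(">HQ", metric_id, timestamp_ms)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


# Hypothetical example: metric 7 at 2015-11-29T00:44:00Z.
doc_id = compact_id(7, 1448757840000)
```

Whether this actually saves space compared with auto-generated IDs would have to be measured on real data.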

ddorian43 · March 18, 2016, 5:48pm

Make sure it's a short integer then. Also shorten the field names to 1 character and report back whether the size shrank.
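A sketch of what that experiment could look like, assuming Elasticsearch 2.x mapping syntax. The one-character field names `t`, `v`, and `m`, the helper `shorten`, and the `metric_ids` lookup are all hypothetical; `short` is a 16-bit signed integer type, which is enough for about 1000 metrics.

```python
# Mapping variant with one-character field names and a `short` metric ID.
short_mapping = {
    "datapoint": {
        "_all": {"enabled": False},
        "_source": {"enabled": False},
        "properties": {
            "t": {"type": "date", "format": "date_time"},
            "v": {"type": "float", "index": "no", "doc_values": True},
            "m": {"type": "short", "doc_values": True},
        },
    }
}


def shorten(doc, metric_ids):
    """Convert a document with the original field names to the short form.

    `metric_ids` maps metric names to integer IDs (hypothetical lookup).
    """
    return {
        "t": doc["timestamp"],
        "v": doc["value"],
        "m": metric_ids[doc["metric_name"]],
    }


# Hypothetical example document and lookup table.
metric_ids = {"servers.web01.cpu.load": 0}
short_doc = shorten(
    {
        "timestamp": "2015-11-29T00:44:00.000Z",
        "value": 0.42,
        "metric_name": "servers.web01.cpu.load",
    },
    metric_ids,
)
```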
