I want to store a lot of data points to run aggregations on them. Here's my index definition:
mappings: {
  datapoint: {
    properties: {
      // timestamp stays indexed so it can be filtered by range
      timestamp: { type: 'date', format: 'date_time' },
      // value is never searched on: no inverted index, doc_values only (for aggregations)
      value: { type: 'float', index: 'no', store: false, doc_values: true },
      // metric_name is filtered on as an exact term, hence not_analyzed
      metric_name: { type: 'string', index: 'not_analyzed', store: false, doc_values: true }
    },
    // _all and _source are disabled to save space
    _all: { enabled: false },
    _source: { enabled: false }
  }
}
The two queries that I'd like to run on this data are: "For timestamp in range a-b and metric_name c, what is the sum of value?" and "Which metric_names do we have?"
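In query-DSL terms I picture them roughly like this (just a sketch: the index name 'datapoints', the 2.x-style bool/filter syntax, and the placeholders a, b, c are assumptions on my part):

  GET /datapoints/_search
  {
    "size": 0,
    "query": {
      "bool": {
        "filter": [
          { "range": { "timestamp": { "gte": "a", "lte": "b" } } },
          { "term": { "metric_name": "c" } }
        ]
      }
    },
    "aggs": {
      "total_value": { "sum": { "field": "value" } }
    }
  }

and, for the list of metric names,

  GET /datapoints/_search
  {
    "size": 0,
    "aggs": {
      "metric_names": { "terms": { "field": "metric_name", "size": 800 } }
    }
  }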
There are 800 different metric names, each about 80 bytes long. Since the cardinality is so low, I was hoping for a good compression ratio. There are 8 bytes of real (incompressible) data per document, plus the ID, which is auto-generated by Elasticsearch.
Right now I am looking at about 3.5 GB for 100M documents of this type, i.e. about 37 bytes per document. Since I am creating a few million new data points per day, I'd like to know whether I can bring the storage requirements down further. Is there anything else I can optimize?
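For what it's worth, the on-disk size per index can be checked with the cat-indices API (index name 'datapoints' is again a placeholder):

  GET /_cat/indices/datapoints?v&h=index,docs.count,store.size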