Indexing strategy for sparse data


I have a design-related conundrum, and I'm hoping somebody might have already come across the same problem.

We have set up ELK to ingest logs from a variety of sources. These come in different types (such as Linux logs, nginx logs, etc.), which I will call groups. Metrics will be parsed out of these logs. A minority of the metrics are common and will be normalised across all groups, but most of them will be group-specific.

We are required to provide visualisations of these metrics. Norms are not required on these fields, but I believe doc_values are (see below). The number of groups is expected to grow as the service gains acceptance in our enterprise.

I'm not sure how to best set up our indexing strategy, though. I have the following alternatives:

  1. Use a separate index series for each group (for example log-linux-*, log-nginx-*). This produces dense data in each index. However, the number of shards/indices on my hot nodes will scale with the number of groups, which may become a problem in the future.

  2. Use a common index for all data. This will produce a sparse data set, but the number of shards will be minimal. I'm not sure how big a problem the sparsity is. As far as I'm aware the only real cost is wasted disk space, but maybe you are aware of others?

  3. Use a common index for all data, turning off doc_values for all fields except the normalised minority. I'm not sure whether this would make the metrics unusable for visualisations. As far as I understand, doc_values are needed for all aggregations, so even an average would not work, which would make a simple line graph over time impossible. Please correct me if I'm wrong.

  4. Some other strategy I haven't thought of.
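To make option 3 concrete, here is a minimal sketch of the mapping fragment it implies, written in Python so it can be inspected before being sent anywhere. The field names (`response_time_ms`, `bytes_sent`) and the template names are hypothetical, and the fragment is assumed to sit inside the mapping of the common index. Whether fields mapped this way remain usable in visualisations is exactly the open question in option 3.

```python
import json


def sparse_mapping(normalised_fields):
    """Build a mapping fragment for option 3 (illustrative only).

    Fields listed in `normalised_fields` are mapped explicitly and keep
    doc_values. Any other numeric field picked up by dynamic mapping gets
    doc_values disabled through a dynamic template.
    """
    return {
        "dynamic_templates": [
            {
                # Hypothetical template name: group-specific integer metrics.
                "group_specific_longs": {
                    "match_mapping_type": "long",
                    "mapping": {"type": "long", "doc_values": False},
                }
            },
            {
                # Hypothetical template name: group-specific float metrics.
                "group_specific_doubles": {
                    "match_mapping_type": "double",
                    "mapping": {"type": "double", "doc_values": False},
                }
            },
        ],
        # Explicit properties take precedence over dynamic templates.
        "properties": {
            name: {"type": "double", "doc_values": True}
            for name in normalised_fields
        },
    }


if __name__ == "__main__":
    # Hypothetical normalised metrics shared by all groups.
    print(json.dumps(sparse_mapping(["response_time_ms", "bytes_sent"]), indent=2))
```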

I appreciate any thoughts or pointers you might have.


On option 1: this depends on the data volume, but having lots of small indices and shards is inefficient. If you have a long retention period, however, you can get around this by switching to, for example, monthly indices.
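As a rough illustration of why the index period matters, the sketch below counts how many indices one group accumulates over a retention window with daily versus monthly naming. The `log-<group>-` prefix follows the examples in the question; the date suffix formats and the 90-day retention are assumptions made for the example.

```python
from datetime import date, timedelta


def index_name(group, day, monthly=False):
    """Return the index name a document from `day` would be written to."""
    # Assumed suffix formats: YYYY.MM for monthly, YYYY.MM.DD for daily.
    suffix = day.strftime("%Y.%m") if monthly else day.strftime("%Y.%m.%d")
    return f"log-{group}-{suffix}"


def indices_in_retention(group, last_day, retention_days, monthly=False):
    """Return the set of index names covering the retention window."""
    days = (last_day - timedelta(days=n) for n in range(retention_days))
    return {index_name(group, d, monthly) for d in days}


if __name__ == "__main__":
    last = date(2017, 6, 30)
    daily = indices_in_retention("nginx", last, 90)
    monthly = indices_in_retention("nginx", last, 90, monthly=True)
    print(len(daily), "daily indices vs", len(monthly), "monthly indices")
```

The total shard count is then roughly the number of groups times the indices per group times the primary shards per index, so a longer index period reduces it directly.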

On option 2: this is a common approach, and what Beats do. While it can lead to increased size on disk in Elasticsearch 5.x, this is being improved in the upcoming Elasticsearch 6.0 release. As 6.0 is not far away, this is probably the approach I would recommend.

Thanks Christian, that helps a lot.
