- I'm evaluating whether to use Elasticsearch as an OLAP backend for our Reports+Dashboards feature
- We have timeseries data that's ingested for different customers
- The data between these customers is 100% independent. OLAP queries will always be made within a given customer's dataset.
From a modeling perspective, on paper the most performant way to structure this data seems to be one data stream per customer. However, this results in a large number of indices and shards, at least one per customer, which could number in the thousands, versus a much smaller number if the data were colocated. Is there per-index/per-shard overhead that would make this approach prohibitive?
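To make the per-customer option concrete, here's a minimal sketch of what that layout could look like. The stream prefix `reports-`, the field names, and the naming convention are all assumptions for illustration, not something prescribed by Elasticsearch; the one real mechanism used is that an index template containing a `data_stream` object makes matching names behave as data streams.

```python
# Per-customer data stream layout (sketch). Naming convention and field
# names are illustrative assumptions.

def data_stream_name(customer_id: str) -> str:
    """Map a customer to its dedicated data stream (one stream per customer)."""
    return f"reports-{customer_id.lower()}"

# Body for PUT _index_template/reports-per-customer. The empty
# "data_stream" object is what turns matching indices into data streams;
# data streams require an @timestamp field.
index_template = {
    "index_patterns": ["reports-*"],
    "data_stream": {},
    "template": {
        "settings": {"number_of_shards": 1},  # still >= 1 shard per customer
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "customer_id": {"type": "keyword"},
            }
        },
    },
}

print(data_stream_name("AcmeCorp"))  # reports-acmecorp
```

Note the shard-count concern in the question shows up directly here: even with `number_of_shards: 1`, every customer contributes at least one shard per backing index, multiplied by rollover generations.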
Alternatively, we could put all customers in one data stream, but will this scale for large aggregations?
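For comparison, a query against the shared-stream option would look roughly like this sketch. The `customer_id` and `@timestamp` field names are assumptions carried over from above; the shape (a `bool` `filter` clause plus a `date_histogram` aggregation) is standard Elasticsearch query DSL, and putting the tenant predicate in `filter` rather than `must` keeps it non-scoring and cacheable.

```python
# Sketch of an aggregation-only OLAP query against one shared data stream,
# always scoped to a single customer. Field names are illustrative.

def customer_agg_query(customer_id: str, interval: str = "1d") -> dict:
    return {
        "size": 0,  # aggregation-only; don't return individual hits
        "query": {
            "bool": {
                # Non-scoring, cacheable tenant filter.
                "filter": [{"term": {"customer_id": customer_id}}]
            }
        },
        "aggs": {
            "per_interval": {
                "date_histogram": {
                    "field": "@timestamp",
                    "calendar_interval": interval,
                }
            }
        },
    }
```

The scaling question then becomes: every such query still has to run against shards that mostly contain other customers' documents, even though the filter discards them cheaply.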
Or possibly use a fixed number of data streams and hash each customer into one of them. But then it would be difficult to "rebalance" customers across streams down the line?
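A sketch of the hashing option, with the stream count and naming again being assumptions. It also makes the rebalancing worry explicit: with a plain modulo hash, changing the stream count remaps most customers, which would mean reindexing or querying both old and new streams during a migration.

```python
# Sketch: hash customers into a fixed number of data streams.
import hashlib

NUM_STREAMS = 16  # fixed up front; changing it remaps most customers

def stream_for(customer_id: str, num_streams: int = NUM_STREAMS) -> str:
    # Use a stable hash; Python's built-in hash() is randomized per process,
    # so it must not be used for routing decisions like this.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % num_streams
    return f"reports-{bucket:02d}"

# Illustration of the rebalancing problem: count how many of 1000 synthetic
# customers land in a different stream after growing 16 -> 17 streams.
moved = sum(
    1
    for i in range(1000)
    if stream_for(f"cust-{i}", 16) != stream_for(f"cust-{i}", 17)
)
print(f"{moved}/1000 customers remapped")  # with modulo hashing, most move
```

Consistent hashing would reduce the fraction of customers that move when the stream count changes, but the data already written under the old assignment still has to be reindexed either way.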
What's the recommended way to model something like this?