I'm evaluating whether to use Elasticsearch as an OLAP backend for our Reports+Dashboards feature
We have timeseries data that's ingested for different customers
The data between these customers is 100% independent. OLAP queries will always be made within a given customer's dataset.
From a modeling perspective, on paper it seems the most performant way to structure this data would be to use a data stream per customer. However, this will result in a large number of indices and shards, at least one per customer. This could number in the thousands, vs a smaller number if the data were collocated. Is there overhead per index/shard that would make this approach prohibitive?
Alternatively, we could put all customers in one data stream, but will this scale for large aggregations?
Or possibly using a fixed number of data streams and hashing the customers into one. But then it will be difficult to "rebalance" them down the line?
What's the recommended way to model something like this?
Use a single index for everyone, with tags for each customer and document-level security to manage access control
There is a 3rd mixed option, where you have larger customers in their own indices, and merge the smaller ones into a single index. That's a little more complex as you need something that manages this allocation etc.
With both of these you will want to deploy a hot/warm/cold architecture as well. You will also want to use ILM.
Appreciate the quick reply! I think we'll likely need to divide customers into different data streams, as if they're all in one, we'll likely end up with too many backing indexes (making reads slow).
So it sounds like we'll need to come up with a clientside routing scheme for indexing this data.
Even if we do some partitioning scheme, it's likely that customer usage patterns will change over time and necessitate migrating between different indexes.
Is there a best practice around how to migrate a customer's data from one index to another (without any gap in data)?
Best practice in my opinion is very subjective to the use case / requirements. There a number of ways to move data around indices (reindex, using ingestion tools like logstash etc) and some different strategies (combining, new index and shift alias etc)
I think really any of these designs / strategies / approaches need to practical and not overly complex to start, especially if the data a new use case and will grow ... There is a fine line between good planning, design and architecture and iteration as the use case grows which you seem to already be thinking about.
Routing to indices I think is Ok, I would not start off by trying to do custom routing to shards (not that you brought that up ...but it may come up at some point, often that does not work out well)
Sure. If we use a data stream to house this data, we'll have a rollover policy that limits the backing index size to ~50GB.
The amount of data across all customers will be a few TB. So per my understanding we'll end up with tens to potentially a hundred or more indexes backing this data stream (50GB per index, 20 per TB of stored data).
Anytime we issue a read query here, it will be routed to all indexes. Great for parallel work, but now query times are bound by whatever the slowest responding index is. I'm doubtful that query times would be optimal with this level of fragmentation, but just a feeling.
Do you know of people using a single data stream at this scale?
Just to reiterate, ILM is what you should be using here. And if you're going with a 50GB index, you only need max of 2 primary shards, anymore and you are likely wasting resources.
If you're suffering performance issues, it might be worth making another topic.
Yes, I'm familiar with ILM. It will automatically create a new index every 50GB of data (if configured as such). I have multiple TBs worth of data.
So if all this data is in one data stream, I'll end up with hundreds of indexes needing to be created to house the data, thus hundreds of indexes being read per query.
To clarify, my reads can span across the whole time range, not just the latest data.
It depends on "What Problematic Is?" what is the SLA for those Lifetime Queries is 1s, 10s, 1min OK? ...or OK for a small subset of queries? Elasticsearch querying against many shards is not necessarily an issue in a well designed system but there are trade offs as with any system.
We hear this quite often, and then often find in practices that 90% of the queries are a shorter time span (not saying your will be), so what do we want to design around? Again for 5% of the queries is a longer SLA ok?
And of course this goes back to how you partition up your data, what is the lifetime size of the data? How different will it be per customer? Are All customer access / queried the same? ... what is the lifetime at the beginning are you going to back load data?
Example I had ~3500 Customers
The top 100, the Next 400, and the last 3000 turned out to be a good way to partition my data, that came with time and experience... I did not know that up front... I guessed something close-ish but I refined over time. I also learned the access patterns over time and made sure that I spent most my time optimizing my top 100.
50G per Shard is a suggestion not a hard rule... perhaps some of the customer data could be pushed up to 80-100GB if it is not accessed as frequently or has a lower SLA (again just examples)
And BTW we have simply amazing consultants that live and breath this stuff everyday if you really wanted some detailed use case help ... but me... I would get started and dig in.
At this point it is theory, math, discussion my suggestion is you set up a small cluster and get some data loaded and start doing some benchmarking.
In my (personal) experience 45 GB is the ideal sweet spot. I'd also recommend on rolling over on 75 mil docs. Had several data sources with indices of 20-25 GB, but 100+ mil docs which had very bad search performance.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.