I know there's a high level of "it depends" when it comes to splitting up data into indexes and shards. I still want to share some thoughts to try and find the best separation for my data.
I currently have a single node with a large number of indexes and shards (thousands). It has been quite a long process over multiple configurations and major versions of Elasticsearch to get this reduced. My data is split into one or more indexes per customer. Indexes started with 5 shards each, then 3, now 1. A customer can have multiple indexes, and these are not time-based. I want to boil this down to one index per customer, plus some additional fields in the mapping to separate documents into categories. As a quick note, the data that is currently spread across multiple indexes is of the same type, so it fits perfectly into one index, as long as I can filter on it at query time. Looking at the size of my indexes, none of them are above 2GB and most are in the MB range. For my "one-index-per-customer" strategy, I have been following the recommendations here: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
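To make the target layout concrete, here is a rough sketch of what I mean by one index per customer with a category field for filtering. The index naming scheme (`customer-42`) and the field names are just placeholders, not my actual mapping:

```python
# Sketch of the "one index per customer" layout: documents that used to
# live in separate indexes are tagged with a category field instead, so
# they can be filtered at query time. Names here are hypothetical.
mapping = {
    "settings": {
        "number_of_shards": 1,   # the target: a single shard per customer
        "number_of_replicas": 0,
    },
    "mappings": {
        "properties": {
            # keyword type: exact-match filtering, no text analysis
            "category": {"type": "keyword"},
            "created_at": {"type": "date"},
            "body": {"type": "text"},
        }
    },
}

def index_name(customer_id: int) -> str:
    """Build the per-customer index name (naming scheme is an assumption)."""
    return f"customer-{customer_id}"

print(index_name(42))  # -> customer-42
```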
So, here are my concerns. I have been preparing the new data model and wanted to run some performance tests on it to make sure that I didn't break anything. I have been running the same test on three different models:
- Multiple indexes - one shard each
- One index - three shards
- One index - one shard
As explained above, number 1 is my current model, and I was aiming for model 3 once the refactoring is done. For the test, I indexed 250,000 documents. All three scenarios took around the same time to index. I then measured query performance on all three models. The query on model 1 searched inside 1 of the indexes. The queries on models 2 and 3 searched inside the single index with a constant score filter. Searching in model 1 was around 20% faster than in models 2 and 3. Since there is no filter when searching a dedicated index, I can understand why this would be a bit faster. I was a bit surprised, though, that subsequent queries were also faster than on the other two models, since the filter result should be highly cached. As for queries on models 2 and 3, performance was slightly better on model 2. I'm running the test on a multi-core machine, which is probably why distributing across shards helps here.
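For reference, the filtered query for models 2 and 3 has roughly this shape (the `category` field name is a placeholder, my real field differs):

```python
# Build the search body used against the single shared index: a
# constant_score query runs the term filter in filter context, so it
# skips relevance scoring and its result can be cached by the node.
# "category" stands in for whatever field separates the old indexes.
def constant_score_body(category: str) -> dict:
    return {
        "query": {
            "constant_score": {
                "filter": {"term": {"category": category}}
            }
        }
    }

body = constant_score_body("invoices")
```

Model 1 runs the same search without the `constant_score` wrapper, since the index itself already scopes the documents.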
What are your thoughts? I'm a bit disappointed that the model I was aiming for (3) actually seems to have the poorest query performance of the three, even though it fits the recommendations best and reduces the number of shards the most. As a side note, my system indexes a lot more than it stores, so indexing performance matters most to me. But all three tests had almost the same indexing performance.