I have been waging war on "Shards" to "node" ratio
I have something like 16 different indexes ranging from 4 shards to 30 shards, with a total BYTE count of about 1TB a day ( and trying to keep 30 days with of indexes ) I think right now I am at 7k shards for 15 days which as I am told is too many shards per node.
So I had a thought and wanted to see if someone had an opinion on it.
Would it be better to have ONE HUGE Index (1TB) and shard it to say 50 (which is the number of nodes I have) That would make it a 20GB Shard size, and then just create filtered Aliases?
I guess the first issue would be would be warm up of the filtered Alias,
Any other issues?
The more I write this out the more I think this might be a good directory to go.
It sounds like you are generating quite a few shards per day, with an average shard size of a few GB. Why not just change the number of primary shards per index so that you get an average shard size around the size you are looking for? A small index does not necessarily need to have a shard on each node in the cluster, and as you will still have a significant number of shards, indexing load should be quite well distributed anyway?
Just want to verify a few things before I provide some inputs on this
Based on what you said above, does that mean you have 16 different types of data with one index each. And for each index, you are trying to keep the data for 30-day? On the average, how many documents are added to each index per day?
In terms of HW, you have 50 data nodes in the cluster?
Alright to get more details: I figured I generalized
Right now we rung 4 different websites so I have a different indexer for
prod 100M docuemets per day
qa <1M documents
dev <1M documents
preview <1M documents
That is the 16 indexes for right now 10 days worth of indexes.
For a total of 4B Documents across 126 indexes (4.3K shards), with 50 Data Nodes - 16G Heap - 10 CPU's per node
As for the total count of TYPE's well it looks to be 62 right now, I just did a Unique count across all the indexers
What seems so far of a no-brainier, I think I would like to keep each production index separated
But make one "big" index for non-prod with lots of aliases on a field I have set to identify which website it is from.
I put it in to my QA environment and it seems to work almost as good as performance seems equal or better since now it can "Cache" more. I have to run some tests against it but I was letting it bake in with some real data.
I am just trying to figure out if there are any words of wisdom, bottlenecks, "Danger Will Robinson" or just good suggestions. Before I roll this to production
So yes for some of the small indexes they have scaled down to like 5 primary shards, on the large ones We were up at 60 but we have slowly dropped the numbers to 20 with little impact so far to searching and indexing. as we have to Index 10-30K messages a second.
Unfortunately we don't have a separate environment to test at this scale, so each tweak, I have to watch over weeks of time to gauge the change.
I think if I can merge say 12 different indexes in to one , this will 2/3 Reduction in the number of shards with little effort
my only worry is having to create 12 different aliases and determining if there would be any substantial impact. I am curious before I fully implemented on the production cluster if there are any warnings or things I should look out for before doing filtered aliases vs just individual indexes
Thanks for the info. With ES, in theory, a shard can hold up to 2B documents and in practice I normally try to keep around 250M max per shard when I estimate the number of shards needed per index.
With 100M documents per day -> in 10 days, you'll have 1B documents, and 30 days you have 3B documents.
With 50 nodes you have, let's say you create one index with 50 shards. To hold 3B documents per month, each shard will hold 60M documents. If the data is doubled, 6B documents per month, each shard needs to hold 120M documents. This means using one index with 50 shards is fine for handling 6B documents per month, you don't need to create an index per day with x number of shards. I would give this a try.
WRT to the others with 1M documents per day (30M documents per month), you can do similar calculation with a smaller number of shards, you don't need 50 shards for those indices.
When you have less number of shards, it reduces file I/O traffic significantly -> improve both index and search operations.
I suggest to test those small indices first with one shard or 3 shards max.
If you can max out the heap to 32GB assuming each node has at least 64GB RAM, that would help.
What is the average shard size for the different indices? The great thing about having daily indices is that you can change the shard count day by day and therefore gradually reduce it or even revert back if you go too far.
Yah I @thn thanks for your stats, I will have to keep some separation not only for backup purposes but also so that one business unit does not impact the others. Those Production Indexes are in the area of 400GB and during peak go up to 1.2TB Size but that being said,
That is a good calculation So I will see what it can Down Scale too.
I was hoping to have more of a discussion of "filtered Aliases" then shared scaling but sounds like the reason I want to use Filtered Aliases is not as much of an issue as my shard counts. It will take me some time to digest all of this information
Forgot to mention about the alias. A member of the team did an experiment with aliases and the performance was very bad when dealing with 100+ aliases... at 1K+, it's very bad. This is 1 index with 3 shards and 3 replicas with about 5 million documents (I don't think he did 1 billion documents)
I suggest you run your own test with your data and hardware to find the acceptable performance.
ok, will keep that in mind, though I am only talking 10 filtered aliases on something that probably will not be more then 100Gbytes of data.
The BIG indexes I plan to have separated indexes
So thanks guys been digesting what information you have provided, I used to understand it but would not accept the reality of it.
So this is what I have decided
- Non-Prod logs are so small that they don't warrent their own index. So I am dropping then in one Index,
- With this 1 index, I will have filtered Aliases to allow users to get to the ENV information quicketly
- Next, As it was a plan of mine anyway, I have Broken up my LARGE cluster to 5 different Cluster 1 for each website.
- this had 2 major benefits, Allow Abusive things to happen to one site and not affect the other, allow me to make changes without affecting everyone
- One Downside is that now I have to RIght Size the disk space per server
- I have Now that I reduces the size of cluster, I am also running 1/2 the amount of nodes. (was 4 per hardware now 2) I left the heap size the same, thinking I might change my mind later (12gb)
- Finally since the number of Data nodes are at "8" I reduced my primary shards to 4. I will play with this number as the environment burns in.
So basically I end up with the Following
5 Clusters - upgraded from 2.4 to 5.4.1
with 4 Hardware unites per cluster ( >100GB ram ) 40 CPU's
It is looking like 3 indexes will be hosted on each cluster 1 BIG one and 3 other smaller ones
I have set the Shard size to 4 with replication 1
And with 2 days of data we are Flying,
I think my I was Gifted too much hardware and I said, SURE lets USE/Abuse it.
@thn and @Christian_Dahlqvist your questions got me to think beyond and re-think what I was doing
Glad to see your new adjustment with your data.
Regarding the non-prod logs data, you can also use "document type" instead of aliasing if it fits your requirements. I will only put each log in its own index if it needs to have its own mapping.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.