I have a requirement where we have tens of thousands of small indices (~10–20 MB each).
The issue is that each index creates 2 shards (1 primary and 1 replica).
So for tens of thousands of indices this becomes a huge investment, since by default a node can only hold 1000 shards.
Disk usage is also tiny: we have 280 GB of disk of which only 1 GB is used, yet with 700 indices all the shards are used up. Technically we are not able to create more indices on this node, so we would need to scale up.
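For context, the 1000-shard limit mentioned above is the cluster setting `cluster.max_shards_per_node`, which defaults to 1000 per non-frozen data node. A minimal way to inspect it (raising it only trades one problem for another):

```
# Show the effective per-node shard limit, including the default value
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node
```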
Which version are you using? Elastic 8.x made a great improvement regarding per-shard and per-index overhead (see here):
> As a general rule of thumb, you should have fewer than 3000 indices per GB of heap on master nodes. For example, if your cluster has dedicated master nodes with 4GB of heap each then you should have fewer than 12000 indices.
Yes, that is an improvement, but my problem is that I have a very large number of small indices (~25,000), each around 10–20 MB.
Even if each index takes only one shard, I will need ~25,000 shards, which at 1000 shards per node requires ~25 nodes.
The cost will be pretty high!
Please let me know if there is any workaround or tactic for this situation.
Even though there have been improvements in recent versions, in my experience Elasticsearch still does not necessarily scale well with a large number of small indices and shards, especially if the mappings are all different. It would help if you could provide some additional information about the data and the use case so we can better judge whether there are any workarounds or improvements. What I would like to know is:
Is this a fixed or reasonably static number of indices or is it likely to increase over time?
How large and complex are the documents? How many fields do they have on average? Are you using nested documents? Are you using different types of analysers?
Are mappings controlled or dynamic?
Are indices queried one at a time or do you query groups of indices? If you are querying groups of indices, do you have any set of common fields?
How do you query the data? Do you primarily filter based on fields or do you rely on relevance and scoring?
Is this a fixed or reasonably static number of indices or is it likely to increase over time?
-- It will increase over time.
How large and complex are the documents? How many fields do they have on average? Are you using nested documents? Are you using different types of analysers?
-- The documents are pretty simple, with about 10 text fields. Yes, one field is nested. The standard analyzer is used.
Are mappings controlled or dynamic?
-- We use a static mapping, the same for all indices.
Are indices queried one at a time or do you query groups of indices? If you are querying groups of indices, do you have any set of common fields?
-- Group queries are highly frequent.
How do you query the data? Do you primarily filter based on fields or do you rely on relevance and scoring?
-- Both.
If that is the case, I do not see why you could not add a field to indicate which index/group a document belongs to, store all the data in a single index, and simply add a filter on this field when querying. You mentioned that you have massively different data for each index, which made me believe the mappings varied a lot, but if the mapping is the same and static, the nature of the data does not necessarily matter.
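As a minimal sketch of that idea (index name, field names, and example values are hypothetical, not from this thread): one shared index with a `group` keyword field, plus a query that filters on the group while still matching and scoring on a text field, which matches the "both" answer above:

```
PUT /shared-data
{
  "mappings": {
    "properties": {
      "group": { "type": "keyword" },
      "title": { "type": "text" }
    }
  }
}

GET /shared-data/_search
{
  "query": {
    "bool": {
      "filter": [ { "term": { "group": "group-00042" } } ],
      "must":   [ { "match": { "title": "quarterly report" } } ]
    }
  }
}
```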
Why do you think you need that many indices and cannot use a single one?
Why would this be the case? If you were using a single index you would likely have a number of primary shards, and might even use routing based on the group field being added. If you are using bulk requests for updates it may help to have fewer shards, as you will get fewer, larger writes.
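Continuing the hypothetical sketch above (and noting that the number of primary shards is fixed when the index is created), supplying the group value as the routing key at both index and search time confines each group to a single shard:

```
# Route by group at index time...
PUT /shared-data/_doc/1?routing=group-00042
{
  "group": "group-00042",
  "title": "quarterly report"
}

# ...and at search time, so only one shard is searched
GET /shared-data/_search?routing=group-00042
{
  "query": {
    "bool": {
      "filter": [ { "term": { "group": "group-00042" } } ]
    }
  }
}
```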
So you can put all your data within one index. Use an attribute to define the owner of the data, and use a routing key with the same value so that all data for a given owner goes to the same shard.
And add a filtered alias that carries both the filter and the routing key.
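A sketch of such an alias (index, alias, and owner values are hypothetical; the owner attribute here plays the same role as the group field above). The filter and routing are attached once, and every request through the alias picks them up automatically:

```
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "shared-data",
        "alias": "owner-acme",
        "filter": { "term": { "owner": "acme" } },
        "routing": "acme"
      }
    }
  ]
}

# Clients then query the alias as if it were a dedicated index
GET /owner-acme/_search
{
  "query": { "match": { "title": "quarterly report" } }
}
```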
Is a large number of filtered aliases not also an issue, even though it is not explicitly capped? Would it not be better to add a terms query for the new attribute to each query?
Updates always put load on the cluster, irrespective of how many indices you have. If you use routing and can group updates related to the same group into bulk requests (a single bulk request can span multiple groups), only a limited number of shards will be written to per bulk request, since all documents for a specific group reside in a single shard (together with data from other groups). This should be reasonably efficient and likely no worse than your current approach. It may even be better.
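A sketch of such a bulk request (IDs and values hypothetical): the per-action routing means these two documents touch at most two shards, however many groups the index holds:

```
POST /_bulk
{ "index": { "_index": "shared-data", "_id": "1", "routing": "group-00042" } }
{ "group": "group-00042", "title": "quarterly report" }
{ "index": { "_index": "shared-data", "_id": "2", "routing": "group-00417" } }
{ "group": "group-00417", "title": "annual summary" }
```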