Too many small indices: shards are creating an issue

Hello all,

I have a requirement where we have tens of thousands of small indices (~10-20 MB each).
Now the issue is that each index creates 2 shards (1 primary & 1 replica).
So for tens of thousands of indexes this becomes a huge investment, since one node can only hold 1000 shards by default.

Also, the disk usage is very low.
We have 280 GB of disk, of which only 1 GB is used, but we already have 700 indexes and all the shards are used up. Technically we are not able to create more indexes on this node; we would need to scale up.

Any solution to this?
Please help!

Hello Jay,

Which version are you using? Elasticsearch 8.x made great improvements regarding shard overhead (see here):

As a general rule of thumb, you should have fewer than 3000 indices per GB of heap on master nodes. For example, if your cluster has dedicated master nodes with 4GB of heap each then you should have fewer than 12000 indices

Best regards
Wolfram

Is it similar data for each index?

Hey @Wolfram_Haussig

Thanks for your reply and time!

I am using 8.15.2

Yes, it has made an improvement, but my problem is that I have too many small indexes (~25000), each of which is only 10-20 MB.
Now if each index takes one shard, I will need ~25000 shards, which will require 25 nodes.

The cost will be pretty high!
Please let me know of any workarounds or tactics for such a situation.

Hi @dadoonet

Thanks for time and reply!

No, we have massively different data for each index.

25k indexes is pretty high. What is your use case behind so many indexes? Maybe there is a better way. Are those time-based indexes?

Hello @Wolfram_Haussig

My cluster has 16 GB of heap memory, but I am still facing the shard error.

Validation Failed: 1: this action would add [2] shards, but this cluster currently has [1450]/[1000] maximum normal shards open;')

Am I doing something wrong?

The use case demands having ~25000 indexes with massively different data. The use case relates to construction documents.

No, those are not time-based indexes. These are purely text-based indexes.

Even though there have been improvements in recent versions, Elasticsearch does in my experience still not necessarily scale well with a large number of small indices and shards, especially if the mappings are all different. It would help if you could provide some additional information about the data and the use case so we can better judge whether there are any workarounds or improvements. What I would like to know is:

  • Is this a fixed or reasonably static number of indices or is it likely to increase over time?
  • How large and complex are the documents? How many fields do they have on average? Are you using nested documents? Are you using different types of analysers?
  • Are mappings controlled or dynamic?
  • Are indices queried one at a time or do you query groups of indices? If you are querying groups of indices, do you have any set of common fields?
  • How do you query the data? Do you primarily filter based on fields or do you rely on relevance and scoring?

Hello @Christian_Dahlqvist

  • Is this a fixed or reasonably static number of indices or is it likely to increase over time?
    -- It will increase to more indices over time.

  • How large and complex are the documents? How many fields do they have on average? Are you using nested documents? Are you using different types of analysers?
    -- Documents are pretty simple, with 10 text fields. Yes, one field is nested. The standard analyzer is used.

  • Are mappings controlled or dynamic?
    -- Using static mappings, the same for all indexes.

  • Are indices queried one at a time or do you query groups of indices? If you are querying groups of indices, do you have any set of common fields?
    -- Group queries are highly frequent.

  • How do you query the data? Do you primarily filter based on fields or do you rely on relevance and scoring?
    -- Both.

If that is the case, I do not see why you could not add a field indicating which index/group each document belongs to, store all the data in a single index, and just add a filter on this field when querying. You mentioned that you have massively different data for each index, which made me believe the mappings varied a lot, but if the mapping is the same and static, the nature of the data does not necessarily matter.

Why do you think you need that many indices and cannot use a single one?
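As a rough sketch of what I mean (using the Python client; the index name, the group_id field, and the connection details are just placeholders for illustration):

```python
# Rough sketch: one shared index with a "group_id" keyword field instead of
# one index per group. Index name, field names and host are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Every document carries the group it belongs to.
es.index(
    index="construction-docs",
    id="doc-1",
    document={
        "group_id": "project-123",   # replaces the per-group index
        "title": "Foundation plan",
        "body": "Reinforced concrete footing details ...",
    },
)

# Queries combine the full-text part with a filter on the group, so scoring
# is unaffected and the filter clause is cheap and cacheable.
resp = es.search(
    index="construction-docs",
    query={
        "bool": {
            "must": {"match": {"body": "concrete footing"}},
            "filter": {"term": {"group_id": "project-123"}},
        }
    },
)
print(resp["hits"]["total"])
```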

Documents might get frequent updates too. Having separate indexes seems to be a good choice here.

Also, out of the 10 fields, the first 4 are indexed at creation time and the remaining 6 are updated later.

How will updates behave on a single index vs. separate indexes?

The last resort is what you just mentioned. In fact, that is what we thought from the start.

Or you could group the 25k logical sets of data into more than just one physical index, perhaps into 10 or 100?

Why would this be the case? If you were using a single index, you would likely have a number of primary shards and might even use routing based on the group field being added. If you are using bulk requests for updates, it may help to have fewer shards, as you will get fewer, larger writes.

So this was wrong. It is a similar type of data.

So you can put all your data within one index. Use an attribute to define the owner of the data. Use a routing key with that same value so all data for a given user will go to the same shard.
And add a filtered alias which has both the filter and the routing key.
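As a rough sketch with the Python client (the index, field, alias and owner names below are made up for illustration):

```python
# Rough sketch: a filtered alias per owner on top of a single shared index.
# The filter limits searches to the owner's documents and the routing key
# keeps all of that owner's data on one shard. All names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
owner = "customer-42"

es.indices.put_alias(
    index="construction-docs",
    name=f"docs-{owner}",
    filter={"term": {"owner_id": owner}},
    routing=owner,
)

# Reading and writing through the alias now behaves much like a dedicated
# per-owner index, without the per-index shard overhead.
es.index(index=f"docs-{owner}", document={"owner_id": owner, "title": "Site survey"})
resp = es.search(index=f"docs-{owner}", query={"match": {"title": "survey"}})
```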

Is a large number of filtered aliases not also an issue, even though it is not explicitly capped? Would it not be better to add a terms query for the new attribute to each query?

Okay, yeah, I meant the type of data, not the mappings.

Hey @Christian_Dahlqvist

Can you please explain this in a little more detail?

Let's say I have a big index with GBs of data (all indexed with unique routing values) that has 5 shards.

Now, every day, I need to do a bulk update on some MBs of data. Will this take time / put load on the cluster, or is there a better way to do this operation?

Thanks!

Updates always put load on the cluster, irrespective of how many indices you have. If you use routing and can group updates related to the same group into bulk requests (a bulk request can handle multiple groups) only a limited number of shards will be updated per bulk request as you know all documents related to a specific group reside in a single shard (together with data from other groups). This should be reasonably efficient and likely not any worse than your current approach. It may even be better.
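As a rough sketch with the Python client (the index name, field names and routing values are made up for illustration), a daily bulk update grouped by routing key could look like this:

```python
# Rough sketch: partial updates sent as one bulk request, each action carrying
# the routing value used at index time, so only the shards that own those
# groups are touched. All names and values are placeholders.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

updates = [
    {"_id": "doc-1", "group": "project-123", "fields": {"status": "approved"}},
    {"_id": "doc-7", "group": "project-123", "fields": {"status": "rejected"}},
    {"_id": "doc-9", "group": "project-456", "fields": {"revision": 3}},
]

actions = (
    {
        "_op_type": "update",
        "_index": "construction-docs",
        "_id": u["_id"],
        "routing": u["group"],   # same routing value used when indexing
        "doc": u["fields"],      # partial update of the mutable fields
    }
    for u in updates
)

ok, errors = helpers.bulk(es, actions, raise_on_error=False)
print(f"{ok} documents updated, {len(errors)} errors")
```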


Yeah, it is in terms of the cluster state size, but it's definitely much better than a huge number of indices. :wink: