Single index with multiple shards, or multiple indices with a single shard each?

Assume the data is the same in both cases: 100,000,000 docs in total.

Plan A: a single index with multiple shards
Plan B: multiple indices, each with a single shard

Both without replicas.


When indexing:

Plan A

  • a single index named INDEX_LOG with 5 shards
  • when indexing is finished, each shard contains about 20,000,000 docs

Plan B

  • use the rollover API with the condition max_docs = 20,000,000
  • then shrink the rolled-over index to only one shard
  • when indexing is finished, there are 5 indices, each with only one shard, and each shard contains 20,000,000 docs

  • I know that both plans end up with 5 shards (Lucene indices), so the two plans are the same when searching
  • when indexing, the write index has 5 shards, so I think the performance is the same, or plan A may be just a little slower than plan B
  • do these two plans differ much in other scenarios? Which one should I choose?
    Thank you.

What kind of data is that? Time based data like logs, sensors, ...?

I guess so because you mention the rollover API, which is often used in that case.

At index time, plan A will perform better than plan B.

For plan B you said "shrink to only one shard". Do you mean that for plan B you will use more than one shard at index time, let's say 4, so 5m docs per shard, and then after the shrink 20m docs in a single shard? Is that right?

I'd personally go that route. So a mix of plan A and plan B:

  • Create index with x shards
  • Rollover the alias
  • Shrink the index to 1 shard
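
The three steps above can be sketched as plain REST requests. This is only a minimal illustration, assuming the rollover and shrink endpoints of the Elasticsearch REST API. The index names `index_log-000001` and `index_log-shrunk-000001` and the node name `shrink-node` are hypothetical; the alias `write_indexName` and the `max_docs` value are the ones mentioned in this thread. The code only builds the requests and does not send them, so it can be adapted to any HTTP client.

```python
import json


def rollover_request(alias, max_docs):
    """Roll the write alias over to a new index once it holds max_docs docs."""
    body = {"conditions": {"max_docs": max_docs}}
    return "POST", f"/{alias}/_rollover", body


def prepare_shrink_request(index, node_name):
    """Block writes and require a copy of every shard on one node."""
    body = {
        "settings": {
            "index.blocks.write": True,
            "index.routing.allocation.require._name": node_name,
        }
    }
    return "PUT", f"/{index}/_settings", body


def shrink_request(source, target, shards=1):
    """Shrink the write-blocked source index into a target with fewer shards."""
    body = {"settings": {"index.number_of_shards": shards}}
    return "POST", f"/{source}/_shrink/{target}", body


def plan(alias, old_index, shrunk_index, node_name, max_docs):
    """Return the requests in the order they would be issued."""
    return [
        rollover_request(alias, max_docs),
        prepare_shrink_request(old_index, node_name),
        shrink_request(old_index, shrunk_index),
    ]


if __name__ == "__main__":
    # Hypothetical names; only the alias and max_docs appear in the thread.
    for method, path, body in plan(
        "write_indexName", "index_log-000001", "index_log-shrunk-000001",
        "shrink-node", 20_000_000,
    ):
        print(method, path, json.dumps(body))
```

Note that a rollover condition is only evaluated when the rollover request is made, so the first request would have to be issued periodically.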

I recommend that you look at this video from @pmusa, which explains all of that in detail.

HTH

Hi David, thank you very much for your reply.

In both plans, at index time, the write index has 5 shards.

In plan B, I use an alias named write_indexName, roll it over, and then shrink the old index to only one shard.

So what you call "a mix of plan A and plan B" is the same as what I mean by plan B.

So I do not know what a "shard" means for a BLOCK_WRITE index (for example time based data, where we only write new data).
Maybe we only need multiple shards at index time? Can we use smaller indices in place of shards?

I don't understand this part.

I mean that with time based data such as logs, we only write and never modify.

Maybe we only need multiple shards at index time,
and can then shrink the index to ONE shard?

At search time, can we replace a big index that contains multiple shards with multiple indices that contain only one shard each?
Can I say there is NO NEED for multiple shards at search time?

"Maybe we only need multiple shards at index time?"

Yes, if you have a loooooot of documents to index per second and one shard is not enough, then use more than 1 shard for hot indices.

Then measure what the ideal shard size for search is once no more index operations are made on your index. Shrink to that number.

But you will only know that by running tests, as suggested in the talk I linked to.
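
Once an ideal shard size has been measured, picking the shrink target is simple arithmetic. The sketch below assumes the documented shrink constraint that the target shard count must be a factor of the source shard count. The function names are made up for illustration, and the sizes in the usage lines are example inputs only.

```python
def shrink_targets(source_shards):
    """Shard counts strictly smaller than the source that divide it evenly."""
    return [n for n in range(1, source_shards) if source_shards % n == 0]


def pick_shrink_target(source_shards, primary_size_gb, ideal_shard_gb):
    """Smallest valid target whose shards stay at or below the ideal size.

    Returns None when no smaller shard count keeps shards within the limit.
    """
    for n in shrink_targets(source_shards):
        if primary_size_gb / n <= ideal_shard_gb:
            return n
    return None


if __name__ == "__main__":
    # Example inputs: shard count, primary size in GB, measured ideal size in GB.
    print(shrink_targets(5))                # only 1 is possible for 5 shards
    print(pick_shrink_target(6, 100, 50))   # 100 GB of primaries, 50 GB ideal
    print(pick_shrink_target(6, 250, 45))   # too much data for any valid target
```

With 5 source shards the only smaller target is 1, so if a single shard would exceed the ideal size, the rollover threshold has to come down instead.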

In this talk, we are saying for example that for our use case one single shard can hold 45gb of data. What is your number?

How much data will you have per day?

I have multiple indices, and each one receives from 0 to 100,000,000 docs per day.
At the beginning, I set the rollover at 200,000,000 docs (1 replica), which gave 500G per index, 250G for primaries, 6 primary shards.
It was hard to prepare the shrink,
and hard to recover when a node disconnected (3 nodes).

Now I have changed the rollover to 80,000,000 docs, which gives 200G per index and, after the shrink, 100G per shard.
Maybe that is too big?

4400 docs/second. The CDN log, which contains a session field, is too big.


Another thing: my nodes are often heavily loaded when searching or shrinking (not SSD), which may cause a node to disconnect,
and it takes 12+ hours to recover to green (a node may disconnect again during recovery). Many old indices are set to BLOCK_WRITE, but when a node is disconnected for about 20s, all data is recovered from remote rather than from local. Can you give some advice on this?

I’d keep shard size between 20gb and 50gb.
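
To turn that range into a rollover threshold, one rough approach is to scale the current max_docs by the ratio of target to observed shard size. The sketch below assumes that shard size grows linearly with document count, and that the 100G figure above is the primary size of the single shrunk shard for 80,000,000 docs. The function name is hypothetical.

```python
def max_docs_for_shard_size(observed_docs, observed_shard_gb, target_shard_gb):
    """Scale the doc count so one shrunk shard lands near target_shard_gb.

    Assumes shard size is proportional to the number of documents.
    """
    return int(observed_docs * target_shard_gb / observed_shard_gb)


if __name__ == "__main__":
    # Observed values from the thread: 80,000,000 docs gave a 100G shard.
    for target_gb in (20, 50):
        docs = max_docs_for_shard_size(80_000_000, 100, target_gb)
        print(f"{target_gb}gb per shard -> max_docs around {docs:,}")
```

Under these assumptions, the 20gb to 50gb range corresponds to a rollover at roughly 16,000,000 to 40,000,000 docs.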
