Sharding question

I have one index with around 10K documents, which may grow to at most 20K documents in a few years, and another index with more than 20 million documents, which is likely to double or triple within a year.

How should I pick the number of shards per index? This is for a 3-node cluster where all nodes are master-eligible data nodes [node.master: true and node.data: true].

For the 10K-document index, should I have 3 shards and 1 replica, and for the 20M-document index, 5 shards and 2 replicas? If so, what will the effect be when I scale horizontally and add more nodes? And what should the configuration be if I add, say, 3 or 4 more nodes?

Sam.

What sort of data is it?

The 20 million documents are mostly user views/ratings, and the 10K documents are just the tags for documents.

You should use time based indices for reviews, given they happen at a specific time. That means you can start with a daily/weekly/monthly index with a small shard count, and easily scale.
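As a rough illustration of the naming side of this, here is a minimal Python sketch that maps an event date to a monthly index name. The `ratings` prefix and the `prefix-YYYY.MM` pattern are assumptions chosen for the example, not something prescribed in this thread; any consistent, sortable pattern would do.

```python
from datetime import date


def monthly_index_name(prefix: str, day: date) -> str:
    """Return the name of the monthly index an event on `day` would be written to.

    The "<prefix>-YYYY.MM" pattern is an assumed convention for this sketch.
    """
    return f"{prefix}-{day:%Y.%m}"


# Hypothetical usage: a rating recorded on 14 March 2017.
print(monthly_index_name("ratings", date(2017, 3, 14)))
```

The application would then index each document into the name returned for its timestamp, so a new index appears whenever the first document of a new month is written.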

If I use time-based indices, I end up with 365, 52, or 12 indices per year. Is having 365 indices a good thing?
And if I have to search/aggregate over the last 30 or 90 days using the Java Transport Client, do I have to specify all the indices when building the query? I'm a bit of a newbie here; how do I create time-based indices?

Depends on your data volumes. I'd imagine for that 50 million size you can go with monthly with 1-3 primary shards and one replica.
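On the question of searching the last 30 or 90 days: Elasticsearch accepts a comma-separated list of index names (or a wildcard such as `ratings-*`) as the search target, so the names can be computed rather than typed out. The sketch below assumes monthly indices named `<prefix>-YYYY.MM`; the prefix, the function name, and the dates are hypothetical.

```python
from datetime import date, timedelta


def indices_for_last_days(prefix: str, days: int, today: date) -> list:
    """List the monthly indices that overlap the window [today - days, today]."""
    start = today - timedelta(days=days)
    names = []
    year, month = start.year, start.month
    # Walk month by month from the start of the window to the current month.
    while (year, month) <= (today.year, today.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# Hypothetical usage: the target string for a 90-day search.
target = ",".join(indices_for_last_days("ratings", 90, date(2017, 3, 14)))
print(target)
```

Because the first and last monthly index usually cover more than the requested window, the query itself would still need a date range filter on the timestamp field.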

I'd use the Java REST client if you are just starting. The transport client will be deprecated eventually.

Does the REST client have aggregation builders and all the other features the Transport Client provides, such as pagination, bulk operations, etc.?

Not yet. But this is coming...

Thanks. We are going to production soon and will be using the Transport Client. In which version are you going to deprecate it?

I don't know. @javanna probably knows.

Hi,
we will most likely deprecate the Transport Client in 6.0 and remove it in 7.0.

Shards shouldn't exceed 50 GB (undersharding) or be smaller than 5 GB (oversharding). In your case (given that, in my experience, that amount of logs never exceeds 50 GB), I would go for 1 primary shard and 2 replicas, to take advantage of the other two nodes.
Remember that replicas are not backups, but they can provide you HA.
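The arithmetic behind that suggestion can be sketched in a few lines of Python. The 50 GB ceiling comes from the post above; the function names and the example index sizes are assumptions made up for illustration, and real sizing should be based on measured index size.

```python
import math


def primary_shards(estimated_gb: float, max_shard_gb: float = 50) -> int:
    """Smallest primary shard count that keeps each shard at or under max_shard_gb."""
    return max(1, math.ceil(estimated_gb / max_shard_gb))


def total_shard_copies(primaries: int, replicas: int) -> int:
    """Total shard copies in the cluster: each primary plus its replicas."""
    return primaries * (1 + replicas)


# Hypothetical index sizes, for illustration only.
print(primary_shards(30))        # an index that fits in a single shard
print(primary_shards(120))       # a larger index that needs several primaries
print(total_shard_copies(1, 2))  # 1 primary + 2 replicas
print(total_shard_copies(5, 2))  # the 5-shard, 2-replica layout asked about earlier
```

With 1 primary and 2 replicas there are 3 copies, which is one per node on a 3-node cluster; the 5-primary, 2-replica layout yields 15 shard copies for the same data.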
