Sharding question

(Sam) #1

I have one index with around 10K documents, which may grow to at most 20K documents in a few years, and another index with more than 20 million documents, which is likely to double or triple within a year.

How should I pick the number of shards per index? This is for a 3-node cluster in which all nodes are master-eligible [ node.master: true and]

For the 10K-document index, should I have 3 shards and 1 replica, and for the 20M-document index, should I have 5 shards and 2 replicas? If so, what will the effect be when I want to scale horizontally and add more nodes? And what should the configuration be if I want to add, say, 3 or 4 more nodes?


(Mark Walkom) #2

What sort of data is it?

(Sam) #3

The 20 million documents are mostly user views/ratings, and the 10K documents are just the tags for documents.

(Mark Walkom) #4

You should use time-based indices for reviews, given they happen at a specific time. That means you can start with a daily/weekly/monthly index with a small shard count, and easily scale.
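To make the idea of a time-based index concrete, here is a minimal sketch in Python of how a document's date could be mapped to the index it belongs to. The `ratings` prefix, the `index_name` helper, and the naming patterns are all assumptions chosen for illustration; Elasticsearch does not prescribe any particular naming scheme.

```python
from datetime import date


def index_name(prefix: str, day: date, granularity: str = "monthly") -> str:
    """Return the (hypothetical) time-based index name for a document dated `day`."""
    if granularity == "daily":
        return f"{prefix}-{day:%Y.%m.%d}"
    if granularity == "weekly":
        # ISO week numbering, so the week's year can differ from the calendar year
        iso_year, iso_week, _ = day.isocalendar()
        return f"{prefix}-{iso_year}.w{iso_week:02d}"
    if granularity == "monthly":
        return f"{prefix}-{day:%Y.%m}"
    raise ValueError(f"unknown granularity: {granularity}")


# A rating recorded on 15 March 2017 would go to these indices:
print(index_name("ratings", date(2017, 3, 15), "daily"))    # ratings-2017.03.15
print(index_name("ratings", date(2017, 3, 15), "monthly"))  # ratings-2017.03
```

The application writes each document to the index returned for its timestamp, so a new index starts automatically when the day, week, or month rolls over.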

(Sam) #5

If I use time-based indices, I end up with 365, 52, or 12 indices per year. Is having 365 indices a good thing?
And if I have to search/aggregate over the last 30 or 90 days using the Transport Client in Java, do I have to specify all the indices while building the query? I'm a bit of a newbie here. Also, how do I create time-based indices?

(Mark Walkom) #6

Depends on your data volumes. I'd imagine for that 50 million size you can go with monthly with 1-3 primary shards and one replica.

I'd use the Java REST client if you are just starting. The transport one will be deprecated eventually.
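On the earlier question about searching the last 30 or 90 days: Elasticsearch accepts a comma-separated list of index names (and wildcards such as `ratings-*`) as the search target, so the query itself does not change. The sketch below, in Python, only shows how the list of monthly indices covering a lookback window could be computed. The `ratings-YYYY.MM` naming pattern and the `monthly_indices` helper are assumptions for illustration, not an Elasticsearch API.

```python
from datetime import date, timedelta
from typing import List


def monthly_indices(prefix: str, end: date, days: int) -> List[str]:
    """List the (hypothetical) monthly index names touched by the last `days` days."""
    start = end - timedelta(days=days)
    names = []
    year, month = start.year, start.month
    # Walk month by month from the start month to the end month, inclusive
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}-{year:04d}.{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


# Indices needed for a 90-day lookback from 15 March 2017
targets = monthly_indices("ratings", date(2017, 3, 15), 90)
print(",".join(targets))
# ratings-2016.12,ratings-2017.01,ratings-2017.02,ratings-2017.03
```

Because the first and last months are only partially inside the window, the query would still need a date range filter on the timestamp field.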

(Sam) #7

Does the REST client have aggregation builders and all the other features, like pagination, bulk operations, etc., that the Transport Client provides?

(David Pilato) #8

Not yet. But this is coming...

(Sam) #9

Thanks. We are going to production soon and will be using the Transport Client. In which version are you going to deprecate the Transport Client?

(David Pilato) #10

I don't know. @javanna probably knows.

(Luca Cavanna) #11

We will most likely deprecate the Transport Client in 6.0 and remove it in 7.0.

(Miguel) #12

Shards shouldn't exceed 50 GB (undersharding) and shouldn't be smaller than 5 GB (oversharding). In your case (considering that my average size for that amount of logs never exceeds 50 GB) I would go for 1 primary shard and 2 replicas, to take advantage of the other two nodes.
Remember that replicas are not backups, but they can provide you with HA.
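
The sizing rule above can be turned into quick arithmetic. This is a minimal sketch in Python, assuming the 50 GB ceiling quoted above and an index size that you estimate yourself. The function names are illustrative and not part of any Elasticsearch API.

```python
import math

MAX_SHARD_GB = 50  # upper bound per shard, taken from the rule of thumb above


def primary_shards(index_size_gb: float, max_shard_gb: float = MAX_SHARD_GB) -> int:
    """Smallest number of primary shards that keeps each shard under the ceiling."""
    return max(1, math.ceil(index_size_gb / max_shard_gb))


def total_shard_copies(primaries: int, replicas: int) -> int:
    """Every replica is a full copy of each primary shard."""
    return primaries * (1 + replicas)


# An index estimated at 40 GB fits in a single primary shard
print(primary_shards(40))         # 1
# 1 primary + 2 replicas gives one copy on each node of a 3-node cluster
print(total_shard_copies(1, 2))   # 3
# The 5 shards / 2 replicas layout from the first post means 15 shard copies
print(total_shard_copies(5, 2))   # 15
```

The number of primary shards is fixed when an index is created, while the replica count can be changed later, which is one reason to size primaries from the expected data volume rather than from the current node count.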

