Need suggestion on sharding for efficiency


(Rahul Sharma) #1

We are trying to design a search engine for our customers.
We have many customers: some have small data sets, some medium-sized, and some really big ones.
We are planning to go with one index per customer.
For small customers we want fewer shards, and for medium and large customers we want an optimal number of shards.
Can anyone suggest the right approach for this? Please share your views; it will be helpful.

Thanks in advance for your help.

Regards,
Rahul


(Mark Walkom) #2

You are better off having a single index for the small users that leverages routing.
Then do the same for the medium users, but with more shards.
The large users then each get their own index.
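
A rough sketch of what a shared index with routing can look like, assuming one filtered alias per customer: searches through the alias are filtered to that customer's documents and routed to the shard that holds them. The index name `small_customers`, the field `customer_id`, and the alias naming scheme are made up for illustration. The code only builds the request body for the aliases API (`POST /_aliases`); it does not talk to a cluster.

```python
import json


def alias_action(customer_id, index="small_customers", field="customer_id"):
    """Build one 'add' action for the body of POST /_aliases.

    The index name, field name and alias prefix are hypothetical.
    """
    cid = str(customer_id)
    return {
        "add": {
            "index": index,
            "alias": "customer_" + cid,
            # Only documents tagged with this customer are visible via the alias.
            "filter": {"term": {field: cid}},
            # Index and search requests through the alias use this routing value.
            "routing": cid,
        }
    }


body = {"actions": [alias_action(42)]}
print(json.dumps(body, indent=2))
```

An alias like this is a convenience, not access control: anyone who can query the underlying index directly still sees every customer's documents.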


(Rahul Sharma) #3

Hi Mark, thank you for your time and for answering.

How can we mix data from different small clients into one index? Will it create a security issue, as it is very sensitive data?

If we go for a small number of shards per index for each small customer, what problems can we face?
The system is high on query volume and low on writes and updates.

Regards,
Rahul


(Mark Walkom) #4

Shield can help you deal with this.

Lots of small indices/shards waste system resources.


(Rahul Sharma) #5

Hi Mark, we have many machines at our disposal, so that's not a big issue.
Also, regarding Shield: we had to disable the Java security manager in ES because we are making external calls to other services for social ranking. Will that be an issue with Shield?

Regards,
Rahul


(Rahul Sharma) #6

Also, if we have many clients in one index, will scoring and IDF be skewed and give us a different ranking?

Please share your thoughts on this as well.

Thanks for all your help.

Regards,
Rahul


(Christian Dahlqvist) #7

How many users do you expect to support? Are you in control of mappings?


(Rahul Sharma) #8

Each client has a unique ID. There are around 300 small clients (each client has a workforce in the thousands who will query our search clusters).

There are around 400 medium clients and around 40 big ones.


(Christian Dahlqvist) #9

Having an index per user does tend to scale badly, and lots of small indices waste system resources due to the overhead associated with each shard. If, however, you expect fewer than a thousand users, going with a single, separate index per user may actually work. Small users should probably have indices with just a single shard, and depending on data volumes this may apply to medium users too.

The best way to find out is to test under conditions that are as realistic as possible.

Make sure that you use Elasticsearch 2.x so you can benefit from delta cluster state updates.
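
To get a feel for the per-shard overhead, here is a back-of-the-envelope count for the customer numbers given earlier in the thread (300 small, 400 medium, 40 big). The shard counts per index and the replica count are assumptions picked only for illustration, not a recommendation.

```python
# Customer counts as stated in this thread.
customers = {"small": 300, "medium": 400, "big": 40}

# ASSUMPTION: primary shards per index for each tier (illustrative only).
primaries_per_index = {"small": 1, "medium": 1, "big": 5}

# ASSUMPTION: one replica of every primary shard.
replicas = 1


def shard_totals(customers, primaries_per_index, replicas):
    """Return (indices, primary shards, total shards) for index-per-customer."""
    indices = sum(customers.values())
    primary = sum(customers[tier] * primaries_per_index[tier] for tier in customers)
    total = primary * (1 + replicas)
    return indices, primary, total


indices, primary, total = shard_totals(customers, primaries_per_index, replicas)
print("indices:", indices, "primary shards:", primary, "total shards:", total)
```

Under these assumptions the cluster carries 740 indices and 1800 shards in total, which is the kind of figure worth load-testing before committing to an index per customer.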


(Rahul Sharma) #10

Hi Christian, thank you for your advice. Can you please tell me how IDF scoring is affected if we have multiple customers in one index?

Regards,
Rahul


(Mark Walkom) #11

Scoring is per shard, so if you use routing then each customer's scoring will be relative to the shard their own documents are on.
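
As a small illustration of why routing keeps a customer's documents together: Elasticsearch picks the shard as `hash(routing) % number_of_primary_shards`. The sketch below uses `zlib.crc32` as a stand-in hash, which is an assumption made so the example runs anywhere; Elasticsearch uses its own hash function, so the actual shard numbers will differ. The customer ID and shard count are hypothetical.

```python
import zlib


def shard_for(routing, num_primary_shards):
    """Pick a shard from a routing value.

    Stand-in for Elasticsearch's formula
    shard = hash(routing) % number_of_primary_shards;
    crc32 is NOT the hash Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# Five documents that all use the customer ID as their routing value.
docs = [("customer-17", "doc-%d" % i) for i in range(5)]
shards = {shard_for(customer_id, 5) for customer_id, _ in docs}
print("shards used by customer-17:", shards)
```

Because the routing value is the same for every document of a customer, all of them land on one shard, and a search that passes the same routing value only has to visit that shard.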


(Rahul Sharma) #12

Thanks a lot for your input; it really opens up new horizons for us.


(Christian Dahlqvist) #13

When you use routing, all documents belonging to a customer will be located in the same shard, but there will be multiple users per shard, so it could impact scoring.
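
A sketch of how sharing a shard can shift IDF, assuming the classic Lucene TF/IDF similarity, where `idf = 1 + ln(numDocs / (docFreq + 1))`. Term statistics are collected for the whole shard, so a filter on the customer limits which documents match but not the counts that go into IDF. All document counts below are invented for illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Lucene classic TF/IDF: idf = 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


# HYPOTHETICAL: the customer alone has 10,000 documents, 50 containing the term.
own_only = classic_idf(doc_freq=50, num_docs=10000)

# HYPOTHETICAL: the shared shard holds 100,000 documents from many customers,
# and the term is common in the other customers' data (5,000 documents).
shared = classic_idf(doc_freq=5000, num_docs=100000)

print("idf with the customer's own statistics: %.2f" % own_only)
print("idf with shared-shard statistics:       %.2f" % shared)
```

A term that is rare for one customer but common for the shard's other tenants gets a lower weight than it would in a dedicated index, which can reorder that customer's results.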


(Rahul Sharma) #14

Routing is fine when you want to store specific data on specific shards, but here the data is very general; apart from language there is nothing to differentiate on for routing. The queries to the system are general: searching documents by title and body.
Do you have more input on routing?

Thanks a lot for all your help.

Regards,
Rahul

