We are trying to design a search engine for our customers.
We have many customers: some have small data sets, some medium-sized, and some very large.
We are planning on one index per customer.
For small customers we want fewer shards, and for medium and large customers we want an optimal number of shards.
Can anyone suggest the right approach for this? Any views would be helpful.
You are better off having a single shared index for small users that leverages routing.
Then do the same for medium users, but with more shards.
The large users then each get their own index.
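As a rough illustration of the tiered layout suggested above, here is a minimal Python sketch that only builds the request path, query parameters, and body, without calling any client library. The thresholds, the index names (`customers-small`, `customers-medium`, `customer-<id>`) and the `customer_id` field are assumptions made up for the example, not something from this thread; `routing` is the standard Elasticsearch request parameter.

```python
# Hypothetical thresholds and index names, chosen only for illustration.
SMALL_MAX_DOCS = 100_000
MEDIUM_MAX_DOCS = 5_000_000


def target_for(customer_id, doc_count):
    """Return (index_name, routing) for a customer.

    Small and medium customers share a tier index and are routed by
    customer id; large customers get a dedicated index and no routing.
    """
    if doc_count <= SMALL_MAX_DOCS:
        return "customers-small", customer_id
    if doc_count <= MEDIUM_MAX_DOCS:
        return "customers-medium", customer_id
    return "customer-" + customer_id, None


def search_request(customer_id, doc_count, text):
    """Build the path, query parameters, and body for a search."""
    index, routing = target_for(customer_id, doc_count)
    must = {"multi_match": {"query": text, "fields": ["title", "body"]}}
    body = {"query": {"bool": {"must": must}}}
    params = {}
    if routing is not None:
        # Routing only narrows the search to one shard. Other customers
        # can share that shard, so the customer filter is still required.
        params["routing"] = routing
        body["query"]["bool"]["filter"] = {"term": {"customer_id": customer_id}}
    return "/" + index + "/_search", params, body
```

Note that in a shared index the filter on the (assumed) `customer_id` field is what keeps one customer's documents out of another's results; routing by itself is only an optimization.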
How can we mix data from different small clients into one index? Will it create a security issue, given that the data is very sensitive?
If we go for a small number of shards per index for each small customer, what problems might we face?
The system is high on query volume and low on writes and updates.
Hi Mark, we have many machines at our disposal, so that's not a big issue.
Also, we had to disable the Java security manager in ES because we make external calls to other services for social ranking. Will that be an issue with Shield?
Each client has a unique ID. There are around 300 small clients (each with a workforce in the thousands who will query our search clusters).
There are around 400 medium clients and around 40 big ones.
Having an index per user does tend to scale badly, and lots of small indices waste system resources due to the overhead associated with each shard. If, however, you expect fewer than a thousand users, going with a single, separate index per user may actually work. Small users should probably have indices with just a single shard, and depending on data volumes this may apply to medium users too.
The best way to find out is to test under conditions that are as realistic as possible.
Make sure that you use Elasticsearch 2.x so you can benefit from delta cluster state updates.
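To get a feel for the per-shard overhead mentioned above, a back-of-the-envelope count can help. The sketch below uses the client counts given earlier in the thread (300 small, 400 medium, 40 large); the shard counts per tier (one primary for small and medium, five for large) and the single replica are assumptions for illustration only.

```python
def total_shards(tiers, replicas=1):
    """Count all shards (primaries plus replicas) for index-per-customer.

    `tiers` maps a tier name to (number_of_customers, primaries_per_index).
    """
    primaries = sum(count * shards for count, shards in tiers.values())
    return primaries * (1 + replicas)


# Client counts from the thread; primaries per index are assumed values.
tiers = {
    "small": (300, 1),
    "medium": (400, 1),
    "large": (40, 5),
}

if __name__ == "__main__":
    print("indices:", sum(count for count, _ in tiers.values()))
    print("shards with 1 replica:", total_shards(tiers))
```

Varying the assumed primaries per index and the replica count shows how quickly the shard total grows, which is the number worth checking against what the cluster handles in a realistic test.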
When you use routing, all documents belonging to a customer will be located in the same shard, but there will be multiple customers per shard, so it could impact scoring.
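The effect of routing described above can be sketched as follows. Elasticsearch picks the shard by hashing the routing value and taking it modulo the number of primary shards; the sketch uses `zlib.crc32` purely as a stand-in hash (Elasticsearch uses its own hash function), so the shard numbers it produces will not match a real cluster. The customer names are made up.

```python
import zlib


def shard_for(routing, num_primary_shards):
    """Map a routing value to a shard number.

    Stand-in for Elasticsearch's routing formula: crc32 is used here
    only for illustration and is not the hash Elasticsearch uses.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


def shards_by_customer(customer_ids, num_primary_shards):
    """Group hypothetical customer ids by the shard they route to."""
    placement = {}
    for customer_id in customer_ids:
        shard = shard_for(customer_id, num_primary_shards)
        placement.setdefault(shard, []).append(customer_id)
    return placement


if __name__ == "__main__":
    customers = ["customer-%d" % i for i in range(10)]
    for shard, ids in sorted(shards_by_customer(customers, 4).items()):
        print(shard, ids)
```

Every document routed with the same customer id lands on the same shard, and with more customers than shards some customers necessarily share one.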
Routing is fine when you want to store specific data on specific shards, but here the data is very general: apart from language, there is nothing to differentiate on for routing. Queries to the system are general, searching documents by title and body.
Do you have any more input on routing?