We have a use-case where we have all the queries querying within a specific partition of data. For example, our document looks like
PersonID: Unique id of the person
First Name: First Name of Person
Last Name: Last Name of Person
City: Name of City
PinCode: (9 digit in general but can be transformed into string for optimization)
Description: Description of Person
Hobbies: List of Hobbies
Income_range: Value Between 1-5
Score: Sorting Score
We have around 1 billion records with Pincode in order of 50 million. Each record is on average 1 KB in size. There can be updates changing the pincode of each record.
All our queries always specify the Pincode and then optionally add other filters. For example,
- Find Persons in PinCode=123 and Is_Senior_Citizen = 1
- Find Persons in PinCode=456 and income range = 4 and description contains "good"
All our queries are always run on a specific partition of data which is partitioned by PinCode. Further we need the output to be always sorted by Score irrespective of any relevancy criteria. Is there is way to optimize the queries.
I know there are alias and _routing.
We can't use one alias for each PinCode as pincodes change for a person very frequently. Further each alias create a different index and having more than 50 million index doesn't look scalable.
For routing, if we use PinCode as parameter to do custom routing, when we change the cluster size, does elastic search automatically re-indexes the records to the right shards/nodes. Further, do we know how much performance boost can be expect.
I am open to other suggestions on how can we setup the cluster and improve the performance. Our queries need to serve a live website so are looking for good querying latency.
How many concurrent queries are you looking to serve? How often are documents updated and/or indexed?
We are looking to serve around 200 queries per second (assuming 10m ms latency per query it will be around 20 concurrent queries).
We are expecting to update the cluster with around 5000 updates per second.
Routing would allow each query to be served by a single shard, so would make it more efficient to support large number of queries. The cost of this is however that updates where the pin code changes need to be handled as a delete followed by an insert rather than a normal update as the may need to be moved to a different shaft. As you have a fair amount of updates this may not be worthwhile. The only way to know for sure which setup that is best for your use case is to benchmark it with realistic data and load mix.
Thanks @Christian_Dahlqvist for your response. I am also wondering if changing the refresh interval would help here like increasing it to 30 seconds so that multiple updates to a shard can be processed together. How does that fare when we are already posting the updates in bulk?
It can certainly help, but will delay the updates being available for search. If you went with routing and moving documents between shards it could also increase the time data may show up in different shards.
Thanks @Christian_Dahlqvist for your prompt responses.
How does rebalancing works when using custom routing? For example
Let say we have 5 shards initially and 100 records and with custom routing lets assume each shard gets 100 records. Later when we decide to increase the number of shards to 10, how will elastic search redistribute the records? I guess to figure out the right shard elastic search will need to know the routing key. Does elastic search store the routing key along with the record? Or the application owner needs to manage the rebalancing the cluster?
Further is there a way to specify _id and _routing in the document/record? I know we have to specify them in URL but was wondering if we can specify them in the document as that will be more easier to manage and debug.
Are you referring to changing the number of shards using the split index API?
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.