Partitioning Data in Elastic Search

Chinmaya_Poswalia · August 10, 2018, 10:19pm

We have a use case where every query runs within a specific partition of the data. For example, our documents look like this:

PersonID: Unique id of the person
First Name: First Name of Person
Last Name: Last Name of Person
City: Name of City
PinCode: (9 digit in general but can be transformed into string for optimization)
Description: Description of Person
Hobbies: List of Hobbies
Is_Senior_Citizen: boolean
Income_range: Value Between 1-5
Score: Sorting Score

We have around 1 billion records and on the order of 50 million distinct PinCodes. Each record is about 1 KB on average. Updates can change the PinCode of a record.

All our queries always specify the Pincode and then optionally add other filters. For example,

Find Persons in PinCode=123 and Is_Senior_Citizen = 1
Find Persons in PinCode=456 and income range = 4 and description contains "good"

All our queries always run on a single partition of the data, which is partitioned by PinCode. In addition, the output must always be sorted by Score, regardless of any relevance criteria. Is there a way to optimize these queries?
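The query shape described above can be sketched as a request body for the search API. This is only an illustration under stated assumptions: PinCode is assumed to be mapped as a keyword field, the sort order on Score is assumed to be descending (the post does not say), and the helper name `build_pincode_query` is hypothetical. All clauses sit in filter context, so no relevance score is computed and the sort on Score decides the order.

```python
import json


def build_pincode_query(pincode, term_filters=None, text_filters=None, size=10):
    """Build a search body that filters on PinCode and sorts by Score.

    term_filters: exact-match filters, e.g. {"Is_Senior_Citizen": True}
    text_filters: full-text filters, e.g. {"Description": "good"}
    """
    # The PinCode filter is always present; everything else is optional.
    clauses = [{"term": {"PinCode": pincode}}]
    for field, value in (term_filters or {}).items():
        clauses.append({"term": {field: value}})
    for field, text in (text_filters or {}).items():
        # A match clause inside "filter" only includes or excludes documents.
        clauses.append({"match": {field: text}})
    return {
        "size": size,
        "query": {"bool": {"filter": clauses}},
        # Assumption: higher Score first.
        "sort": [{"Score": {"order": "desc"}}],
    }


if __name__ == "__main__":
    body = build_pincode_query("456", {"Income_range": 4}, {"Description": "good"})
    print(json.dumps(body, indent=2))
```

If custom routing is in use, the same body would be sent with the PinCode as the `routing` request parameter so that only one shard is searched.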

I know there are aliases and _routing.
We can't use one alias per PinCode because a person's PinCode changes very frequently. Also, each alias creates a different index, and having more than 50 million indices doesn't look scalable.
For routing, if we use PinCode as the custom routing parameter and later change the cluster size, does Elasticsearch automatically re-index the records to the right shards/nodes? Also, do we know how much of a performance boost we can expect?

I am open to other suggestions on how we can set up the cluster and improve performance. Our queries need to serve a live website, so we are looking for good query latency.

Christian_Dahlqvist · August 11, 2018, 5:32am

How many concurrent queries are you looking to serve? How often are documents updated and/or indexed?

Chinmaya_Poswalia · August 14, 2018, 9:14pm

We are looking to serve around 200 queries per second (assuming 100 ms latency per query, that is around 20 concurrent queries).
We expect to send the cluster around 5000 updates per second.
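The estimate of about 20 concurrent queries is consistent with Little's law (concurrency = throughput × latency) at a latency of 100 ms per query. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def concurrent_queries(queries_per_second, latency_ms):
    """Average number of in-flight queries, by Little's law."""
    return queries_per_second * latency_ms / 1000


if __name__ == "__main__":
    print(concurrent_queries(200, 100))  # 20.0
```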

Christian_Dahlqvist · August 15, 2018, 3:15am

Routing would allow each query to be served by a single shard, which would make it more efficient to support a large number of queries. The cost, however, is that updates where the pin code changes need to be handled as a delete followed by an insert rather than a normal update, as the document may need to be moved to a different shard. As you have a fair amount of updates, this may not be worthwhile. The only way to know for sure which setup is best for your use case is to benchmark it with realistic data and a realistic load mix.
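The delete-then-insert handling described above can be sketched as bulk API action lines. This is an illustration, not a definitive implementation: the index name `persons`, the document ID, and the helper name are hypothetical, and the exact metadata keys accepted in a bulk action (for example whether a type is also required) depend on the Elasticsearch version.

```python
import json


def move_document_actions(index, doc_id, doc, old_pincode, new_pincode):
    """Return bulk NDJSON lines that re-home a document whose PinCode changed."""
    index_action = {"index": {"_index": index, "_id": doc_id, "routing": new_pincode}}
    if old_pincode == new_pincode:
        # Same routing value means same shard: overwrite in place.
        return [json.dumps(index_action), json.dumps(doc)]
    # Different routing value: remove the copy reachable via the old routing,
    # then index the document under the new routing value.
    delete_action = {"delete": {"_index": index, "_id": doc_id, "routing": old_pincode}}
    return [json.dumps(delete_action), json.dumps(index_action), json.dumps(doc)]


if __name__ == "__main__":
    lines = move_document_actions(
        "persons", "p1", {"PersonID": "p1", "PinCode": "456"}, "123", "456"
    )
    # A bulk request body is newline-delimited and must end with a newline.
    print("\n".join(lines) + "\n", end="")
```

Because the action line carries both `_id` and the routing value, neither has to appear in the URL when the bulk API is used.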

Chinmaya_Poswalia · August 15, 2018, 4:24am

Thanks @Christian_Dahlqvist for your response. I am also wondering whether changing the refresh interval would help here, for example increasing it to 30 seconds so that multiple updates to a shard can be processed together. How does that fare when we are already posting the updates in bulk?

Christian_Dahlqvist · August 15, 2018, 4:37am

It can certainly help, but it will delay updates becoming available for search. If you went with routing and moving documents between shards, it could also increase the time during which data may show up in different shards.
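For reference, the refresh interval is a dynamic index setting named `index.refresh_interval`. A small sketch of the settings body for the 30-second value discussed above (the helper name is hypothetical; the body would be sent with a PUT to the index's `_settings` endpoint):

```python
import json


def refresh_interval_settings(seconds):
    """Build the body for an index settings update that changes the refresh interval."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return {"index": {"refresh_interval": f"{seconds}s"}}


if __name__ == "__main__":
    print(json.dumps(refresh_interval_settings(30)))
    # {"index": {"refresh_interval": "30s"}}
```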

Chinmaya_Poswalia · August 15, 2018, 11:49pm

Thanks @Christian_Dahlqvist for your prompt responses.
How does rebalancing work when using custom routing? For example:
Let's say we initially have 5 shards and 100 records, and with custom routing let's assume each shard gets 20 records. Later, when we decide to increase the number of shards to 10, how will Elasticsearch redistribute the records? I guess that to figure out the right shard, Elasticsearch will need to know the routing key. Does Elasticsearch store the routing key along with the record? Or does the application owner need to manage rebalancing the cluster?

Also, is there a way to specify _id and _routing in the document/record itself? I know we have to specify them in the URL, but I was wondering if we can specify them in the document, as that would be easier to manage and debug.
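As background for the shard-count question above: the Elasticsearch documentation gives the routing rule, in simplified form, as `shard_num = hash(_routing) % num_primary_shards`, and the number of primary shards is fixed when the index is created. The sketch below illustrates only that formula. It uses CRC32 as a stand-in hash (Elasticsearch uses Murmur3 internally), and the PinCode values are made up.

```python
import zlib


def shard_for(routing_value, num_primary_shards):
    """Simplified routing rule: hash(routing) % number of primary shards."""
    # Stand-in hash for illustration only; not the hash Elasticsearch uses.
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    pincodes = [str(p) for p in range(100000, 100020)]  # made-up values
    moved = [p for p in pincodes if shard_for(p, 5) != shard_for(p, 10)]
    print(f"{len(moved)} of {len(pincodes)} routing values map to a different shard")
    # With a doubled shard count, each value either stays on shard s or lands on s + 5.
    assert all(shard_for(p, 10) % 5 == shard_for(p, 5) for p in pincodes)
```

Because the result depends on the shard count, a different count means a different placement for some routing values, which is why the count cannot simply be changed on an existing index. Adding nodes is a separate matter: whole shards are relocated between nodes, and documents stay in the shard they were routed to.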

Christian_Dahlqvist · August 16, 2018, 10:21am

Are you referring to changing the number of shards using the split index API?

