Need help deciding how to partition data

The document in my RDMS has a schema similar to this:

{
PatientId: "string",
Date: "date",
IsAvailable: "bool",
_hospitalId: "6-digit number which can be a number or a string"
}

The _hospitalId acts like a partition key. All search queries will take place within the same _hospitalId. There are about 10,000 hospitals, with each hospital having between 100,000 to millions of patients.

What is the best way to divide the data in Elasticsearch? The ideal scenario would be to have each hospital as a separate index, but this many indexes is not recommended. How should I go about this?

If you always, or in most cases, are querying with a filter on a specific field and the cardinality of this field is reasonably large and the number of documents per field value is reasonably well distributed (which seems to be the case in your scenario) you can use a single index combined with index routing.

You set up an index with a reasonable number of primary shards, and we will for this example assume 10 primary shards gives you a good shard size. Whe you index a document you use the hospitalId as raouting key, which means all documents for that hospital will go to the same shard. The shard will contain documents for multiple hospitals though, which makes this approach much more efficient than single index per hospital.

When you query for a single hospital you also provide the hostpitalId as routing key, and this allows Elasticsearch to only query one of the 10 shards as it knows all relevant data is located there.

If you have 10 primary shards only 10% of the data is queried for each request. If you have 20 primary shards only 5% of the data is queries etc.

When you use routing it is important to not have too many primary shards and be aware that the shard size may not be uniform as it is possible some of the larger hospitals may end up in the same shard (which is why you want to consider cardinality and distribution of data set size).

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.