Background & Setup
I have an Elasticsearch index that stores information about users. Each user belongs to a group, and every user document contains a group_id
field. All queries to Elasticsearch are filtered by group_id
, since all the required functionalities are group based.
Current Approach and Problem
To optimize query performance, I am using group_id
as the routing key so that all documents for a group are stored together on the same shard.
However, this approach has led to significant shard imbalance:
- Some groups are extremely large, containing millions of users, while others are very small (e.g., 10k users).
- This results in some shards being large (hotspots ~50% difference), while others are way smaller.
- The imbalance might cause performance degradation and operational challenges.
Questions
- Routing Strategy:
- Is it advisable to use a composite routing key (e.g.,
group_id:user_id
) to improve shard balance? - What are the trade-offs in terms of query performance and data locality?
- Index Design:
- Should I create dedicated indices for very large groups?
- Are there recommended patterns for handling “hot” groups?
- Shard Sizing:
- How can I maintain shard sizes within best practice limits (10–50GB) given the uneven distribution of users across groups?
- Alternative Solutions:
- Are there other techniques (e.g., custom routing logic, data tiering, index lifecycle management) that could help achieve a more balanced and scalable setup?