How to Balance Shards When Routing by Field in Elasticsearch?

Background & Setup

I have an Elasticsearch index that stores information about users. Each user belongs to a group, and every user document contains a group_id field. All queries to Elasticsearch are filtered by group_id, because every feature we need is scoped to a group.

Current Approach and Problem

To optimize query performance, I am using group_id as the routing key so that all documents for a group are stored together on the same shard.
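For concreteness, here is a minimal sketch of what this setup looks like at the request level. It only builds the REST requests (the `routing` query parameter on the document and search endpoints) and does not send them. The index name `users`, the document ID, and the helper function names are hypothetical and chosen for illustration.

```python
from urllib.parse import quote, urlencode


def index_request(index: str, doc_id: str, doc: dict) -> dict:
    """Build a PUT request that routes the document by its group_id."""
    routing = str(doc["group_id"])
    return {
        "method": "PUT",
        "path": f"/{quote(index)}/_doc/{quote(doc_id)}?" + urlencode({"routing": routing}),
        "body": doc,
    }


def search_request(index: str, group_id: str) -> dict:
    """Build a search request that is sent only to the shard owning group_id."""
    return {
        "method": "POST",
        "path": f"/{quote(index)}/_search?" + urlencode({"routing": str(group_id)}),
        # Routing narrows the search to one shard, but several groups can
        # share that shard, so the group_id filter is still required.
        "body": {"query": {"bool": {"filter": [{"term": {"group_id": group_id}}]}}},
    }


if __name__ == "__main__":
    # Hypothetical document and IDs, for illustration only.
    print(index_request("users", "u1", {"group_id": "g42", "name": "Ada"})["path"])
    print(search_request("users", "g42")["path"])
```

With this scheme every document of a group hashes to the same shard, which is what makes single-shard queries possible and also what ties shard size to group size.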

However, this approach has led to significant shard imbalance:

  • Some groups are extremely large, containing millions of users, while others are very small (e.g., 10k users).
  • This leaves some shards much larger than others (roughly a 50% size difference), and the larger shards become hotspots.
  • The imbalance might cause performance degradation and operational challenges.
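To put a number on the imbalance, one option is to compare primary shard sizes as reported by the cat shards API (for example `GET _cat/shards/users?format=json&bytes=b`). The sketch below assumes rows in that JSON shape, with `prirep` and `store` columns; the function name and the sample rows are hypothetical and are not measurements from my cluster.

```python
def shard_skew(rows: list[dict]) -> dict:
    """Summarize primary shard sizes from cat-shards style rows (sizes in bytes)."""
    sizes = [
        int(row["store"])
        for row in rows
        # Skip replicas, and shards without a reported size (e.g. unassigned).
        if row.get("prirep") == "p" and row.get("store") is not None
    ]
    if not sizes:
        raise ValueError("no primary shards with a reported store size")
    mean = sum(sizes) / len(sizes)
    return {
        "largest": max(sizes),
        "smallest": min(sizes),
        "mean": mean,
        # 1.0 means perfectly balanced; higher means a more dominant shard.
        "max_over_mean": max(sizes) / mean,
    }


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample = [
        {"shard": "0", "prirep": "p", "store": "30000000000"},
        {"shard": "1", "prirep": "p", "store": "20000000000"},
        {"shard": "2", "prirep": "p", "store": "10000000000"},
        {"shard": "0", "prirep": "r", "store": "30000000000"},
    ]
    print(shard_skew(sample))
```

Tracking `max_over_mean` over time would show whether the skew is growing as the large groups grow.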

Questions

  1. Routing Strategy:
  • Is it advisable to use a composite routing key (e.g., group_id:user_id) to improve shard balance?
  • What are the trade-offs in terms of query performance and data locality?
  2. Index Design:
  • Should I create dedicated indices for very large groups?
  • Are there recommended patterns for handling “hot” groups?
  3. Shard Sizing:
  • How can I maintain shard sizes within best practice limits (10–50GB) given the uneven distribution of users across groups?
  4. Alternative Solutions:
  • Are there other techniques (e.g., custom routing logic, data tiering, index lifecycle management) that could help achieve a more balanced and scalable setup?
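To make the trade-off in question 1 concrete, here is a small offline simulation comparing routing by group_id alone with a composite group_id:user_id key. It is only a sketch under stated assumptions: `zlib.crc32` is a stand-in hash and not Elasticsearch's actual routing function, and the shard count, group names, and group sizes are hypothetical.

```python
import zlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical primary shard count

# Hypothetical group sizes: one very large group and several small ones.
GROUPS = {"big": 20_000, "small-1": 100, "small-2": 100, "small-3": 100}


def shard_for(routing_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a routing key to a shard using a stand-in hash (not Elasticsearch's)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_shards


def routing_key(group_id: str, user_index: int, composite: bool) -> str:
    """Return either the plain group_id or a composite group_id:user_id key."""
    return f"{group_id}:user{user_index}" if composite else group_id


def distribute(groups: dict[str, int], composite: bool) -> Counter:
    """Count how many documents land on each shard."""
    counts: Counter = Counter()
    for group_id, n_users in groups.items():
        for i in range(n_users):
            counts[shard_for(routing_key(group_id, i, composite))] += 1
    return counts


def shards_touched(group_id: str, n_users: int, composite: bool) -> int:
    """Count the shards a query for one group would have to visit."""
    return len({shard_for(routing_key(group_id, i, composite)) for i in range(n_users)})


if __name__ == "__main__":
    for composite in (False, True):
        counts = distribute(GROUPS, composite)
        label = "composite" if composite else "group only"
        print(label, "largest shard:", max(counts.values()),
              "shards touched by 'big':", shards_touched("big", GROUPS["big"], composite))
```

In this model the composite key evens out document counts, but a query for one group can no longer be routed to a single shard and has to fan out across the shards that hold its documents. That loss of locality is the cost I am trying to weigh against the better balance.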