What is the best indexing strategy in Elasticsearch for many small shards?

I'm new to Elasticsearch and I'm trying to understand the best practices for improving performance in my scenario. I currently have 18 indices, one for each environment and location; each index has 1 shard and is < 1 GB in size:
> Examples: search_books_us, search_forum_us, search_books_br, search_forum_br, etc.
The document structure is basically the same across indices:

```json
{
  "title": "text",
  "description": "text",
  "content": "text",
  "type": "text"
}
```

I want to keep the separation by location (US, PT, etc.), but I'm not sure if it’s better to:

  1. Keep the 18 small indices (~1 GB each).
  2. Merge related indices, for example:
    * merge all search_books_* with search_forum_*
    * end up with ~9 indices (~2 GB each)
    * distinguish document types using a field filter:
"filter": [{ "term":{ "type": "FORUM" }}]

  3. Merge everything into a single large index (~18 GB) and filter by both type and an additional location field. (I'm not very comfortable with this option because I prefer keeping the location context separated.)
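To make the three options concrete, here is a sketch of what the query bodies might look like for each one. The field names `type` and `location` come from the description above; the merged index names (`search_us`, `search_all`) and the `content` match clause are illustrative assumptions, not actual names from the cluster.

```python
# Option 1: dedicated index per type and location, no filter needed.
option1 = {
    "index": "search_forum_us",
    "body": {"query": {"match": {"content": "shard sizing"}}},
}

# Option 2: one merged index per location (hypothetical name "search_us"),
# distinguishing document types with a term filter.
option2 = {
    "index": "search_us",
    "body": {
        "query": {
            "bool": {
                "must": [{"match": {"content": "shard sizing"}}],
                "filter": [{"term": {"type": "FORUM"}}],
            }
        }
    },
}

# Option 3: a single index (hypothetical name "search_all"),
# filtering on both type and an additional location field.
option3 = {
    "index": "search_all",
    "body": {
        "query": {
            "bool": {
                "must": [{"match": {"content": "shard sizing"}}],
                "filter": [
                    {"term": {"type": "FORUM"}},
                    {"term": {"location": "US"}},
                ],
            }
        }
    },
}
```

Since the filters are non-scoring and cacheable, the main difference between the options is operational (index management) rather than per-query cost.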

I’ve read some topics about this, but they are based on old/deprecated Elasticsearch versions.

I also read recommendations saying that shards in the 20–40 GB range are generally healthy, but in my case options 1 and 2 would still result in much smaller shard sizes, even after merging. So my question is:

- Is it better to keep many small indices, merge some of them and use filters, or aim for fewer/larger shards?

Am I heading in the right direction? Any guidance on the trade-offs (query speed, resource usage, cluster/heap overhead, search strategy, etc.) would be really helpful. Thanks!

What type of performance problems are you experiencing?

What is the number of concurrent queries you need to be able to support? What are your latency requirements?

Is your querying aligned with how you have organised your indices, e.g. are you generally querying a single index at a time?

Why do you want to do this?

I would expect this to depend on how you are querying this data.

Your data set is so small that this is not an issue. This recommendation comes from having seen numerous deployments with thousands of very small shards, which is quite inefficient.

Having 18 indices and shards is still a small number, so I do not necessarily see a problem with any of the approaches you describe. Any of them may be optimal depending on your querying patterns.
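One quick way to sanity-check this on any cluster is the `GET _cat/shards?format=json&bytes=b` endpoint, which reports the store size of each shard in raw bytes. A minimal sketch of a helper that flags small primaries (the sample response below is made up, trimmed to the relevant fields):

```python
import json

def undersized_shards(cat_shards_json, min_bytes=1 << 30):
    """Given the body of GET _cat/shards?format=json&bytes=b,
    return (index, shard, size_bytes) for primaries below min_bytes."""
    small = []
    for row in json.loads(cat_shards_json):
        if row.get("prirep") != "p" or row.get("store") is None:
            continue  # skip replicas and unassigned shards
        size = int(row["store"])
        if size < min_bytes:
            small.append((row["index"], row["shard"], size))
    return small

# Hypothetical response trimmed to the relevant fields:
sample = json.dumps([
    {"index": "search_books_us", "shard": "0", "prirep": "p", "store": "734003200"},
    {"index": "search_forum_us", "shard": "0", "prirep": "p", "store": "2147483648"},
    {"index": "search_forum_us", "shard": "0", "prirep": "r", "store": "2147483648"},
])
print(undersized_shards(sample))
# [('search_books_us', '0', 734003200)]
```

With only 18 primaries the per-shard overhead is negligible either way; this kind of check mostly matters when shard counts run into the hundreds or thousands.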

If you are seeing performance problems, they may be a result of insufficient system resources rather than sharding. What is the size and specification of your node/cluster?


Thank you for your attention, Christian! I apologize if I sounded too much like a layman; I got a bit lost.

I initially interpreted the number of shards as a problem, but after analyzing the points you made, I see there isn't a bottleneck in my case. I was unsure whether the best practice would be to merge the indices or keep them separate.

Analyzing the latency of the different configurations, I see that the query time remains below 350ms, regardless of the setup. My current machine has 6 vCPUs and 50GB of RAM (it's also used for other purposes).

I'm very grateful for your response! I don't think I'm facing a real problem here.

Just to add more context and answer your questions:

  • I hadn't identified any real performance bottlenecks; I was trying to understand what the appropriate shard allocation would be for a basic case like this one.
  • I need to support up to 5 concurrent requests, and my ideal latency is closer to 200ms.
  • That's right, I query one index at a time.
  • I want to keep them separate for the simplicity of removing and recreating one of the indices as needed.

Finally, just to close out my doubts... Does it make sense that in this low-data scenario, keeping a separate index for the forum could be a “lower cost” option, since we wouldn't need to apply a filter to the query?

Try to make sure the full data set fits in the operating system page cache so you are not limited by disk I/O. Larger heap does not generally mean better performance unless you are seeing issues with long or frequent GC.

Then I would keep the structure as it is.

I do not think it makes much of a difference at this scale.
