Shard sizing and performance with preference=_primary_first

Hi All,

I am experiencing slow search performance on a fairly large index (we are talking about 5-10 seconds for a simple term query on a long field that returns ~10k docs).

First of all some details about my cluster:

  • Elasticsearch version: 2.2.0
  • 4 nodes with 64GB RAM (30GB heap) and 24 cores each
  • Storage: RAID 10 spinning disks (yes ... I know it's not ideal)
  • Index:
    • ~ 2,000,000,000 docs
    • overall size is 500GB
    • 4 Shards (replication = 1 - see below)
  • Search volume is quite low, but queries involve a lot of aggregations and performance matters a lot

I initially tried 2 shards and 1 replica, but the search performance was not that great. I guess that's because every query I run then utilizes only two of my 4 nodes (each of which has to search a 250GB shard).

Now I tried to do the following in order to get better utilization of resources across all 4 nodes:

  • 4 shards to have one primary shard on each node
  • run every search request with preference=_primary_first, because I only want the replicas for failover.
    I have already learned that ES does not balance primary shards (only the overall shard count), but I manually moved them onto the right nodes (with the cluster reroute API) for the purpose of this test.

So now I have one primary and one replica shard on every node.
When I ran a search with preference=_primary_first it was still quite slow, and I noticed using hot_threads that two threads were still running searches on the same node.
Does the preference setting really work, and is it intended to be used for such a case? Does anyone have experience with this?
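For reference, preference is passed as a query-string parameter on the search request. A minimal sketch of how such a request is built (host, index name, and query field are placeholders, not taken from this thread):

```python
import json

def build_search_request(host, index, query, preference=None):
    """Return (url, body) for an Elasticsearch _search request,
    optionally routed with a preference such as _primary_first."""
    url = f"{host}/{index}/_search"
    if preference:
        url += f"?preference={preference}"
    body = json.dumps({"query": query})
    return url, body

url, body = build_search_request(
    "http://localhost:9200", "my_index",
    {"term": {"my_long_field": 42}},   # placeholder field and value
    preference="_primary_first",
)
print(url)  # http://localhost:9200/my_index/_search?preference=_primary_first
```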

Now when I set the replication factor to 0 with 4 shards, I get great performance (500-3000ms uncached). Exactly what I need.
I could theoretically just run with 0 replicas in production, because the data is already mirrored on the RAID.
Practically, I don't think it's a good idea, because from time to time I run into GC issues and have to restart nodes.

Does anyone have an idea how I can achieve running single queries across all 4 nodes and still have failover replica shards in place?


Query and aggregation latency usually depend on the size of the shards, and it sounds like your shards may be too large. Each query is executed against all shards in parallel, but a single thread is used per query per shard. By having more smaller shards, you will be able to get more work done in parallel and better utilise the CPU cores you have at your disposal.

I generally recommend that you measure the latency of typical queries and aggregations against a single shard while step by step increasing the size of the shard by adding more data. This will show you how the latency varies with shard size and will allow you to determine the ideal shard size. Then divide the expected size of the total data set by this to get the number of shards to use.
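The last step above is simple arithmetic; a sketch, using the 500GB total from this thread and a hypothetical measured ideal shard size of 30GB:

```python
import math

def shard_count(total_gb, ideal_shard_gb):
    """Number of primary shards needed so that each shard
    stays at or below the measured ideal size."""
    return math.ceil(total_gb / ideal_shard_gb)

print(shard_count(500, 30))  # 17 primaries for a 500GB index at ~30GB/shard
```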

We often recommend that you keep shards below 50GB, as very large shards tend to affect recovery speed. The ideal size may, however, be lower than this.

Hi Christian,

thanks for the feedback!
I will try to identify the ideal shard size with this approach.
Can I really do this by running the same set of queries over and over again?
Won't the results be skewed by IO buffering and caching in ES?

Anyway, let's say I end up with e.g. 16 shards in the "ideal" size.
This would be overall 32 shards (with replica of one).
That means 8 shards per node which could theoretically all be used by one single query.
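The shard arithmetic above, spelled out (the 16 shards are the hypothetical "ideal" count from this example, not a recommendation):

```python
primaries = 16   # hypothetical "ideal" primary shard count
replicas = 1     # one replica per primary
nodes = 4

total_shards = primaries * (1 + replicas)   # primaries plus replica copies
shards_per_node = total_shards // nodes     # assuming even balancing
print(total_shards, shards_per_node)        # 32 8
```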

Do you think it's okay to have 8 Lucene threads doing disk IO concurrently on a spinning-disk RAID 10?
I always thought doing disk IO with too many concurrent threads would be counterproductive.

Also, I find it hard to measure whether the workload during queries is IO bound or CPU bound.
The finest resolution I can get from e.g. iostat is one second, which is too coarse to measure queries that run 2-5 seconds.

8 shards per node is not very much. As there are a significant number of parameters that affect performance for a specific use case, the best way to find out is to benchmark.

Okay, thank you! I'll try with more shards and see what I get.
It looks like I get good performance up to a shard size of ~30GB.

Running the same queries over and over may allow Elasticsearch to cache more than is realistic, so I would recommend generating queries based on parameters to get as close to realistic querying as possible.
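One way to follow this advice is to randomize the query parameters between runs so repeated benchmark iterations don't keep hitting the same caches. A minimal sketch (the field name and value range are placeholders):

```python
import random

def random_term_query(field, low, high):
    """Build a term query whose value varies between benchmark runs,
    reducing the chance of unrealistically warm caches."""
    return {"query": {"term": {field: random.randint(low, high)}}}

q = random_term_query("my_long_field", 0, 1_000_000)
```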

Now I thought I had found my sweet spot with a shard size of 25-30GB.
I was constantly running uncached queries while loading up my index (simple term queries on a long field, with sorting on a short field).
During indexing everything looked great. I had very good response times (500-1200ms), even up to shortly before the indexing finished.

I typically create my indices with 0 replicas and later set it to 1 (in order to increase indexing throughput).
After I did that, I was back at 2000-3500ms for the same queries.

When I change it back to 0, it's back down to 500-1200ms.
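For reference, the replica toggle I use looks like this as raw HTTP requests against the index settings endpoint (host and index name are placeholders):

```python
import json

def set_replicas(host, index, count):
    """Return (method, url, body) for an index-settings update
    changing number_of_replicas."""
    url = f"{host}/{index}/_settings"
    body = json.dumps({"index": {"number_of_replicas": count}})
    return "PUT", url, body

# During the bulk load:
print(set_replicas("http://localhost:9200", "my_index", 0))
# After loading finishes:
print(set_replicas("http://localhost:9200", "my_index", 1))
```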

Is there some sort of query time overhead that comes with replication?