Add back _primary* search preferences

We run large time-based clusters with 1 primary and 1 replica on each node. We liked the _primary* search options because only the primaries are loaded, so searching uses no extra heap or disk cache. Without the _primary* options, my assumption is that when multiple searches are run, both the primary and the replica of every index could end up loaded on every machine. For those of us with a low volume of searches that each cover a large amount of data, this would cut our effective memory cache/heap almost in half.

The statement in the online docs that "The cache-related benefits of these options can also be obtained using _only_nodes, _prefer_nodes, or a custom string value instead." doesn't apply to us since every node has a shard for disk capacity and speed.

Am I thinking about this wrong?

(https://github.com/elastic/elasticsearch/issues/41115)

Hi @Andy_Wick,

I think that ?preference=custom_string_value does what you want. If you use the same custom_string_value for every search then every coordinating node will choose the same copy of each shard for each search, so will generally pick copies with hot caches. The choice will change if shards move around (much as with ?preference=_primary). Unlike ?preference=_primary, if the chosen copy is overloaded then ?preference=custom_string_value will fall back to another copy.
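
To illustrate why reusing one string gives a repeatable choice, here is a toy model in Python. It is not Elasticsearch's routing code: the function name `pick_copy`, the hashing scheme, and the node and index names are all assumptions made up for the sketch, and it ignores the fallback behaviour described above.

```python
import hashlib


def pick_copy(preference: str, index: str, shard: int, copies: list[str]) -> str:
    """Toy model: map (preference, index, shard) to one copy, deterministically.

    Hypothetical helper for illustration only; the real selection logic
    in Elasticsearch is not reproduced here.
    """
    # Sort so the result does not depend on the order the copies are listed in.
    ordered = sorted(copies)
    key = f"{preference}|{index}|{shard}".encode("utf-8")
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return ordered[bucket % len(ordered)]


# Hypothetical node and index names.
copies = ["node-a", "node-b"]
first = pick_copy("custom_string_value", "logs-2019.04.10", 0, copies)
again = pick_copy("custom_string_value", "logs-2019.04.10", 0, list(reversed(copies)))
print(first == again)  # True: same string and same copies give the same choice
```

In this model the choice for a shard depends only on the string and on that shard's own set of copies, which is the property that keeps the same caches warm from one search to the next.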

Can you explain a bit more why this doesn't work for you? In particular I don't understand this:

> doesn't apply to us since every node has a shard for disk capacity and speed.

If you want to search the local shards you can use ?preference=_local, but this doesn't have the cache benefits you were talking about earlier.

> It's a feature request, or must those be discussed first on the forum?

It's not yet clear that this is the feature you need (see above) so yes it's best to discuss this here first. GitHub issues trigger notifications for lots of people so it's best to try and avoid too much exploration there; if we can work out why ?preference=custom_string_value doesn't fit your use case then of course we can reopen your issue.

Unless I'm reading the documents wrong, ?preference=custom_string_value could still load replicas into memory? _local won't work because I have 50 nodes. (I think my original message was unclear when I said 1p & 1r per node: I meant there would be 50p & 50r on a 50-node cluster for each index, with a forced limit of 2 shards per node so it's an even spread.)

To me it seems like ?preference=custom_string_value optimizes for single-search speed, whereas I want to optimize for the health of the cluster. We run big, slow, relatively low query volume, time-based clusters. We want to treat replicas as being only for resilience, not for increasing search speed. (If I could "freeze/close" replicas so they were only used when the primary failed, that would be amazing.)

I guess the question is how sensitive ?preference=custom_string_value is when deciding that a node is overloaded and sending a search to the replicas, which then causes that node to push other things out of cache and leads to cascading issues.

A lot of data structures are kept on heap even if the node is not searched. Have you quantified how much difference the caching caused by querying actually makes?

Not with 6.0 or 7.0. I'm not just talking about heap, but also disk cache, which is the much larger memory amount.

Thoughts on how to test?

Thanks,
Andy

I'm not 100% sure what "load into memory" means in this context, or how it differs between ?preference=custom_string_value and ?preference=_primary. In both cases all shard copies (primary and replica) are running, so they consume a certain amount of heap, but if a shard copy (primary or replica) is never touched by any search or indexing then it probably won't be consuming any filesystem cache.

This is an interesting idea. It depends how long you're willing to wait for the replica to be promoted once the primary fails. If you need it up immediately then it would already need to be running as it is today; if you can wait long enough then perhaps we could recover it from a snapshot; the middle ground could be similar to frozen indices except only freezing some of the copies.

It'd fall back to another shard only under the same circumstances that a search with ?preference=_primary would fail (or return partial results). Does this happen very often in your cluster?

It's possible we could suppress this fallback behaviour, or apply a bound on how many times to fall back (you would want 0).

To get an indication I guess you could run a reasonably long query test without any preference at all and then compare this to a similar test where you set a single preference for all queries, which will cause the same shards to be repeatedly queried.

Are you running searches and returning documents or primarily aggregations?

> I'm not 100% sure what "load into memory" means in this context, or how it differs between ?preference=custom_string_value and ?preference=_primary. In both cases all shard copies (primary and replica) are running, so they consume a certain amount of heap, but if a shard copy (primary or replica) is never touched by any search or indexing then it probably won't be consuming any filesystem cache.

I mean heap or disk cache, any memory.

The scenario is low-volume, constant queries (with aggregations) against hundreds to thousands of shards across tens of nodes. The goal is to use only the minimal amount of memory, so other queries will benefit. Before _primary* it was possible to use nearly 2x the required memory because queries could be sent to both primaries and replicas. (And I get that it's under 2x because of the heap that is required per shard and so on, but it's easier to just say 2x.) We both measured this and had feedback from Elastic about it.

Now _primary* has been removed and the suggestion is to use preference=foobar, which does a consistent hash to the shards to be queried every time, as long as you keep using preference=foobar AND the cluster state stays the same. So it's "almost" the same as using _primary, but instead of forcing Elasticsearch to use the primaries, which is a deterministic choice, you allow it to pick which shards to use.

I guess it comes down to trust, and maybe I should just trust that it won't go back to how it was before, when we could cache only half as much because everything was loaded twice. :slight_smile:

My concern is around the language saying that this is based on cluster state. Does creating a new index, as we constantly do with time-based indices, qualify as a cluster state change that could tell the algorithm to pick new shards to send queries to? That would basically throw away many caches if it doesn't pick the same shards.

It is perhaps worth noting that you have no control over which shard copies are the primaries either :slight_smile:

I cannot see how this would happen with today's code, but I also cannot find any existing tests that prevent us from breaking this in future. I opened https://github.com/elastic/elasticsearch/pull/41150 to add such a test, which will also trigger a discussion about whether this property is something we actually want to offer a guarantee about.

> It is perhaps worth noting that you have no control over which shard copies are the primaries either :slight_smile:

True, but I can visually see where the primaries are, and I have a sense of control (real or not :wink:) of what's going on. Do I have a way to see/query where queries with preference=foobar will be sent?
Maybe it doesn't matter and I just need to let go and stop being a muggle.

Thanks for adding a test case.

I think the profile API is a better way to dig into the details of the execution of a search:

GET /_search?filter_path=profile.**.id&preference=foobar
{
  "profile": true
}

# 200 OK
# {
#   "profile": {
#     "shards": [
#       {
#         "id": "[R1sfCGiGQ_OhnV6N-qETZw][i][2]"
#       },
#       {
#         "id": "[R1sfCGiGQ_OhnV6N-qETZw][j][0]"
#       },
#       {
#         "id": "[myzT_-L2Q5WbRZIFvbYVDw][i][0]"
#       },
#       {
#         "id": "[myzT_-L2Q5WbRZIFvbYVDw][j][1]"
#       },
#       {
#         "id": "[xqX-M94IT7agrTn-wkJhUw][i][1]"
#       }
#     ]
#   }
# }

There's lots more info too; I'm using filter_path to cut it down. Here's how to read an id:

  "id": "[xqX-M94IT7agrTn-wkJhUw][i][1]"
                                 ^^^^^^ index name and shard number
          ^^^^^^^^^^^^^^^^^^^^^^ node ID
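
If you want to check this across many searches rather than by eye, the ids can be split apart programmatically. The following is a minimal Python sketch that assumes the id layout shown above, `[node ID][index name][shard number]`. The names `ShardCopy`, `parse_profile_id`, `copies_by_node` and `same_copies` are made up for this example and are not part of any Elasticsearch client.

```python
import re
from collections import defaultdict
from typing import NamedTuple


class ShardCopy(NamedTuple):
    node_id: str
    index: str
    shard: int


# Assumed layout: "[<node ID>][<index name>][<shard number>]".
_PROFILE_ID = re.compile(r"^\[([^\]]+)\]\[(.+)\]\[(\d+)\]$")


def parse_profile_id(profile_id: str) -> ShardCopy:
    """Split one profile shard id into node ID, index name and shard number."""
    match = _PROFILE_ID.match(profile_id)
    if match is None:
        raise ValueError(f"unrecognised profile id: {profile_id!r}")
    node_id, index, shard = match.groups()
    return ShardCopy(node_id, index, int(shard))


def copies_by_node(response: dict) -> dict[str, list[tuple[str, int]]]:
    """Group the (index, shard) pairs in a profile response by node ID."""
    grouped = defaultdict(list)
    for entry in response["profile"]["shards"]:
        copy = parse_profile_id(entry["id"])
        grouped[copy.node_id].append((copy.index, copy.shard))
    return {node: sorted(shards) for node, shards in grouped.items()}


def same_copies(first: dict, second: dict) -> bool:
    """True if two profile responses were served by the same shard copies."""
    return copies_by_node(first) == copies_by_node(second)


# The filtered response from the example above, as a Python dict.
response = {"profile": {"shards": [
    {"id": "[R1sfCGiGQ_OhnV6N-qETZw][i][2]"},
    {"id": "[R1sfCGiGQ_OhnV6N-qETZw][j][0]"},
    {"id": "[myzT_-L2Q5WbRZIFvbYVDw][i][0]"},
    {"id": "[myzT_-L2Q5WbRZIFvbYVDw][j][1]"},
    {"id": "[xqX-M94IT7agrTn-wkJhUw][i][1]"},
]}}

for node, shards in copies_by_node(response).items():
    print(node, shards)
```

Running the same profiled search before and after creating a new index and passing both responses to `same_copies` would be one way to see whether the chosen copies moved.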
