Elasticsearch search response latency spikes during _refresh with G1GC

Our Elasticsearch cluster has 3 master nodes and 12 data nodes deployed across 2 IDCs.

In front of Elasticsearch there is a search API server that runs multi-search requests across three indexes. Under normal conditions it shows an average response time of under 200 ms.

The index in question is one of the three indexes in the multi-search; it is configured with 4 shards and 2 replicas.
We incrementally index into it through the bulk API every 5 minutes.
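For context, the incremental updates go through the standard `_bulk` endpoint; a minimal sketch of one such request (the index name `my-index` and the document ID are placeholders, and the field names match the mapping discussed later in the thread):

```
POST /my-index/_bulk
{ "index": { "_id": "doc-1" } }
{ "terms": ["fox", "lazy"], "scored_terms": { "fox": 1.346, "lazy": 2.34 } }
```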

Here is the problem: during every incremental indexing run, some search requests are delayed by 2-3 seconds or more.

We have confirmed the following facts.

  • We tried changing the bulk API batch size to 1, 100, 200, 500, 1000, and 2000, but delays occur in all cases.
  • CPU, memory, and disk I/O show nothing unusual at the time of the delay.
  • With refresh_interval set to -1, no delay occurs during indexing.
  • If _refresh is called some time after indexing, the delay occurs at that point.
  • When changing the JDK's GC from G1GC to CMSGC, there is no delay.
  • Checking GC counts with jstat at the time of the delay shows only the young GCs that normally occur periodically.
  • CMSGC uses much less heap memory and slightly less CPU than G1GC, but this seems like a natural difference.
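For reference, the refresh behavior described in the bullets above corresponds to these two calls (index name is a placeholder):

```
PUT /my-index/_settings
{ "index": { "refresh_interval": "-1" } }

POST /my-index/_refresh
```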

We wonder why this problem occurs and how to solve it properly.
Any advice is appreciated.


I forgot to mention that our Elasticsearch version is 7.16.3 and the heap size (Xmx, Xms) is set to 30 GB.

Are you using parent-child (join datatype) or a lot of nested mappings? If I recall correctly, these can make refreshes take longer and require more work, as more in-memory data structures need to be rebuilt.

If you only make changes to the indices every 5 minutes I would recommend trying the following approach:

  1. Create an alias for each of your indices and query through this.
  2. When you want to update the index, first clone it to create a separate copy. If you want, you may at this point move the clone to a separate node, but this may not be required.
  3. Then run the indexing and updating against this new, cloned index while you are still querying the old one through the alias.
  4. When done you may run a few queries against the new index to make sure it is warmed up.
  5. You then switch the alias from the old index to the new in a single step to switch all querying to the newly updated index.
  6. The last step is to delete the index that is out of date and no longer used.
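Steps 2 and 5 above can be sketched with the clone and aliases APIs (index and alias names are placeholders; the source index must be write-blocked before cloning):

```
PUT /my-index-v1/_settings
{ "index": { "blocks.write": true } }

POST /my-index-v1/_clone/my-index-v2

POST /_aliases
{
  "actions": [
    { "remove": { "index": "my-index-v1", "alias": "my-alias" } },
    { "add": { "index": "my-index-v2", "alias": "my-alias" } }
  ]
}
```

Note that the clone inherits the write block, so `index.blocks.write` needs to be cleared on the new index before indexing into it.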

Thanks for the comment.
No, we do not use join or nested mappings. The indexes have a keyword array field for filtering, a rank_features field for scoring, and an object field with metadata (not searchable).
We considered your solution, but we think it is difficult to apply: we already use aliases for switching after the static indexing (once a day), and the index is too big to clone on every dynamic indexing run.

Our rank_features field looks like this:

"scored_terms" : {
  "lazy" :2.34,
  "fox" ;1.346,
  "jump": 7.543
 // ...

and the keyword field for filtering looks like this:

"terms": [
  // ,...

"score_terms" and "terms" fields have exactly same terms, but "terms" is used for filtering (AND matcher searches for documents containing all search terms) and "score_terms" is used for scoring in the should clause. could it be a problem if I use the rank_features field as described above?


I thought refreshes do not block search requests (unless a background refresh is skipped because no search requests have occurred for longer than index.search.idle.after). Does refresh block search requests?

Assuming it does not, I guess the problem is caused by segments that are not warmed up. (Segments created by the refresh would not yet be cached.)

Is there a performance difference between CMSGC and G1GC when searching segments that have not been warmed up?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.