Query latency spike when a node joins the cluster

Hi,

We run Elasticsearch 8.6.0 with the LTR plugin on AWS EC2.

Each time a new instance (data node) joins the cluster, we see a short (< 1 min) spike in latency. The maximum latency can rise to 4-5 seconds. This happens directly after the new node joins the cluster and 5-6 minutes before the new node is ready to join the load balancer's target group. The slow requests are distributed over all old nodes in the cluster.

Any idea why this might be happening and how to deal with it?

Thank you.

What is the specification of the cluster in terms of node count, configuration, instance types used, type and size of storage used?

What is the full output of the cluster stats API?

What is the use case?

What kind of load is the cluster under?

Currently we have ~10 data nodes and 3 master nodes. We use Graviton instances (m7g.16xlarge, m6g.16xlarge, r7g.16xlarge, r6g.16xlarge), but we noticed the same problem with AMD instances. Since our index is small, we keep it in RAM. We have 6 shards.

Part of the configuration file:

http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/
indices.queries.cache.size: 20%
indices.requests.cache.size: 20%
cluster.routing.allocation.node_concurrent_incoming_recoveries: 6
indices.recovery.max_bytes_per_sec: 500mb
action.auto_create_index: .watches,.triggered_watches,.watcher-history-*,.monitoring-*,logstash*,performance*,-*
thread_pool.search.queue_size: 2000
cluster.deprecation_indexing.enabled: false
indices.lifecycle.history_index_enabled: false
ingest.geoip.downloader.enabled: false
xpack.ml.enabled: false
transport.compress: true
transport.compression_scheme: lz4

The cluster stats API output:

{
  "_nodes": {
    "total": 14,
    "successful": 14,
    "failed": 0
  },
  "cluster_name": "nnnnnnnn",
  "cluster_uuid": "nnnnnnnnnnnnn",
  "timestamp": 1693809168234,
  "status": "green",
  "indices": {
    "count": 2,
    "shards": {
      "total": 77,
      "primaries": 7,
      "replication": 10,
      "index": {
        "shards": {
          "min": 11,
          "max": 66,
          "avg": 38.5
        },
        "primaries": {
          "min": 1,
          "max": 6,
          "avg": 3.5
        },
        "replication": {
          "min": 10,
          "max": 10,
          "avg": 10
        }
      }
    },
    "docs": {
      "count": 2256514,
      "deleted": 945253
    },
    "store": {
      "size_in_bytes": 110888420089,
      "total_data_set_size_in_bytes": 110888420089,
      "reserved_in_bytes": 0
    },
    "fielddata": {
      "memory_size_in_bytes": 242264,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 1409329681,
      "total_count": 922503276,
      "hit_count": 469255704,
      "miss_count": 453247572,
      "cache_size": 427032,
      "cache_count": 1752569,
      "evictions": 1325537
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 1729,
      "memory_in_bytes": 0,
      "terms_memory_in_bytes": 0,
      "stored_fields_memory_in_bytes": 0,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 0,
      "points_memory_in_bytes": 0,
      "doc_values_memory_in_bytes": 0,
      "index_writer_memory_in_bytes": 362346412,
      "version_map_memory_in_bytes": 7838854,
      "fixed_bit_set_memory_in_bytes": 4503120,
      "max_unsafe_auto_id_timestamp": -1,
      "file_sizes": {}
    },
    "mappings": {
      "total_field_count": 340,
      "total_deduplicated_field_count": 340,
      "total_deduplicated_mapping_size_in_bytes": 2008,
      "field_types": [
        {
          "name": "boolean",
          "count": 19,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "byte",
          "count": 4,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "date",
          "count": 7,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "dense_vector",
          "count": 1,
          "index_count": 1,
          "indexed_vector_count": 0,
          "indexed_vector_dim_min": 1024,
          "indexed_vector_dim_max": 0
        },
        {
          "name": "double",
          "count": 9,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "float",
          "count": 22,
          "index_count": 2,
          "script_count": 0
        },
        {
          "name": "geo_point",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "integer",
          "count": 36,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "keyword",
          "count": 90,
          "index_count": 2,
          "script_count": 0
        },
        {
          "name": "long",
          "count": 54,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "nested",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "object",
          "count": 18,
          "index_count": 2,
          "script_count": 0
        },
        {
          "name": "short",
          "count": 4,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "text",
          "count": 74,
          "index_count": 2,
          "script_count": 0
        }
      ],
      "runtime_field_types": []
    },
    "analysis": {
      "char_filter_types": [],
      "tokenizer_types": [],
      "filter_types": [],
      "analyzer_types": [
        {
          "name": "custom",
          "count": 1,
          "index_count": 1
        }
      ],
      "built_in_char_filters": [],
      "built_in_tokenizers": [
        {
          "name": "standard",
          "count": 1,
          "index_count": 1
        }
      ],
      "built_in_filters": [
        {
          "name": "asciifolding",
          "count": 1,
          "index_count": 1
        },
        {
          "name": "lowercase",
          "count": 1,
          "index_count": 1
        }
      ],
      "built_in_analyzers": [
        {
          "name": "standard_lowercase_analyzer",
          "count": 1,
          "index_count": 1
        }
      ]
    },
    "versions": [
      {
        "version": "8.6.0",
        "index_count": 2,
        "primary_shard_count": 7,
        "total_primary_bytes": 10326246824
      }
    ],
    "search": {
      "total": 2589993,
      "queries": {
        "geo_distance": 496096,
        "script_score": 128384,
        "bool": 2429507,
        "function_score": 139069,
        "terms": 2429507,
        "match": 95369,
        "constant_score": 2263578,
        "exists": 2429507,
        "range": 2207610,
        "term": 2588402,
        "nested": 3481,
        "script": 1
      },
      "sections": {
        "stored_fields": 680672,
        "query": 2588402,
        "_source": 521777,
        "docvalue_fields": 680098,
        "rescore": 139069,
        "aggs": 4481
      }
    }
  },
  "nodes": {
    "count": {
      "total": 14,
      "coordinating_only": 0,
      "data": 11,
      "data_cold": 0,
      "data_content": 0,
      "data_frozen": 0,
      "data_hot": 0,
      "data_warm": 0,
      "index": 0,
      "ingest": 0,
      "master": 3,
      "ml": 0,
      "remote_cluster_client": 0,
      "search": 0,
      "transform": 0,
      "voting_only": 0
    },
    "versions": [
      "8.6.0"
    ],
    "os": {
      "available_processors": 716,
      "allocated_processors": 716,
      "names": [
        {
          "name": "Linux",
          "count": 14
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Amazon Linux 2",
          "count": 14
        }
      ],
      "architectures": [
        {
          "arch": "aarch64",
          "count": 14
        }
      ],
      "mem": {
        "total_in_bytes": 2970705608704,
        "adjusted_total_in_bytes": 2970705608704,
        "free_in_bytes": 2408823738368,
        "used_in_bytes": 561881870336,
        "free_percent": 81,
        "used_percent": 19
      }
    },
    "process": {
      "cpu": {
        "percent": 154
      },
      "open_file_descriptors": {
        "min": 745,
        "max": 1115,
        "avg": 1016
      }
    },
    "jvm": {
      "max_uptime_in_millis": 2486934,
      "versions": [
        {
          "version": "19.0.1",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "19.0.1+10-21",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": true,
          "count": 14
        }
      ],
      "mem": {
        "heap_used_in_bytes": 150993357824,
        "heap_max_in_bytes": 381178347520
      },
      "threads": 3380
    },
    "fs": {
      "total_in_bytes": 751619276800,
      "free_in_bytes": 623281008640,
      "available_in_bytes": 623281000448
    },
    "plugins": [
      {
        "name": "ltr",
        "version": "1.5.8-es8.6.0",
        "elasticsearch_version": "8.6.0",
        "java_version": "17",
        "description": "Learning to Rank Query w/ RankLib Models",
        "classname": "com.o19s.es.ltr.LtrQueryParserPlugin",
        "extended_plugins": [],
        "has_native_controller": false,
        "licensed": false,
        "is_official": false,
        "legacy_interfaces": [
          "ActionPlugin",
          "AnalysisPlugin",
          "ScriptPlugin",
          "SearchPlugin"
        ],
        "legacy_methods": [
          "createComponents",
          "getActions",
          "getContexts",
          "getFetchSubPhases",
          "getNamedWriteables",
          "getNamedXContent",
          "getPreConfiguredTokenFilters",
          "getPreConfiguredTokenizers",
          "getQueries",
          "getRestHandlers",
          "getScriptEngine",
          "getSearchExts",
          "getSettings"
        ]
      },
      {
        "name": "analysis-icu",
        "version": "8.6.0",
        "elasticsearch_version": "8.6.0",
        "java_version": "17",
        "description": "The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.",
        "classname": "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
        "extended_plugins": [],
        "has_native_controller": false,
        "licensed": false,
        "is_official": true
      },
      {
        "name": "discovery-ec2",
        "version": "8.6.0",
        "elasticsearch_version": "8.6.0",
        "java_version": "17",
        "description": "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
        "classname": "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
        "extended_plugins": [],
        "has_native_controller": false,
        "licensed": false,
        "is_official": true
      }
    ],
    "network_types": {
      "transport_types": {
        "security4": 14
      },
      "http_types": {
        "security4": 14
      }
    },
    "discovery_types": {
      "multi-node": 14
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "rpm",
        "count": 14
      }
    ],
    "ingest": {
      "number_of_pipelines": 0,
      "processor_stats": {}
    },
    "indexing_pressure": {
      "memory": {
        "current": {
          "combined_coordinating_and_primary_in_bytes": 0,
          "coordinating_in_bytes": 0,
          "primary_in_bytes": 0,
          "replica_in_bytes": 0,
          "all_in_bytes": 0
        },
        "total": {
          "combined_coordinating_and_primary_in_bytes": 0,
          "coordinating_in_bytes": 0,
          "primary_in_bytes": 0,
          "replica_in_bytes": 0,
          "all_in_bytes": 0,
          "coordinating_rejections": 0,
          "primary_rejections": 0,
          "replica_rejections": 0
        },
        "limit_in_bytes": 0
      }
    }
  }
}

We use it for item search by building queries on indexed fields combined with LTR for relevance ranking. _source is not enabled.

When scaled like this, we handle about 10k queries per second and an indexing rate of 13k documents per second.
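As a reading aid for the cluster stats output above, here is a minimal sketch of the arithmetic behind the shard totals and the query cache hit ratio. The function names are hypothetical; the numbers are copied from the posted stats.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary has `replicas` additional copies, so 1 + replicas copies in total."""
    return primaries * (1 + replicas)


def hit_ratio(hit_count: int, total_count: int) -> float:
    """Fraction of query cache lookups that were hits."""
    return hit_count / total_count if total_count else 0.0


if __name__ == "__main__":
    # Values from the posted cluster stats: one index with 6 primaries,
    # one index with 1 primary, both with replication 10.
    large_index = total_shard_copies(6, 10)
    small_index = total_shard_copies(1, 10)
    print(large_index, small_index, large_index + small_index)  # 66 11 77

    # query_cache.hit_count / query_cache.total_count from the same output.
    print(f"query cache hit ratio: {hit_ratio(469255704, 922503276):.1%}")
```

With 11 copies of every shard and 11 data nodes, each data node holds one copy of every shard, since Elasticsearch does not place two copies of the same shard on one node.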

This would explain it, I think. By default Elasticsearch throttles shard movements so that they do not affect performance, but you are overriding the default behaviour with these two settings:

cluster.routing.allocation.node_concurrent_incoming_recoveries: 6
indices.recovery.max_bytes_per_sec: 500mb

You are basically telling Elasticsearch to ignore any possible performance impact of shard movements. If you care about that performance impact, you probably want to revert these to the defaults.
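One way to try the default throttling without editing elasticsearch.yml and restarting nodes is to set the two recovery settings through the cluster settings API, since persistent cluster settings take precedence over the configuration file. The sketch below only builds the request body. The default values (2 and 40mb) are what I understand the Elasticsearch 8.x defaults to be, so treat them as an assumption and check the documentation for your version; the function name is hypothetical.

```python
import json

# Assumed Elasticsearch 8.x defaults; verify against the docs for your version.
DEFAULT_RECOVERY_SETTINGS = {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 2,
    "indices.recovery.max_bytes_per_sec": "40mb",
}


def build_settings_body(settings: dict, scope: str = "persistent") -> str:
    """Return a JSON body for PUT _cluster/settings under the given scope."""
    if scope not in ("persistent", "transient"):
        raise ValueError(f"unknown settings scope: {scope!r}")
    return json.dumps({scope: dict(settings)}, indent=2)


if __name__ == "__main__":
    # Print the request so it can be pasted into a console or sent with any HTTP client.
    print("PUT _cluster/settings")
    print(build_settings_body(DEFAULT_RECOVERY_SETTINGS))
```

Removing the two lines from the configuration file, as discussed in this thread, achieves the same result once the nodes have been restarted.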

Thank you for the suggestion.

I tried removing these properties and unfortunately we still have latency peaks when an instance joins the cluster. It did noticeably increase the deployment time, as one would expect.

Do you maybe have any other ideas why this might be happening?

When looking at our cluster, I notice we (almost) always have all primaries on the same node. Could this be a bottleneck?

I can't think of anything other than shard recoveries that might have a performance impact when a node joins the cluster.

I would not expect that, no. Is this something that only happens when adding a node?

No, distribution is like this all the time.
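To put numbers on the primary distribution described above, a minimal sketch along these lines could count primaries per node from the rows returned by GET _cat/shards?format=json. It relies only on the `prirep` and `node` columns of that API; the function name, index name, and node names in the sample are hypothetical.

```python
from collections import Counter


def primaries_per_node(shard_rows: list) -> Counter:
    """Count primary shards per node from parsed _cat/shards?format=json rows."""
    return Counter(
        row["node"]
        for row in shard_rows
        # prirep is "p" for primaries and "r" for replicas; unassigned shards have no node.
        if row.get("prirep") == "p" and row.get("node")
    )


if __name__ == "__main__":
    # Hypothetical sample rows, shaped like the _cat/shards JSON output.
    sample = [
        {"index": "items", "shard": "0", "prirep": "p", "node": "data-1"},
        {"index": "items", "shard": "0", "prirep": "r", "node": "data-2"},
        {"index": "items", "shard": "1", "prirep": "p", "node": "data-1"},
        {"index": "items", "shard": "1", "prirep": "r", "node": "data-2"},
    ]
    for node, count in primaries_per_node(sample).most_common():
        print(node, count)
```

Running it before and after a node joins would show whether the concentration of primaries changes during the latency spike.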
