Query latency spike when a node joins the cluster

marinko · September 4, 2023, 5:29am

Hi,

We run Elasticsearch 8.6.0 with the LTR plugin on AWS EC2.

Each time a new instance (data node) joins the cluster, we see a short (< 1 min) spike in latency, with the maximum latency rising to 4-5 seconds. This happens directly after the new node joins the cluster, 5-6 minutes before the new node is ready to join the load balancer's target group. The slow requests are distributed over all the old nodes in the cluster.

Any idea why this might be happening and how to deal with it?

Thank you.

Christian_Dahlqvist · September 4, 2023, 5:59am

What is the specification of the cluster in terms of node count, configuration, instance types used, type and size of storage used?

What is the full output of the cluster stats API?

What is the use case?

What kind of load is the cluster under?

marinko · September 4, 2023, 6:49am

Currently we have ~10 data nodes and 3 master nodes. We use Graviton instances (m7g.16xlarge, m6g.16xlarge, r7g.16xlarge, r6g.16xlarge), but we noticed the same problem with AMD instances. Since our index is small, we keep it in RAM. We have 6 shards.

Part of the configuration file:

http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/localhost(:[0-9]+)?/
indices.queries.cache.size: 20%
indices.requests.cache.size: 20%
cluster.routing.allocation.node_concurrent_incoming_recoveries: 6
indices.recovery.max_bytes_per_sec: 500mb
action.auto_create_index: .watches,.triggered_watches,.watcher-history-*,.monitoring-*,logstash*,performance*,-*
thread_pool.search.queue_size: 2000
cluster.deprecation_indexing.enabled: false
indices.lifecycle.history_index_enabled: false
ingest.geoip.downloader.enabled: false
xpack.ml.enabled: false
transport.compress: true
transport.compression_scheme: lz4

The cluster stats API output:

{
  "_nodes": {
    "total": 14,
    "successful": 14,
    "failed": 0
  },
  "cluster_name": "nnnnnnnn",
  "cluster_uuid": "nnnnnnnnnnnnn",
  "timestamp": 1693809168234,
  "status": "green",
  "indices": {
    "count": 2,
    "shards": {
      "total": 77,
      "primaries": 7,
      "replication": 10,
      "index": {
        "shards": {
          "min": 11,
          "max": 66,
          "avg": 38.5
        },
        "primaries": {
          "min": 1,
          "max": 6,
          "avg": 3.5
        },
        "replication": {
          "min": 10,
          "max": 10,
          "avg": 10
        }
      }
    },
    "docs": {
      "count": 2256514,
      "deleted": 945253
    },
    "store": {
      "size_in_bytes": 110888420089,
      "total_data_set_size_in_bytes": 110888420089,
      "reserved_in_bytes": 0
    },
    "fielddata": {
      "memory_size_in_bytes": 242264,
      "evictions": 0
    },
    "query_cache": {
      "memory_size_in_bytes": 1409329681,
      "total_count": 922503276,
      "hit_count": 469255704,
      "miss_count": 453247572,
      "cache_size": 427032,
      "cache_count": 1752569,
      "evictions": 1325537
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 1729,
      "memory_in_bytes": 0,
      "terms_memory_in_bytes": 0,
      "stored_fields_memory_in_bytes": 0,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 0,
      "points_memory_in_bytes": 0,
      "doc_values_memory_in_bytes": 0,
      "index_writer_memory_in_bytes": 362346412,
      "version_map_memory_in_bytes": 7838854,
      "fixed_bit_set_memory_in_bytes": 4503120,
      "max_unsafe_auto_id_timestamp": -1,
      "file_sizes": {}
    },
    "mappings": {
      "total_field_count": 340,
      "total_deduplicated_field_count": 340,
      "total_deduplicated_mapping_size_in_bytes": 2008,
      "field_types": [
        {
          "name": "boolean",
          "count": 19,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "byte",
          "count": 4,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "date",
          "count": 7,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "dense_vector",
          "count": 1,
          "index_count": 1,
          "indexed_vector_count": 0,
          "indexed_vector_dim_min": 1024,
          "indexed_vector_dim_max": 0
        },
        {
          "name": "double",
          "count": 9,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "float",
          "count": 22,
          "index_count": 2,
          "script_count": 0
        },
        {
          "name": "geo_point",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "integer",
          "count": 36,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "keyword",
          "count": 90,
          "index_count": 2,
          "script_count": 0
        },
        {
          "name": "long",
          "count": 54,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "nested",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "object",
          "count": 18,
          "index_count": 2,
          "script_count": 0
        },
        {
          "name": "short",
          "count": 4,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "text",
          "count": 74,
          "index_count": 2,
          "script_count": 0
        }
      ],
      "runtime_field_types": []
    },
    "analysis": {
      "char_filter_types": [],
      "tokenizer_types": [],
      "filter_types": [],
      "analyzer_types": [
        {
          "name": "custom",
          "count": 1,
          "index_count": 1
        }
      ],
      "built_in_char_filters": [],
      "built_in_tokenizers": [
        {
          "name": "standard",
          "count": 1,
          "index_count": 1
        }
      ],
      "built_in_filters": [
        {
          "name": "asciifolding",
          "count": 1,
          "index_count": 1
        },
        {
          "name": "lowercase",
          "count": 1,
          "index_count": 1
        }
      ],
      "built_in_analyzers": [
        {
          "name": "standard_lowercase_analyzer",
          "count": 1,
          "index_count": 1
        }
      ]
    },
    "versions": [
      {
        "version": "8.6.0",
        "index_count": 2,
        "primary_shard_count": 7,
        "total_primary_bytes": 10326246824
      }
    ],
    "search": {
      "total": 2589993,
      "queries": {
        "geo_distance": 496096,
        "script_score": 128384,
        "bool": 2429507,
        "function_score": 139069,
        "terms": 2429507,
        "match": 95369,
        "constant_score": 2263578,
        "exists": 2429507,
        "range": 2207610,
        "term": 2588402,
        "nested": 3481,
        "script": 1
      },
      "sections": {
        "stored_fields": 680672,
        "query": 2588402,
        "_source": 521777,
        "docvalue_fields": 680098,
        "rescore": 139069,
        "aggs": 4481
      }
    }
  },
  "nodes": {
    "count": {
      "total": 14,
      "coordinating_only": 0,
      "data": 11,
      "data_cold": 0,
      "data_content": 0,
      "data_frozen": 0,
      "data_hot": 0,
      "data_warm": 0,
      "index": 0,
      "ingest": 0,
      "master": 3,
      "ml": 0,
      "remote_cluster_client": 0,
      "search": 0,
      "transform": 0,
      "voting_only": 0
    },
    "versions": [
      "8.6.0"
    ],
    "os": {
      "available_processors": 716,
      "allocated_processors": 716,
      "names": [
        {
          "name": "Linux",
          "count": 14
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Amazon Linux 2",
          "count": 14
        }
      ],
      "architectures": [
        {
          "arch": "aarch64",
          "count": 14
        }
      ],
      "mem": {
        "total_in_bytes": 2970705608704,
        "adjusted_total_in_bytes": 2970705608704,
        "free_in_bytes": 2408823738368,
        "used_in_bytes": 561881870336,
        "free_percent": 81,
        "used_percent": 19
      }
    },
    "process": {
      "cpu": {
        "percent": 154
      },
      "open_file_descriptors": {
        "min": 745,
        "max": 1115,
        "avg": 1016
      }
    },
    "jvm": {
      "max_uptime_in_millis": 2486934,
      "versions": [
        {
          "version": "19.0.1",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "19.0.1+10-21",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": true,
          "count": 14
        }
      ],
      "mem": {
        "heap_used_in_bytes": 150993357824,
        "heap_max_in_bytes": 381178347520
      },
      "threads": 3380
    },
    "fs": {
      "total_in_bytes": 751619276800,
      "free_in_bytes": 623281008640,
      "available_in_bytes": 623281000448
    },
    "plugins": [
      {
        "name": "ltr",
        "version": "1.5.8-es8.6.0",
        "elasticsearch_version": "8.6.0",
        "java_version": "17",
        "description": "Learning to Rank Query w/ RankLib Models",
        "classname": "com.o19s.es.ltr.LtrQueryParserPlugin",
        "extended_plugins": [],
        "has_native_controller": false,
        "licensed": false,
        "is_official": false,
        "legacy_interfaces": [
          "ActionPlugin",
          "AnalysisPlugin",
          "ScriptPlugin",
          "SearchPlugin"
        ],
        "legacy_methods": [
          "createComponents",
          "getActions",
          "getContexts",
          "getFetchSubPhases",
          "getNamedWriteables",
          "getNamedXContent",
          "getPreConfiguredTokenFilters",
          "getPreConfiguredTokenizers",
          "getQueries",
          "getRestHandlers",
          "getScriptEngine",
          "getSearchExts",
          "getSettings"
        ]
      },
      {
        "name": "analysis-icu",
        "version": "8.6.0",
        "elasticsearch_version": "8.6.0",
        "java_version": "17",
        "description": "The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.",
        "classname": "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
        "extended_plugins": [],
        "has_native_controller": false,
        "licensed": false,
        "is_official": true
      },
      {
        "name": "discovery-ec2",
        "version": "8.6.0",
        "elasticsearch_version": "8.6.0",
        "java_version": "17",
        "description": "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
        "classname": "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
        "extended_plugins": [],
        "has_native_controller": false,
        "licensed": false,
        "is_official": true
      }
    ],
    "network_types": {
      "transport_types": {
        "security4": 14
      },
      "http_types": {
        "security4": 14
      }
    },
    "discovery_types": {
      "multi-node": 14
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "rpm",
        "count": 14
      }
    ],
    "ingest": {
      "number_of_pipelines": 0,
      "processor_stats": {}
    },
    "indexing_pressure": {
      "memory": {
        "current": {
          "combined_coordinating_and_primary_in_bytes": 0,
          "coordinating_in_bytes": 0,
          "primary_in_bytes": 0,
          "replica_in_bytes": 0,
          "all_in_bytes": 0
        },
        "total": {
          "combined_coordinating_and_primary_in_bytes": 0,
          "coordinating_in_bytes": 0,
          "primary_in_bytes": 0,
          "replica_in_bytes": 0,
          "all_in_bytes": 0,
          "coordinating_rejections": 0,
          "primary_rejections": 0,
          "replica_rejections": 0
        },
        "limit_in_bytes": 0
      }
    }
  }
}

We use it for item search by building queries on indexed fields combined with LTR for relevance ranking. _source is not enabled.

We have about 10k queries per second when scaled like this, and an indexing rate of 13k documents per second.
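For readers skimming the stats output, a few ratios can be derived from the numbers posted above. The following is only a rough sketch: the dictionary keys are shorthand invented here (they are not API field paths), the values are copied from the cluster stats output in this post, and how to interpret each ratio is an assumption rather than a diagnosis.

```python
# Values copied from the cluster stats output above; the key names are
# shorthand made up for this sketch, not Elasticsearch field paths.
stats = {
    "docs_count": 2256514,
    "docs_deleted": 945253,
    "query_cache_hit_count": 469255704,
    "query_cache_miss_count": 453247572,
    "store_size_in_bytes": 110888420089,
    "total_primary_bytes": 10326246824,
    "heap_used_in_bytes": 150993357824,
    "heap_max_in_bytes": 381178347520,
}


def ratio(part, whole):
    """Return part / whole, or 0.0 when whole is zero."""
    return part / whole if whole else 0.0


# Share of documents that are marked deleted but not yet merged away.
deleted_fraction = ratio(
    stats["docs_deleted"], stats["docs_count"] + stats["docs_deleted"]
)

# Query cache hit rate over all cache lookups.
cache_hit_rate = ratio(
    stats["query_cache_hit_count"],
    stats["query_cache_hit_count"] + stats["query_cache_miss_count"],
)

# Total store size relative to primary data (roughly the number of copies).
copies = ratio(stats["store_size_in_bytes"], stats["total_primary_bytes"])

# Cluster-wide JVM heap usage.
heap_used_fraction = ratio(
    stats["heap_used_in_bytes"], stats["heap_max_in_bytes"]
)

print(f"deleted docs:         {deleted_fraction:.1%}")
print(f"query cache hit rate: {cache_hit_rate:.1%}")
print(f"store / primary size: {copies:.1f}x")
print(f"heap used:            {heap_used_fraction:.1%}")
```

With the posted numbers this works out to roughly 29.5% deleted documents, a query cache hit rate near 50.9%, about 10.7 times the primary size on disk (in line with 10 replicas per primary), and about 39.6% of heap in use.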

DavidTurner · September 4, 2023, 8:47am

This'd explain it, I think. By default Elasticsearch throttles shard movements so as not to affect performance, but you're overriding the default behaviour with these two settings (cluster.routing.allocation.node_concurrent_incoming_recoveries and indices.recovery.max_bytes_per_sec): you're basically telling Elasticsearch to ignore any possible performance impact of shard movements. If you care about that performance impact, you probably want to revert both settings to their defaults.
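As a minimal sketch of the revert described above, assuming the two recovery settings live in elasticsearch.yml as the posted configuration suggests, the helper below drops just those two lines from a config text so the node falls back to its built-in defaults on restart. The function name and the sample text are hypothetical; this is plain text filtering, not an Elasticsearch API.

```python
# The two recovery-related settings from the configuration posted earlier.
RECOVERY_OVERRIDES = (
    "cluster.routing.allocation.node_concurrent_incoming_recoveries",
    "indices.recovery.max_bytes_per_sec",
)


def drop_recovery_overrides(config_text):
    """Return config_text without the lines that set the recovery overrides."""
    kept = []
    for line in config_text.splitlines():
        # The setting name is everything before the first colon.
        key = line.split(":", 1)[0].strip()
        if key not in RECOVERY_OVERRIDES:
            kept.append(line)
    return "\n".join(kept)


# Hypothetical excerpt built from lines in the posted configuration.
sample = """\
indices.queries.cache.size: 20%
cluster.routing.allocation.node_concurrent_incoming_recoveries: 6
indices.recovery.max_bytes_per_sec: 500mb
thread_pool.search.queue_size: 2000
"""

cleaned = drop_recovery_overrides(sample)
print(cleaned)
```

If the same settings were also applied through the cluster settings API, they would need to be cleared there as well; the thread does not say whether that was the case.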

marinko · September 5, 2023, 7:27am

Thank you for the suggestion.

I tried removing these properties and unfortunately we still have latency peaks when an instance joins the cluster. It did noticeably increase the deployment time, as one would expect.

Do you maybe have any other ideas why this might be happening?

When looking at our cluster, I notice we (almost) always have all primaries on the same node. Could this be a bottleneck?

DavidTurner · September 5, 2023, 8:19am

I can't think of anything other than shard recoveries that might have a performance impact when a node joins the cluster.

I would not expect having all primaries on the same node to be a bottleneck, no. Is this something that only happens when adding a node?

marinko · September 5, 2023, 10:30am

No, distribution is like this all the time.
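To put numbers on the primary distribution discussed above, one option is to count primaries per node from the text returned by `GET _cat/shards?h=index,shard,prirep,node`. The sketch below assumes that four-column output and node names without spaces. The sample rows, including the index name and node names, are invented for illustration and are not taken from this cluster.

```python
from collections import Counter


def primaries_per_node(cat_shards_text):
    """Count primary shards per node from 'index shard prirep node' rows."""
    counts = Counter()
    for line in cat_shards_text.strip().splitlines():
        parts = line.split()
        if len(parts) < 4:
            # Unassigned shards have no node column; skip them.
            continue
        _index, _shard, prirep, node = parts[:4]
        if prirep == "p":
            counts[node] += 1
    return counts


# Hypothetical sample output; real output would come from the _cat/shards API.
sample = """\
items 0 p node-a
items 0 r node-b
items 1 p node-a
items 1 r node-c
items 2 p node-a
"""

for node, count in primaries_per_node(sample).most_common():
    print(node, count)
```

A single node holding every primary would show up as one row with the full primary count.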

system · October 3, 2023, 10:30am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
