Unexplainable slow queries on Elasticsearch 😔

Hi community,
I’m asking for help today because we’re experiencing huge latencies on some Elasticsearch queries, and we’re in a lack of ideas to explore to find the actual source of the trouble.

We are not looking for immediate solutions (even if we can dream), but clues we can explore to understand the issue and resolve it.

Issue:

When, most of the time, our search queries give results in less than a second, some of our queries can take more than 5 seconds, more than 10 seconds, more than 20 seconds without any particular reason.
The query is always the same, only the keywords change, and we did not identify specific keywords that could explain such huge responses times.

And here’s the most blocking thing, we are not able to reproduce the issue easily, because we are unable to identify scenarios invoking those latencies.

Search context:

We have a Spring Boot application that executes our search queries using Elasticsearch Java Client.
We use only one query (that you can find at the end of the post), and for now, the query is run only a few times per hour. So, this is definitely not load issues.

We also did manual testing using Kibana, and observed a similar behaviour.

Indexation context:

We have a partial indexation (~5% documents) once a day, then we have a full indexation once a week. It is mostly document updates, but we can also have a few creations/deletions.
Our indexations are made during not-business hours, to be sure that index requests do not impact search requests performance.

Benchmark context:

We benchmark with Locust.
We use a dictionary of ~2000 queries, built from actual queries run by our clients on the main application.
And here’s the most weird thing. In production, or when we do manual tests, we run a few queries, and we sometimes have results in 1~20 seconds..
But when we use benchmarking tools, and we stress Elasticsearch up to 50 queries per seconds, Elasticsearch responds in less than 300ms.

What we already tried, without success:

  • force_merge index, in case it was fragmentation issue
  • let Elastic generate its own document IDs (because we override document IDs as indexing phase), in case it was “advance” phase latency issue
  • preload data (index.store.preload: ["nvd", "dvd"]), in case it was warm-up issue
  • increase index.search.idle.after setting to 1 hour
  • tried to simplify our hosting setup (stop using docker, stop using Openstack volumes, and run directly on our Openstack VMs with local hard disks)

What we already tried, that seems to help:

  • we have a script that run random search query every minute (using benchmark data), to keep Elastic “warm” → it seems that keeping Elasticsearch warm reduce latencies around 2~5 seconds maximum

Specs:

This is the first time I post in this community, I’ll try to be as exhaustive as possible, don’t hesitate to point out missing details.

Version: Elasticsearch 7.6.2

Hosting:

OVH Public Cloud (Openstack)
Docker Swarm is our goal, but we tested and observed the same results without Docker. 
Debian 9.13
openjdk 16
Linux kernel 4.9.0

Sizing:

For the simplified hosting setup:
3 nodes OVH type b2-30
8 vcores per server
30GB RAM
SSD
500Mbps guaranteed/up to 2Gbps local network
These nodes were dedicated to ElasticSearch.

Index:

health status index                      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   our-index-v1.1 6yUxerMtTViMlAtOO498Jg   3   1   12830216      2976961     31.7gb           16gb

Index settings:

{
  "our-index-v1.1" : {
    "settings" : {
      "index" : {
        "refresh_interval" : "1s",
        "number_of_shards" : "3",
        "number_of_replicas" : "1",
        "store" : {
          "preload" : [
            "nvd",
            "dvd"
          ]
        },
        "uuid" : "6yUxerMtTViMlAtOO498Jg"
      }
    }
  }
}

Index mapping:

{
  "our-index-v1.1" : {
    "mappings" : {
      "properties" : {
        "matchingGroups" : {
          "properties" : {
            "address" : {
              "type" : "text"
            },
            "naming" : {
              "type" : "text"
            },
            "onlineData" : {
              "type" : "text",
              "analyzer" : "french"
            },
            "openData" : {
              "type" : "text",
              "analyzer" : "french"
            },
            "internalData" : {
              "type" : "text",
              "analyzer" : "french"
            }
          }
        },

        /* 
         * Following: Fields that are copied (copy_to) to matchingGroups up there.
         */
        "example1": {
          "type" : "text",
          "index" : false,
          "copy_to" : [
            "matchingGroups.naming"
          ]
        },
        "example2": {
          "type" : "keyword",
          "index" : false,
          "copy_to" : [
            "matchingGroups.naming"
          ]
        }
        /* 
         * ... And more fields ... 
         */
      }
    }
  }
}

Search query:

GET /our-index-v1.1/_search/template
{
  "params": {
    "value_to_search": "spécialiste en poterie"
  },
  "source": {
    "from": 0,
    "size": 100,
    "query": {
      "bool": {
        "must": [
          {
            "bool": {
              "should": [
                {
                  "match_phrase": {
                    "matchingGroups.naming": {
                      "query": "{{value_to_search}}",
                      "slop": 0,
                      "boost": 20
                    }
                  }
                },
                {
                  "match": {
                    "matchingGroups.naming": {
                      "query": "{{value_to_search}}",
                      "boost": 6
                    }
                  }
                },
                {
                  "match_phrase": {
                    "matchingGroups.openData": {
                      "query": "{{value_to_search}}",
                      "slop": 0,
                      "boost": 20
                    }
                  }
                },
                {
                  "match": {
                    "matchingGroups.openData": {
                      "query": "{{value_to_search}}",
                      "boost": 6
                    }
                  }
                },
                {
                  "match_phrase": {
                    "matchingGroups.onlineData": {
                      "query": "{{value_to_search}}",
                      "slop": 0,
                      "boost": 20
                    }
                  }
                },
                {
                  "match": {
                    "matchingGroups.onlineData": {
                      "query": "{{value_to_search}}",
                      "boost": 6
                    }
                  }
                },
                {
                  "match_phrase": {
                    "matchingGroups.internalData": {
                      "query": "{{value_to_search}}",
                      "slop": 0,
                      "boost": 20
                    }
                  }
                },
                {
                  "match": {
                    "matchingGroups.internalData": {
                      "query": "{{value_to_search}}",
                      "boost": 6
                    }
                  }
                }
              ]
            }
          }
        ],
        "must_not": [
          {
            "term": {
              "customFilter.code": {
                "value": "1000"
              }
            }
          },
          {
            "term": {
              "customFilter.code": {
                "value": "5485"
              }
            }
          },
          {
            "term": {
              "customFilter.code": {
                "value": "6540"
              }
            }
          },
          {
            "term": {
              "customFilter.code": {
                "value": "6589"
              }
            }
          },
          {
            "term": {
              "customFilter.code": {
                "value": "6599"
              }
            }
          },
          {
            "term": {
              "customFilter.code": {
                "value": "9110"
              }
            }
          }
        ]
      }
    }
  }
}

Do these happen during the day? If so, is there any correlation between these and slow queries?

No, we're doing updates only during hours when our Elastic is not used.
To be sure performances are not reduced due to those creations/deletions.

Up :arrow_up:

We really are in a dead-end, looking for new ideas :bulb: