Seeking Assistance with Syncing and Search Latency Issues in Elasticsearch

Hi everyone,

We’re currently facing some serious challenges with our Elasticsearch deployment, and we’re hoping to get some advice or suggestions from the community. Below is the technical context of our situation and what we’ve tried so far. We would really appreciate any insights or ideas that might help us improve performance.

Context:

We are working with a large dataset, consisting of millions of complex documents. These documents contain multiple levels of nested fields and also use parent-child relationships. Additionally, we are mixing stale data (pre-indexed and less frequently updated) with real-time customer data. It’s essential that our search queries reflect the customer data in real time, even when combined with stale data, allowing users to perform complex queries using both.

The Problem:

  1. Syncing Latency:
    Syncing documents into Elasticsearch is extremely slow, particularly when syncing large batches of stale data. This is especially painful when real-time customer data must be synced alongside the stale data: the load from the stale sync causes the customer data to suffer increased delays.

  2. Search Latency:
    To avoid syncing latency, we tried not syncing the customer data and instead injecting the data directly into the query itself. In this approach, we pre-calculate which documents need to be preselected and apply further queries on top of those. However, this method has led to significant latencies when performing searches, especially across large datasets. For example, when searching across a set of 10,000 items, query times are typically between 1 and 3 seconds. We need to ensure that users can execute mixed queries on stale and real-time data with sub-second latency, but this has proven difficult to achieve.

What We’ve Tried:

  • Turning Off Data Refresh During Stale Syncs:
    We attempted to turn off automatic refresh during large stale data syncs to reduce latency. However, because we still forced refreshes for customer data to ensure real-time availability, the stale data that was still staging got refreshed along with it, and latencies remained high. This workaround did not provide the improvement we had hoped for.

  • Query Expansion:
    We have tried injecting large sets of data (e.g., IDs and filters) directly into terms queries, but ran into the limitation of the number of terms we can inject. While this limit can be customized in the cluster configuration, there is still an upper bound that restricts scalability. We also experimented with terms lookup, which did reduce latency, but the limitation on the number of terms remains a challenge.

  • Scoring Mechanisms:
    Since we need to sort by customer data, when injecting customer data into the query, we also had to customize the document score for proper sorting. We tried using function score and script score mechanisms, but both approaches were slow, with latencies exceeding 1 second for 10,000 documents. This continues to be a major bottleneck.
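
For reference, the terms limit mentioned above is, to our understanding, the dynamic index setting `index.max_terms_count` (65,536 by default). A minimal sketch of raising it (illustrative only, not our production code; the index name would be supplied in the request path):

```python
# Illustrative sketch: raising the per-query terms limit via the dynamic
# index setting `index.max_terms_count`. This builds the request body for
# PUT /<index>/_settings; sending it still requires a client call.
def raise_terms_limit_body(new_limit: int) -> dict:
    """Build the settings-update body for PUT /<index>/_settings."""
    return {"index": {"max_terms_count": new_limit}}

body = raise_terms_limit_body(100_000)
# e.g. es.indices.put_settings(index="companies", body=body)
```

Even with a raised limit there is still an upper bound, which is why this only postpones the scalability problem rather than solving it.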

Our Goal:

We’re looking for technical suggestions on how to optimize both the syncing and querying processes given the complexity of our dataset. Specifically:

  • Are there alternative approaches to injecting large amounts of customer-specific data into queries while maintaining fast performance?
  • How can we optimize indexing and searching for documents with multiple levels of nested fields and parent-child relationships?
  • Are there best practices for handling mixed queries that combine stale and real-time customer data without impacting performance?

We appreciate any insights or suggestions from the community and look forward to your thoughts!

You have not told us anything about the cluster, so it would be helpful if you could answer the following questions:

  1. Which version of Elasticsearch are you using?
  2. How many indices and shards is your data set indexed into? What is the total size of all shards in the cluster?
  3. How many nodes do you have in the cluster? How are these configured?
  4. What is the specification of the nodes in the cluster? What type of hardware is the cluster deployed on?
  5. What type of storage are you using? It sounds like a lot of what you are doing could result in high IOPS so local fast SSDs would be recommended.

Nested fields and parent-child relationships do come with a lot of overhead and often result in longer query latencies. They can also add overhead at indexing time.

Elasticsearch is not designed nor optimised to be a real-time search engine. Are you updating individual documents or are you executing updates in large bulk requests? As you mention real-time as a requirement, does this by any chance mean that you are refreshing after each update or indexing request?

Frequently updating or adding/deleting from large nested documents can add a lot of overhead as the full document needs to be reindexed. It would be useful here to know how you have separated the static parts from the more dynamic ones that are updated in near real time.

As you seem to be using a lot of features that come with significant overhead I suspect it might be necessary to review how you are modelling your data, so additional details around the document/index structure and query patterns would be useful.

Thanks @Christian_Dahlqvist for your reply!

  1. Which version of Elasticsearch are you using?

8.4.1

  2. How many indices and shards is your data set indexed into? What is the total size of all shards in the cluster?

There are two main, independent indices that this issue applies to:

  • Index A:
    • Total size: 813.3gb
    • 32 shards with 1 replica
  • Index B:
    • Total size: 2tb
    • 32 shards with 1 replica
  3. How many nodes do you have in the cluster? How are these configured?

There are 4 nodes. What configuration specifically do you need?

  4. What is the specification of the nodes in the cluster? What type of hardware is the cluster deployed on?

The cluster is deployed to Elastic Cloud. The hardware of the nodes is:

  • CPU optimized
  • 256GB RAM
  • 128 vCPU
  5. What type of storage are you using? It sounds like a lot of what you are doing could result in high IOPS so local fast SSDs would be recommended.

Our Elastic Cloud deployment is hosted on GCP, which I believe uses SSDs.

Elasticsearch is not designed nor optimised to be a real-time search engine

Yeah, we are coming to this realization too.

Are you updating individual documents or are you executing updates in large bulk requests?

We are doing bulk requests.
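
For clarity, this is roughly how we shape documents into `_bulk` actions (a simplified sketch; the index and field names are placeholders, not our real mapping):

```python
import json

# Simplified sketch of building the NDJSON payload for POST /_bulk.
# Index and field names are placeholders, not our actual schema.
def to_bulk_payload(index_name: str, docs: list) -> str:
    """Return the newline-delimited action/source pairs for a _bulk request."""
    lines = []
    for doc in docs:
        # action metadata line, followed by the document source line
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline
```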

As you mention real-time as a requirement, does this by any chance mean that you are refreshing after each update or indexing request?

Yes, we refresh after the updates related to real-time data. The stale data batches turn off refresh before the bulk update, then turn it back on afterwards.
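
To make that concrete, these are sketches (illustrative, not our production code) of the settings bodies we send around a stale batch:

```python
# Illustrative sketch of the refresh toggling described above. Both dicts
# are request bodies for PUT /<index>/_settings.
def disable_refresh_body() -> dict:
    # "-1" disables automatic refresh entirely for the duration of the bulk sync
    return {"index": {"refresh_interval": "-1"}}

def restore_refresh_body(interval: str = "1s") -> dict:
    # "1s" is Elasticsearch's default refresh interval
    return {"index": {"refresh_interval": interval}}
```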

It would here be useful to know how you have separated the static parts from the more dynamic ones that are updated in near real-time.

In our mapping, the real-time data is isolated into child documents related to the parent document, which represents the stale data.
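
Roughly, the mapping follows this shape (the field and relation names here are illustrative placeholders, not our actual specs):

```python
# Illustrative join-field mapping: stale "company" parents with real-time
# "customer_data" children. All names are placeholders for the sketch.
mapping = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "doc_relation": {
                "type": "join",
                "relations": {"company": "customer_data"},
            },
        }
    }
}
```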

I suspect it might be necessary to review how you are modelling your data, so additional details around the document/index structure and query patterns would be useful.

I would like to avoid sharing the mapping specs publicly, but this is an example of a query that injects the document IDs as terms in the query to avoid syncing real-time data while still supporting sorting:

def generate_custom_query_using_script_score(id_score_map):
    query = {
        "_source": ["id"],
        "query": {
            "script_score": {
                "query": {
                    "bool": {
                        "must": [
                            {"terms": {"id": list(id_score_map.keys())}},
                        ]
                    }
                },
                "script": {
                    "params": {"scores": id_score_map},  # key must match params.scores in the script below
                    "source": """
                        def company_id = String.valueOf(doc['id'].value);
                        def score = params.scores.get(company_id);
                        if (score == null) {
                            return 0;
                        }
                        return score;
                    """,
                },
            }
        },
        "sort": [{"_score": {"order": "desc"}}],
        "from": 0,
        "size": 25,
    }
    return query

We also tried a similar approach using terms lookup:

def generate_custom_query_using_terms_lookup(
    watchlist_id: str, association_scores: dict
):
    return {
        "query": {
            "function_score": {
                "query": {
                    "terms": {
                        "id": {
                            "index": "company_watchlist_associations",
                            "id": watchlist_id,
                            "path": "company_ids",
                        }
                    }
                },
                "script_score": {
                    "script": {
                        "source": """
                        def company_id = String.valueOf(doc['id'].value);
                        def score = params.scores.get(company_id);
                        if (score == null) {
                            return 0;
                        }
                        return score;
                        """,
                        "params": {"scores": association_scores},
                    }
                },
                "boost_mode": "replace",
            }
        },
        "_source": ["id"],
        "sort": [{"_score": {"order": "desc"}}],
        "from": 0,
        "size": 25,
    }

Note that those association dictionaries can contain more than 10K entries to be injected into the queries.

I would recommend you upgrade to the latest version to get any potential performance improvements.

Passing in large numbers of terms and sorting using scripts can also be quite slow. This may be the first use case I have come across in a long time that uses almost all of the features I try to avoid because they slow down queries, with the exception of wildcard queries.

I do not see anything here that could be done to give a meaningful and immediate performance boost, so I suspect you would need to rethink the data model and associated queries. That does, however, require an in-depth analysis that is quite outside the scope of this forum.

Thanks @Christian_Dahlqvist for your help

@Christian_Dahlqvist we are considering creating our own Elasticsearch Java plugin to customize the score calculation. Do you think that would significantly improve the performance of the score calculation in those queries? Do you have detailed information on how to develop an Elasticsearch Java plugin other than this documentation here?

I do not know to what extent the score calculation is affecting latencies. Given the features you are using I would be surprised if that was a major contributor. Before diving into this I would recommend profiling the query to see which parts are contributing the most.
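
As a sketch of what I mean by profiling: the search body can be wrapped so the Profile API returns a per-component timing breakdown (the client call in the comment is illustrative):

```python
# Sketch: enabling the Profile API on an existing search body so the
# response includes a timing breakdown for each query component.
def with_profiling(search_body: dict) -> dict:
    """Return a copy of a search body with `profile` enabled."""
    profiled = dict(search_body)  # shallow copy; enough for a top-level flag
    profiled["profile"] = True
    return profiled

# e.g. es.search(index="companies", body=with_profiling(query))
```

The `profile` section of the response should show whether the terms matching or the script scoring dominates the latency.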