Hi everyone,
We’re currently facing some serious challenges with our Elasticsearch deployment, and we’re hoping to get some advice or suggestions from the community. Below is the technical context of our situation and what we’ve tried so far. We would really appreciate any insights or ideas that might help us improve performance.
Context:
We are working with a large dataset, consisting of millions of complex documents. These documents contain multiple levels of nested fields and also use parent-child relationships. Additionally, we are mixing stale data (pre-indexed and less frequently updated) with real-time customer data. It’s essential that our search queries reflect the customer data in real time, even when combined with stale data, allowing users to perform complex queries using both.
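To give a sense of the document shape, here is a minimal mapping sketch (assuming the 8.x Python client; index and field names are placeholders, not our real mapping) with multi-level nested fields and a join field for the parent-child relationship:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Simplified mapping: two levels of nested fields plus a join field
# for the parent-child relationship (all names are placeholders).
es.indices.create(
    index="catalog",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "attributes": {                      # first nesting level
                "type": "nested",
                "properties": {
                    "name": {"type": "keyword"},
                    "values": {                  # second nesting level
                        "type": "nested",
                        "properties": {
                            "value": {"type": "keyword"},
                            "unit": {"type": "keyword"},
                        },
                    },
                },
            },
            "doc_relation": {                    # parent-child via join field
                "type": "join",
                "relations": {"product": "variant"},
            },
        }
    },
)
```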
The Problem:
- Syncing Latency: Syncing documents into Elasticsearch is extremely slow, particularly when syncing large batches of stale data. This is especially painful when real-time customer data must be synced alongside the stale data: the customer data then suffers increased delays because of the load generated by the stale sync.
- Search Latency: To avoid the syncing latency, we tried not syncing the customer data at all and instead injecting it directly into the query. In this approach, we pre-calculate which documents should be preselected and apply further queries on top of those (see the sketch after this list). However, this method has led to significant search latencies, especially across large datasets: searching across a set of 10,000 items typically takes 1 to 3 seconds. We need users to be able to execute mixed queries over stale and real-time data with sub-second latency, and this has proven difficult to achieve.
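To make the query-injection approach concrete, here is a minimal sketch (assuming the 8.x Python client; index and field names are placeholders): the matching IDs are pre-calculated from the real-time customer data and injected as a filter, and the user's query runs on top of that preselection.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Pre-calculated outside Elasticsearch from the real-time customer data;
# in practice this can be thousands of IDs.
preselected_ids = ["item-1", "item-7", "item-42"]

resp = es.search(
    index="catalog",
    query={
        "bool": {
            "must": [
                {"match": {"title": "user search terms"}}   # the user's query
            ],
            "filter": [
                {"ids": {"values": preselected_ids}}         # injected customer preselection
            ],
        }
    },
)
print(resp["hits"]["total"])
```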
What We’ve Tried:
- Turning Off Data Refresh During Stale Syncs: We tried disabling automatic refresh during large stale-data syncs to reduce latency. However, we force refreshes for customer data to ensure real-time availability, and those refreshes ended up also syncing the stale data that was still in staging, leading to large latencies (rough sketch after this list). This workaround did not provide the improvement we had hoped for.
- Query Expansion: We have tried injecting large sets of data (e.g., IDs and filters) directly into terms queries, but ran into the limit on the number of terms a query may contain. This limit can be raised via the index settings (index.max_terms_count), but there is still an upper bound that restricts scalability. We also experimented with terms lookup, which did reduce latency, but the limit on the number of terms remains a challenge (example after this list).
- Scoring Mechanisms: Since we need to sort by customer data, injecting that data into the query also means customizing the document score so results sort correctly. We tried both the function_score and script_score mechanisms, but both approaches were slow, with latencies exceeding 1 second for 10,000 documents (sketch after this list). This continues to be a major bottleneck.
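For clarity, here is roughly how the refresh workaround looks (a minimal sketch, again assuming the 8.x Python client; index names and the sample batch are placeholders): refresh is disabled while the stale batch is bulk-loaded, but customer writes still force a refresh so they become searchable immediately, which is where the two workloads collide.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# 1. Disable automatic refresh while bulk-loading the stale batch.
es.indices.put_settings(index="catalog", settings={"index": {"refresh_interval": "-1"}})

# Placeholder stale batch; in practice this streams from our upstream store.
stale_batch = [
    {"id": "item-1", "title": "stale doc 1"},
    {"id": "item-2", "title": "stale doc 2"},
]
stale_actions = (
    {"_index": "catalog", "_id": doc["id"], "_source": doc}
    for doc in stale_batch
)
helpers.bulk(es, stale_actions)

# 2. Restore the refresh interval once the batch is done.
es.indices.put_settings(index="catalog", settings={"index": {"refresh_interval": "1s"}})

# 3. Customer writes force visibility; the refresh also makes any stale
#    documents indexed so far visible, which defeats step 1.
es.index(index="catalog", id="cust-123", document={"customer_id": 123}, refresh="wait_for")
```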
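The terms lookup variant looks roughly like this (sketch; index, document, and field names are placeholders): the per-customer selection is stored as a single document and the terms query points at it, so the IDs are not shipped with every search, but the terms-count ceiling (index.max_terms_count) still applies.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The per-customer selection lives in its own document...
es.index(
    index="customer-selections",
    id="customer-42",
    document={"item_ids": ["item-1", "item-7", "item-42"]},
)

# ...and searches reference it via a terms lookup instead of inlining the IDs.
resp = es.search(
    index="catalog",
    query={
        "bool": {
            "must": [{"match": {"title": "user search terms"}}],
            "filter": [
                {
                    "terms": {
                        "item_id": {
                            "index": "customer-selections",
                            "id": "customer-42",
                            "path": "item_ids",
                        }
                    }
                }
            ],
        }
    },
)
```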
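And the scoring side, roughly (sketch; the item_id field and the ranks are placeholders): the per-customer ordering is passed in as script params, and script_score turns it into the document score, which is the part that gets slow at around 10,000 documents.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Per-customer ranking computed outside Elasticsearch.
customer_ranks = {"item-1": 3.0, "item-7": 2.0, "item-42": 1.0}

resp = es.search(
    index="catalog",
    query={
        "script_score": {
            "query": {"terms": {"item_id": list(customer_ranks.keys())}},
            "script": {
                # Score each hit from the injected map; unknown items score 0.
                "source": (
                    "def key = doc['item_id'].value; "
                    "return params.ranks.containsKey(key) ? params.ranks[key] : 0;"
                ),
                "params": {"ranks": customer_ranks},
            },
        }
    },
)
```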
Our Goal:
We’re looking for technical suggestions on how to optimize both the syncing and querying processes given the complexity of our dataset. Specifically:
- Are there alternative approaches to injecting large amounts of customer-specific data into queries while maintaining fast performance?
- How can we optimize indexing and searching for documents with multiple levels of nested fields and parent-child relationships?
- Are there best practices for handling mixed queries that combine stale and real-time customer data without impacting performance?
We appreciate any insights or suggestions from the community and look forward to your thoughts!