What are Elasticsearch querying guarantees?

We're looking into extending our use of Elasticsearch to new use cases, and I want to get some clarity on what Elasticsearch can and cannot do. The area I am most concerned with is the data retrieval guarantees for queries. I specifically remember reading a while ago that Elasticsearch will, under certain conditions (e.g. a low search score), drop data from search results. However, I cannot find any documentation about this behavior. The use case we are looking at is extracting all records that match a given set of constant score queries for reporting purposes. The data sets can range from a couple of thousand documents to tens of millions. Ignoring performance and resource concerns, our requirement is that if a document exists in Elasticsearch, any paging or scrolling search query will eventually return every document that matches the provided criteria. If, due to the nature of how Elasticsearch works, some search results may not be returned, that is important for us to know.

So my questions are: am I incorrect in my assumption that, in some situations, Elasticsearch does not guarantee that a given search query will return all matching documents? And where can I find documentation that supports and explains this?


Elasticsearch's sweet spot for queries that have to touch all the matches is when you can use an aggregation to do the summarizing on the server side. Aggregations can generally be built using a column store built from the documents, are generally quite quick, and have the "see all the documents" guarantee you want. Some aggregations trade accuracy for speed or memory, but those are well documented. If you can't do it with an aggregation, you can use a scroll search. It will also see all of the documents, but you'll have to pull them all back batch-wise, so it'll be slower.
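To make the aggregation route concrete, here is a minimal sketch of a request body that summarizes matches on the server instead of returning them. The function name and the field names in the usage lines (`status`, `customer_id`) are hypothetical; the `constant_score`, `term`, and `terms` constructs are standard query DSL. Note that `terms` is one of the aggregations whose counts can be approximate when there are more distinct values than the requested bucket size.

```python
import json


def build_report_aggregation(filter_field, filter_value, group_field, bucket_limit=100):
    """Build a search body that summarizes matches server-side.

    All field names passed in are placeholders for whatever your mapping uses.
    """
    return {
        # size 0: return no hits at all, only the aggregation results.
        "size": 0,
        # A constant_score filter matches documents without computing relevance.
        "query": {
            "constant_score": {
                "filter": {"term": {filter_field: filter_value}}
            }
        },
        # Count matching documents per distinct value of group_field.
        "aggs": {
            "by_group": {
                "terms": {"field": group_field, "size": bucket_limit}
            }
        },
    }


if __name__ == "__main__":
    body = build_report_aggregation("status", "shipped", "customer_id")
    print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint; how you send it (client library or plain HTTP) is left open here.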

The thing you are thinking of, I think, is that Elasticsearch's standard queries only allow you to page so far into the results, and they don't get an unchanging snapshot of the index. Scrolls don't have those problems, but they are more expensive to maintain because the snapshot has to be held open for longer.
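As a rough sketch of what pulling everything back batch-wise with a scroll looks like: open the scroll with the initial search, keep requesting the next batch until one comes back empty, then release the snapshot. The `send(method, path, body)` callable is an assumption made for illustration (any HTTP client that returns the decoded JSON response would do); the endpoints and the `_scroll_id` / `hits.hits` response fields are those of the scroll API.

```python
def scroll_all(send, index, query, batch_size=1000, keep_alive="1m"):
    """Yield every hit matching `query`, one batch at a time, via the scroll API.

    `send(method, path, body)` is a hypothetical callable that performs the
    HTTP request and returns the decoded JSON response as a dict.
    """
    response = send(
        "POST",
        f"/{index}/_search?scroll={keep_alive}",
        # Sorting by _doc skips scoring; it is the cheapest order for a full export.
        {"size": batch_size, "query": query, "sort": ["_doc"]},
    )
    scroll_id = response.get("_scroll_id")
    try:
        # An empty batch means the scroll is exhausted.
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = send(
                "POST",
                "/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            # The scroll id may change between calls; always use the latest one.
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Release the snapshot promptly so the cluster can free its resources.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Because the scroll works against a snapshot taken when it is opened, documents indexed after that point will not show up in the export; the `keep_alive` value only needs to cover the time between two consecutive batches, not the whole run.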

Thank you for your quick response. In this use case, aggregations will not work for us. I think in our case, the scrolling API is what we are going to need.

The semi-specific thing I was thinking of, in terms of Elasticsearch not returning all data for queries, was actually two things:

  • When indexing data and then querying for it, I know there is a latency between when the data is inserted and when it is fully queryable, depending on many conditions. For example, when dealing with time series data, querying the head can be inconsistent depending on which node has the data versus which node is queried. There have been many times when testing ES that we got "quantum" results: I know I've inserted 3 documents, but when I repeatedly query for them, for a period of time the query sometimes returns only 2 documents and other times returns 3, seemingly at random, until the documents have been fully indexed in all shards (primary and replicas). This is not an issue for our current use case.
  • I read somewhere (I cannot find the document again, so take this with a grain of salt) that, due to the clustered nature of ES, a coordinating node spreads the query out to the various nodes holding the data and then aggregates the results. This is why paginated searches over sorted data can be expensive: data from all the nodes needs to be gathered on the coordinating node and sorted before being reduced. What I remember reading, similar to the first bullet point, was that some nodes, when producing scores for documents against given search criteria, MAY drop a document that would have produced a low score before returning results to the coordinating node. There could be different reasons for dropping the record, but one I remember is that if scoring all the documents on a node was going to take longer than the configured timeout period, the node would skip calculating scores for documents it knew were going to score low. And if a node doesn't return all documents, then the aggregated results returned from the coordinating node could also be missing the omitted data.
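Whether or not the low-score behavior described above is accurate, a search response does carry fields that flag an incomplete result: a top-level `timed_out` flag, a `terminated_early` flag when `terminate_after` is used, and a `_shards` section with a `failed` count. A minimal sketch of a guard that refuses to treat such a response as complete follows; the function name is hypothetical, and it only inspects the response, it does not prove completeness in any stronger sense.

```python
def assert_complete(response):
    """Raise if a search response admits that it may be missing matches."""
    problems = []
    # Set when the request-level timeout expired before all shards finished.
    if response.get("timed_out"):
        problems.append("search timed out; only hits collected so far were returned")
    # Present and true when terminate_after cut collection short.
    if response.get("terminated_early"):
        problems.append("collection stopped early (terminate_after)")
    # Shards that failed contribute no hits to the merged result.
    shards = response.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed:
        problems.append(f"{failed} of {shards.get('total', '?')} shards failed")
    if problems:
        raise RuntimeError("partial search results: " + "; ".join(problems))
    return response
```

For a reporting export, calling this on every batch before using its hits turns a silent gap into a hard failure. If your version supports the `allow_partial_search_results=false` request parameter, that asks the server to return an error instead of partial results.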

Both of these led me down the path of searching for any documented behavior or guarantees around the various querying options in ES.

Does any of this make sense?

