Possible bug with deep pagination

We observed the following issue in production.

QUERY: Deep pagination with a date filter and sorting on a particular field for an index. The index has 1 primary shard and two replicas (3 copies total). Total shards = 3. ES version 5.1.1.

ISSUE: Duplicate documents returned across pages.
e.g. if the page size is 100, page 1 and page 2 will return common documents. We can reproduce the issue intermittently, which tells me that each shard copy returns results in a different sort order. Please note that no documents were added to the index during this time.

QUERY for Page 40:

{
 "from" : 4000,
  "size" : 100,
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            {
              "range" : {
                "endDate" : {
                  "from" : "2016-01-01T00:00:00.000Z",
                  "to" : "2017-01-01T00:00:00.000Z",
                  "include_lower" : true,
                  "include_upper" : true,
                  "boost" : 1.0
                }
              }
            },
            {
              "bool" : {
                "must" : [
                  {
                    "range" : {
                      "indexDate" : {
                        "from" : null,
                        "to" : null,
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  },
                  {
                    "bool" : {
                      "must_not" : [
                        {
                          "exists" : {
                            "field" : "mandatoryField",
                            "boost" : 1.0
                          }
                        }
                      ],
                      "disable_coord" : false,
                      "adjust_pure_negative" : true,
                      "boost" : 1.0
                    }
                  }
                ],
                "disable_coord" : false,
                "adjust_pure_negative" : true,
                "boost" : 1.0
              }
            }
          ],
          "disable_coord" : false,
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "documentId"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "recordDate" : {
        "order" : "asc"
      }
    }
  ],
  "ext" : { }
}

QUERY for Page 41:

{
 "from" : 4100,
  "size" : 100,
  "query" : {
    "constant_score" : {
      "filter" : {
        "bool" : {
          "must" : [
            {
              "range" : {
                "endDate" : {
                  "from" : "2016-01-01T00:00:00.000Z",
                  "to" : "2017-01-01T00:00:00.000Z",
                  "include_lower" : true,
                  "include_upper" : true,
                  "boost" : 1.0
                }
              }
            },
            {
              "bool" : {
                "must" : [
                  {
                    "range" : {
                      "indexDate" : {
                        "from" : null,
                        "to" : null,
                        "include_lower" : true,
                        "include_upper" : true,
                        "boost" : 1.0
                      }
                    }
                  },
                  {
                    "bool" : {
                      "must_not" : [
                        {
                          "exists" : {
                            "field" : "mandatoryField",
                            "boost" : 1.0
                          }
                        }
                      ],
                      "disable_coord" : false,
                      "adjust_pure_negative" : true,
                      "boost" : 1.0
                    }
                  }
                ],
                "disable_coord" : false,
                "adjust_pure_negative" : true,
                "boost" : 1.0
              }
            }
          ],
          "disable_coord" : false,
          "adjust_pure_negative" : true,
          "boost" : 1.0
        }
      },
      "boost" : 1.0
    }
  },
  "_source" : {
    "includes" : [
      "documentId"
    ],
    "excludes" : [ ]
  },
  "sort" : [
    {
      "recordDate" : {
        "order" : "asc"
      }
    }
  ],
  "ext" : { }
}

Responses for pages 40 and 41 will contain common documents. We had to run these queries in a loop to reproduce the issue.

WHAT WE TRIED:

  1. After reducing the number of replicas to 0, we couldn't reproduce the issue.
  2. Then we increased the number of replicas to 2 and we still couldn't reproduce the issue.

This leads us to believe that the shard copies somehow develop a discrepancy when an index is built up over a period of time with more than one replica. Please let me know if this is a known issue, because this would be a pretty serious bug.

I think I know what's going on here, but it might take a little explaining, so here goes...

When indexing, each copy of a shard (primary and replicas) is sent the document to index but is left to actually perform the index operation and subsequent merge operations itself. This means that although the documents on each shard copy are the same, the segments on disk will differ. These differences are due to refreshes happening at different times on the primary and replicas, the segments selected for merge being slightly different, and/or documents marked for deletion still existing in segments on one copy but not on another.

You are sorting by a field value, but in the case where multiple hits have the same value for that field a tie-breaker is used. This tie-breaker effectively amounts to the internal doc id assigned by Lucene to the document (an int). Because the segments may differ between shard copies, this internal doc id may be different for the same document on different shard copies, and so can lead to differences in the sort order on each shard copy.

The reason reducing the replicas to zero solves the issue is that with a single shard copy there is no difference in doc ids between copies, as there is only one copy to consider. When you increased the replicas again, the primary shard's segments were copied to the new replicas, so all the segments were exactly the same between the primary and the replicas, and you didn't have the problem of differing internal doc ids.

So this is not a bug with pagination but a consequence of the design choice for how Elasticsearch replicates the index (document-level replication rather than segment-level).

One way to avoid hitting this is to add a second sort criterion on a unique field in your document to act as your own tie-breaker, as in the sketch below.
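
For example, a minimal sketch assuming your documentId field is unique per document and sortable (e.g. a keyword or numeric field with doc values, which is an assumption about your mapping):

"sort" : [
  {
    "recordDate" : {
      "order" : "asc"
    }
  },
  {
    "documentId" : {
      "order" : "asc"
    }
  }
]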

Deep paging with from and size in the search API is also discouraged because it gets very costly. You might want to look into search_after or scroll searches instead.
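
With search_after you drop "from" and instead pass the sort values of the last hit of the previous page. A rough sketch (the sort values here are made up for illustration, and the unique tie-breaker sort is still needed to make it reliable):

{
  "size" : 100,
  "query" : { ... same filtered query as above ... },
  "sort" : [
    { "recordDate" : { "order" : "asc" } },
    { "documentId" : { "order" : "asc" } }
  ],
  "search_after" : [ 1480603193000, "last-documentId-from-previous-page" ]
}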

If your index is read-only, using the user name or session id as a preference could help, as it will make sure that all pages for a given session are always served by the same shard copies. https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-request-preference.html
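
For example (a sketch; the index name and session id are made up, and any custom string works as long as the same one is sent for every page of the session):

GET /myindex/_search?preference=session_abc123
{
  "from" : 4000,
  "size" : 100,
  ...
}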

Colin,

Thank you so much for the explanation. We are already planning to switch to scroll / search_after. I guess I have a few follow-up questions:

  1. How long does it take for a shard to perform the index and subsequent merge operations (ms, seconds, hours)? After the index/merge operations are complete, could each shard copy still return different documents because of differing internal doc ids between the shard copies?
  2. By refreshes, do you mean insert, update, and delete operations?
  3. Would scroll / search_after guarantee the same results from each shard?

The reason I asked these questions is that we suspected ongoing refresh operations could impact the results returned for deep pagination. Hence, we ensured that no insert, delete, or update operations were in progress, just to test our theory. We couldn't stop reads since the data was live in production.

Adrien,

Preference sounds like a great option to use, though our index is not read-only. It is constantly updated. Basically, we perform OLTP and OLAP on an index in real time.

The index operation on the primary and replicas is performed synchronously with the index request, so when the index request returns, the document is indexed on all shard copies. The merge operation is performed independently on each shard copy in the background, when the shard copy determines that a merge is required. There is no set time for merges to happen, and it's probable that the resulting segments will never be exactly the same between the shard copies (although the actual documents in each shard copy will be the same, they will be distributed differently among each copy's segments). This is because replication is document-based, not segment-based.

No, I mean the refresh operation, which writes the current in-memory indexing buffer to disk. The default refresh interval is 1 second, but this is managed separately on each shard copy, so the exact moment when a refresh occurs will differ between shard copies, resulting in slightly different segments on each. See here for more information on refreshes: Near Real-Time Search | Elasticsearch: The Definitive Guide [2.x] | Elastic
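
For reference, the refresh interval can also be adjusted per index if it helps while testing (a sketch; the index name is made up):

PUT /myindex/_settings
{
  "index" : {
    "refresh_interval" : "30s"
  }
}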

Scroll would guarantee that there are no duplicate documents between pages of results, as the scroll returns to the same shard copies on each scroll request. It is also a point-in-time view of the index, even between scroll requests for the same scroll, so it is not sensitive to indexing while the scroll is in progress. It is, however, not suitable to expose to end users and is intended more for batch processing jobs.
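
Roughly, a scroll search looks like this (a sketch; the index name is made up, and the scroll_id comes from each previous response):

POST /myindex/_search?scroll=1m
{
  "size" : 100,
  "query" : { ... same filtered query as above ... },
  "sort" : [
    { "recordDate" : { "order" : "asc" } }
  ]
}

POST /_search/scroll
{
  "scroll" : "1m",
  "scroll_id" : "<scroll_id from the previous response>"
}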

