Duplicate results

motoki · January 31, 2022, 8:45am

Elasticsearch version: 7.4

Status: We see duplicate results across the pages of a paginated search whenever the cluster has multiple data nodes, with or without a dedicated master node. The same happens when preference is set to a shard ID or a custom string. When the index was restored from a snapshot, no duplication was found.

Questions
(1) I would like to run multiple data nodes for availability while preventing duplicates in paginated search. How can I do this? (For example, by configuration or by changing the system flow.)
(2) In our verification there was no duplication when the index was restored from a snapshot. Is this behavior guaranteed by the specification? (If so, I think we could achieve it by separating the read index from the write index.)

warkolm · January 31, 2022, 9:38am

Welcome to our community!
7.4 is EOL and no longer supported, so please upgrade. 7.16 is the latest release and 8.0 is currently in alpha.

That said, you'd need to share your query and its results, along with document samples, to allow us to comment further.

motoki · February 11, 2022, 7:32am

Thanks for the reply.
We will consider the upgrade separately.

The document sample, query and result samples are shown below.
However, key names are replaced with "v_" plus a number, strings with "xxxxx", and numbers with "yyyyy".

Document Sample

{
  "_index": "xxxxx",
  "_type": "_doc",
  "_id": "xxxxx",
  "_version": 6,
  "_seq_no": 1246083,
  "_primary_term": 2,
  "found": true,
  "_source": {
    "v_1": "xxxxx",
    "v_8": yyyyy,
    "v_3": "xxxxx",
    "v_10": yyyyy,
    "v_11": yyyyy,
    "v_12": true,
    "v_7": "xxxxx",
    "v_2": "xxxxx",
    "v_5": "xxxxx",
    "v_4": "xxxxx",
    "v_9": yyyyy,
    "v_6": "xxxxx",
    "v_13": {
      "v_14": null,
      "v_15": null
    }
  }
}

Request Sample (written as a Python dict, hence True; per is the page size)

{
  "track_total_hits": True,
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "xxxxxx",
            "type": "phrase",
            "fields": [
              "v_1^1.5",
              "v_2",
              "v_3",
              "v_4",
              "v_5",
              "v_6^0.9",
              "v_7"
            ],
            "analyzer": "ja_analyzer",
            "slop": 2
          }
        }
      ],
      "filter": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "v_8": {
                        "gt": 0
                      }
                    }
                  },
                  {
                    "match": {
                      "v_8": -1
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "v_9": {
                        "gt": 0
                      }
                    }
                  },
                  {
                    "match": {
                      "v_9": -1
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "v_10": {
                        "gte": 0
                      }
                    }
                  },
                  {
                    "match": {
                      "v_10": -1
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "v_11": {
                        "gte": 0
                      }
                    }
                  },
                  {
                    "match": {
                      "v_11": -1
                    }
                  }
                ]
              }
            },
            {
              "term": {
                "v_12": True
              }
            },
            {
              "bool": {
                "should": []
              }
            },
            {
              "bool": {
                "should": []
              }
            },
            {
              "bool": {
                "should": []
              }
            }
          ]
        }
      }
    }
  },
  "from": 0,
  "size": per,
  "highlight": {
    "pre_tags": [
      "<mark>"
    ],
    "post_tags": [
      "</mark>"
    ],
    "fields": {
      "v_6": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_7": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_5": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_4": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_2": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_3": {
        "number_of_fragments": 1,
        "fragment_size": 150
      }
    }
  },
  "_source": {
    "excludes": [
      "v_6",
      "v_13"
    ]
  }
}

Result Sample

{
  "took": 207,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8557,
      "relation": "eq"
    },
    "max_score": 10.259903,
    "hits": [
      {
        "_index": "xxxxx",
        "_type": "_doc",
        "_id": "xxxxx",
        "_score": 10.259903,
        "_source": {
          "v_8": yyyyy,
          "v_7": "xxxxx",
          "v_5": "xxxxx",
          "v_3": "xxxxx",
          "v_1": "xxxxx",
          "v_4": "xxxxx",
          "v_9": yyyyy,
          "v_10": yyyyy,
          "v_12": true,
          "v_11": yyyyy,
          "v_2": "xxxx"
        },
        "highlight": {
          "v_5": [
            "xxxxx"
          ],
          "v_6": [
            "xxxxx"
          ],
          "v_3": [
            "xxxxx"
          ],
          "v_2": [
            "xxxxx"
          ]
        }
      },
      .....
    ]
  }
}

Tomo_M · February 12, 2022, 1:49pm

I think you also need to share the query you use for pagination.

motoki · February 14, 2022, 7:51am

I fetch the pages in Python as follows.

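import copy
import json
import urllib.request

per = 20  # page size
query={....}
# Placeholder, not shown in the original post: the index's _search endpoint.
url = 'http://localhost:9200/xxxxx/_search'
headers = {
    'Content-Type': 'application/json',
}

for i in range(0, 50):
    # Build a fresh request body for page i using from/size offsets.
    q = copy.deepcopy(query)
    q['from'] = i * per
    json_data = json.dumps(q).encode('utf-8')
    # Elasticsearch accepts a search body on GET, so the method is set explicitly.
    req = urllib.request.Request(
        url,
        data=json_data,
        headers=headers,
        method='GET'
    )
    # Close the connection after reading the response.
    with urllib.request.urlopen(req) as res:
        result = json.loads(res.read())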
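To quantify the duplication across the pages fetched by a loop like the one above, one option is to count how often each _id appears. The following is a minimal sketch; the helper name find_duplicate_ids is hypothetical, and it assumes each page is a parsed _search response shaped like the Result Sample.

```python
from collections import Counter


def find_duplicate_ids(pages):
    """Return the sorted _id values that occur more than once across all pages.

    `pages` is an iterable of parsed _search responses (dicts), one per page.
    """
    counts = Counter(
        hit["_id"]
        for page in pages
        for hit in page["hits"]["hits"]
    )
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)
```

To use it, collect each parsed response (for example json.loads(res.read())) into a list inside the loop and pass that list to the helper once all pages have been fetched.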

Tomo_M · February 14, 2022, 8:49am

Your query does not keep any state about the index between requests, so the sorted order is recomputed on every request.

If a refresh occurs between requests, the sorted order can change, as explained in the first paragraph of the Point in time API documentation. The Point in time API is the up-to-date way to cope with this problem.

Scroll may also help you, although it is no longer the recommended approach in 8.0.
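As a rough illustration of the Scroll suggestion, the sketch below walks all pages inside one scroll context, so every page comes from the same snapshot of the index. It is not taken from the thread: base_url, the function names, and the injectable send parameter are assumptions made for illustration, and the request shapes follow the Scroll API as I understand it (an initial _search with a scroll parameter, follow-ups to _search/scroll, then a DELETE to release the context).

```python
import copy
import json
import urllib.request


def http_json(method, url, body):
    """Send a JSON request and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as res:
        return json.loads(res.read().decode("utf-8"))


def scroll_pages(base_url, index, query, per, keep_alive="1m", send=http_json):
    """Yield one list of hits per page from a single scroll context."""
    body = copy.deepcopy(query)
    body.pop("from", None)  # scroll pages sequentially; offsets are not used
    body["size"] = per
    res = send("POST", f"{base_url}/{index}/_search?scroll={keep_alive}", body)
    scroll_id = res.get("_scroll_id")
    try:
        while res["hits"]["hits"]:
            yield res["hits"]["hits"]
            res = send(
                "POST",
                f"{base_url}/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = res.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side search context once paging is finished.
        if scroll_id:
            send("DELETE", f"{base_url}/_search/scroll", {"scroll_id": scroll_id})
```

Note that a scroll can only move forward page by page, so it suits exporting or batch processing better than a user interface where someone jumps to an arbitrary page.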

motoki · February 15, 2022, 2:21am

Sorry, to clarify, the situation this time is as follows.

(1) Single node: no duplication.
(2) Multiple nodes:
  (i) When the index is restored from a snapshot: no duplication.
  (ii) When the index is newly created: duplication occurs.

In case (ii), the index is not updated while we are searching.

What we want to know is:
(a) Why does this happen (the difference in the number of nodes, and the difference between restore and create)?
(b) How can we avoid duplication when paginating across multiple nodes?

Of course, we understand that the version we are using is old, so we are considering upgrading separately.

Tomo_M · February 15, 2022, 3:10am

(a) I'm not sure, but I suspect that getting no duplicates with from-based paging was only by chance, and it should not be relied on.

(b) This was explained in my previous post. Try Scroll.

motoki · February 17, 2022, 8:40am

Sample documents and other information can be found above.
I've also included what I'd like to know about the current situation.

If you have any advice, it would be greatly appreciated.

motoki · February 25, 2022, 6:59am

I would be grateful for any advice anyone can give me.
Paginated search is a hard requirement for this project.

dadoonet · February 25, 2022, 7:12am

IMHO you should look at the Point in time API (Elasticsearch Guide [8.0]) for consistent pagination.
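A sketch of what paging against a point in time could look like is given below. It only builds the request body for each page; the function name pit_page_body is hypothetical. It assumes a cluster newer than the 7.4 in this thread (to my knowledge the Point in time API arrived in 7.10 and the _shard_doc tiebreaker in 7.12), that a PIT id has already been obtained by opening a point in time on the index, and that each body is sent to _search without an index name in the path.

```python
import copy


def pit_page_body(query, pit_id, per, last_sort=None, keep_alive="1m"):
    """Build the body of one search request bound to a point in time.

    `last_sort` is the `sort` array of the last hit on the previous page,
    or None for the first page.
    """
    body = copy.deepcopy(query)
    body.pop("from", None)  # search_after replaces offset-based paging
    body["size"] = per
    body["pit"] = {"id": pit_id, "keep_alive": keep_alive}
    # Sort by score, then by a tiebreaker so equal scores keep a fixed order.
    body["sort"] = [{"_score": "desc"}, {"_shard_doc": "asc"}]
    if last_sort is not None:
        body["search_after"] = list(last_sort)
    return body
```

The design idea is that the point in time pins every page to the same view of the index, while the tiebreaker in the sort removes ambiguity between hits with equal scores, so a document should not slide from one page onto the next.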

system · March 25, 2022, 7:13am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
