Duplicate results

Elasticsearch version: 7.4

Status: Duplicate results appear in paginated search results when there are multiple data nodes, regardless of whether there is a dedicated master node (the same happens when a shard ID or a custom string is specified in preference). When the index was restored from a snapshot, no duplication was found.

Questions
(1) I would like to run multiple data nodes for availability while preventing duplicates in paginated search. How can I do this? (A solution by configuration, system flow, etc.)
(2) During verification there was no duplication when the index was restored from a snapshot. Is this guaranteed by the specification? (If so, I think it could be achieved by separating the read and write indices.)

Welcome to our community! :smiley:
7.4 is EOL and no longer supported, please upgrade. 7.16 is latest and 8.0 is currently in alpha :slight_smile:

That said, you'd need to share your query and its results, and document samples, to allow us to comment further.

Thanks for the reply.
We will consider the upgrade separately.

The document sample, query, and result sample are shown below.
Note that key names are replaced with "v_", strings with "xxxxx", and numbers with "yyyyy".

Document Sample

{
  "_index": "xxxxx",
  "_type": "_doc",
  "_id": "xxxxx",
  "_version": 6,
  "_seq_no": 1246083,
  "_primary_term": 2,
  "found": true,
  "_source": {
    "v_1": "xxxxx",
    "v_8": yyyyy,
    "v_3": "xxxxx",
    "v_10": yyyyy,
    "v_11": yyyyy,
    "v_12": true,
    "v_7": "xxxxx",
    "v_2": "xxxxx",
    "v_5": "xxxxx",
    "v_4": "xxxxx",
    "v_9": yyyyy,
    "v_6": "xxxxx",
    "v_13": {
      "v_14": null,
      "v_15": null
    }
  }
}

Request Sample

{
  "track_total_hits": True,
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "xxxxxx",
            "type": "phrase",
            "fields": [
              "v_1^1.5",
              "v_2",
              "v_3",
              "v_4",
              "v_5",
              "v_6^0.9",
              "v_7"
            ],
            "analyzer": "ja_analyzer",
            "slop": 2
          }
        }
      ],
      "filter": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "v_8": {
                        "gt": 0
                      }
                    }
                  },
                  {
                    "match": {
                      "v_8": -1
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "v_9": {
                        "gt": 0
                      }
                    }
                  },
                  {
                    "match": {
                      "v_9": -1
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "v_10": {
                        "gte": 0
                      }
                    }
                  },
                  {
                    "match": {
                      "v_10": -1
                    }
                  }
                ]
              }
            },
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "v_11": {
                        "gte": 0
                      }
                    }
                  },
                  {
                    "match": {
                      "v_11": -1
                    }
                  }
                ]
              }
            },
            {
              "term": {
                "v_12": True
              }
            },
            {
              "bool": {
                "should": []
              }
            },
            {
              "bool": {
                "should": []
              }
            },
            {
              "bool": {
                "should": []
              }
            }
          ]
        }
      }
    }
  },
  "from": 0,
  "size": per,
  "highlight": {
    "pre_tags": [
      "<mark>"
    ],
    "post_tags": [
      "</mark>"
    ],
    "fields": {
      "v_6": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_7": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_5": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_4": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_2": {
        "number_of_fragments": 1,
        "fragment_size": 150
      },
      "v_3": {
        "number_of_fragments": 1,
        "fragment_size": 150
      }
    }
  },
  "_source": {
    "excludes": [
      "v_6",
      "v_13"
    ]
  }
}

Result Sample

{
  "took": 207,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8557,
      "relation": "eq"
    },
    "max_score": 10.259903,
    "hits": [
      {
        "_index": "xxxxx",
        "_type": "_doc",
        "_id": "xxxxx",
        "_score": 10.259903,
        "_source": {
          "v_8": yyyyy,
          "v_7": "xxxxx",
          "v_5": "xxxxx",
          "v_3": "xxxxx",
          "v_1": "xxxxx",
          "v_4": "xxxxx",
          "v_9": yyyyy,
          "v_10": yyyyy,
          "v_12": true,
          "v_11": yyyyy,
          "v_2": "xxxx"
        },
        "highlight": {
          "v_5": [
            "xxxxx"
          ],
          "v_6": [
            "xxxxx"
          ],
          "v_3": [
            "xxxxx"
          ],
          "v_2": [
            "xxxxx"
          ]
        }
      },
      .....
    ]
  }
}

I think you also need to share the query you use for pagination.

I fetch the pages in Python as follows.

import urllib.request
import json
import copy

per = 20
query={....}
headers = {
    'Content-Type': 'application/json',
}


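# Fetch 50 pages with from/size. `url` is not shown in the post; it is
# assumed to be the _search endpoint of the index.
for i in range(0, 50):
    q = copy.deepcopy(query)
    q['from'] = i * per  # offset of the first hit on this page
    json_data = json.dumps(q).encode('utf-8')
    req = urllib.request.Request(
        url,
        data=json_data,
        headers=headers,
        method='GET'
    )
    res = urllib.request.urlopen(req)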
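
To make the duplication measurable, the hits of every page can be collected and compared by "_id". The following is only a minimal sketch: duplicate_ids is a hypothetical helper name, and it assumes each page is the parsed "hits.hits" list of one response (for example json.load(res)["hits"]["hits"] in the loop above).

```python
def duplicate_ids(pages):
    """Map each _id seen more than once to the page numbers it appeared on.

    `pages` is a list of pages; each page is a list of hit dicts that
    carry an "_id" key, as in the "hits.hits" array of a search response.
    """
    seen = {}
    for page_no, hits in enumerate(pages):
        for hit in hits:
            seen.setdefault(hit["_id"], []).append(page_no)
    # Keep only ids that were returned more than once.
    return {doc_id: where for doc_id, where in seen.items() if len(where) > 1}
```

An empty result means no document was returned twice across the fetched pages.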

Your query does not preserve the state of the index between requests, so the sort order is recomputed on every query.

If a refresh occurs between queries, the sort order can change, as explained in the first paragraph of the Point in time API documentation. The Point in time API is the up-to-date way to cope with this problem.

Scroll may also help, although it is no longer a recommended feature as of 8.0.
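
As a rough illustration of the Scroll suggestion, the sketch below pages through one search context instead of re-running the query with a growing "from". It is a sketch under assumptions, not a definitive implementation: http_send and scroll_pages are hypothetical names, the base URL and index name are whatever your cluster uses, and "from" is dropped from the request body because a scroll always starts at the first hit.

```python
import json
import urllib.request


def http_send(base_url):
    """Return a send(method, path, body) callable backed by urllib."""
    def send(method, path, body=None):
        data = None if body is None else json.dumps(body).encode("utf-8")
        req = urllib.request.Request(
            base_url + path, data=data,
            headers={"Content-Type": "application/json"}, method=method)
        with urllib.request.urlopen(req) as res:
            return json.load(res)
    return send


def scroll_pages(send, index, query, per=20, keep_alive="1m"):
    """Yield pages of hits from a single scroll context."""
    body = {k: v for k, v in query.items() if k != "from"}
    body["size"] = per
    # The first request opens the scroll context and returns the first page.
    res = send("POST", f"/{index}/_search?scroll={keep_alive}", body)
    scroll_id = res["_scroll_id"]
    try:
        while res["hits"]["hits"]:
            yield res["hits"]["hits"]
            # Later pages are fetched by scroll id, not by re-running the query.
            res = send("POST", "/_search/scroll",
                       {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = res["_scroll_id"]
    finally:
        # Release the search context once the caller is done.
        send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Usage would look like: for hits in scroll_pages(http_send(base_url), index_name, query): ... where base_url and index_name are placeholders for your own values. Note that a scroll only moves forward, so it suits exporting all results more than letting a user jump to an arbitrary page.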

Sorry, to clarify, the situation is as follows.

(1) Single node: no duplication.
(2) Multiple nodes:
(i) When the index is restored from a snapshot: no duplication.
(ii) When the index is newly created: duplication occurs.

In case (ii), the index is not updated while searching.

What we want to know is:
(a) Why does this happen? (Is it the difference in the number of nodes, or the difference between restoring and creating the index?)
(b) How can we avoid duplication when paginating with multiple nodes?

Of course, we understand that the version we are using is old, so we are considering upgrading separately.

(a) I'm not sure, but I suspect that getting no duplicates with "from"-based queries was only by chance and is not something you can rely on.

(b) This is explained in my previous post. Try Scroll.

Sample documents and other information can be found above.
I've also included what I'd like to know about the current situation.

If you have any advice, it would be greatly appreciated.

I would be grateful for any advice anyone can give me.
Paginated search is a hard requirement for this project.

IMHO, you should look at the Point in time API (Elasticsearch Guide [8.0]) for consistent pagination.
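
A minimal sketch of what Point in time pagination could look like is shown below. Several things here are assumptions rather than part of this thread: the helper names http_send and search_with_pit are hypothetical, the Point in time API needs a newer Elasticsearch than the 7.4 discussed above (it was added in 7.10), and the "_shard_doc" tiebreaker used in the sort is, to my understanding, only available in later 7.x releases.

```python
import json
import urllib.request


def http_send(base_url):
    """Return a send(method, path, body) callable backed by urllib."""
    def send(method, path, body=None):
        data = None if body is None else json.dumps(body).encode("utf-8")
        req = urllib.request.Request(
            base_url + path, data=data,
            headers={"Content-Type": "application/json"}, method=method)
        with urllib.request.urlopen(req) as res:
            return json.load(res)
    return send


def search_with_pit(send, index, query, per=20, keep_alive="1m"):
    """Yield pages of hits that all come from one point-in-time view."""
    pit_id = send("POST", f"/{index}/_pit?keep_alive={keep_alive}")["id"]
    # search_after replaces "from", so drop it from the original body.
    body = {k: v for k, v in query.items() if k != "from"}
    body["size"] = per
    # Assumption: "_shard_doc" is available as a tiebreaker for equal scores.
    body["sort"] = [{"_score": "desc"}, {"_shard_doc": "asc"}]
    try:
        while True:
            body["pit"] = {"id": pit_id, "keep_alive": keep_alive}
            res = send("POST", "/_search", body)
            hits = res["hits"]["hits"]
            if not hits:
                break
            yield hits
            # Continue after the sort values of the last hit on this page.
            body["search_after"] = hits[-1]["sort"]
            # The response may carry an updated PIT id; prefer the latest.
            pit_id = res.get("pit_id", pit_id)
    finally:
        send("DELETE", "/_pit", {"id": pit_id})
```

Because every page is read from the same point-in-time view and ordered with an explicit tiebreaker, a document should not move between pages while paging. The search request goes to /_search without an index name because the PIT id already identifies the index.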

