Is it possible to eliminate duplicate hits from the search response when using a nested query?

Here is a sample mapping, the indexed documents, and the search query.

mapping

curl -X PUT "es:9200/english" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "title" : {
          "type" : "text"
        },
        "contents": {
          "type": "nested"
        }
      }
    }
  }
}
'

register

curl -X PUT "es:9200/english/_doc/1?refresh" -H 'Content-Type: application/json' -d'
{
  "title": "Test title",
  "contents": [
    {
      "header": "something special",
      "body": "I am John."
    },
    {
      "header": "anything hot",
      "body": "This is a cup."
    }
  ]
}
'

curl -X PUT "es:9200/english/_doc/2?refresh" -H 'Content-Type: application/json' -d'
{
  "title": "Test title",
  "contents": [
    {
      "header": "something special",
      "body": "I am John."
    },
    {
      "header": "anything hot",
      "body": "That is a glass."
    }
  ]
}
'

search

curl -XGET "es:9200/english/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "_source": false,
  "size": 20,
  "query": {
    "nested": {
      "path": "contents",
      "score_mode": "max",
      "query": {
        "simple_query_string": {
          "query": "I am",
          "fields": ["contents.header", "contents.body"],
          "auto_generate_synonyms_phrase_query": true
        }
      },
      "inner_hits": {
        "size": 1
      }
    }
  }
}
'

result

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.4723401,
    "hits" : [
      {
        "_index" : "english",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.4723401,
        "inner_hits" : {
          "contents" : {
            "hits" : {
              "total" : 1,
              "max_score" : 1.4723401,
              "hits" : [
                {
                  "_index" : "english",
                  "_type" : "_doc",
                  "_id" : "2",
                  "_nested" : {
                    "field" : "contents",
                    "offset" : 0
                  },
                  "_score" : 1.4723401,
                  "_source" : {
                    "header" : "something special",
                    "body" : "I am John."
                  }
                }
              ]
            }
          }
        }
      },
      {
        "_index" : "english",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.4723401,
        "inner_hits" : {
          "contents" : {
            "hits" : {
              "total" : 1,
              "max_score" : 1.4723401,
              "hits" : [
                {
                  "_index" : "english",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_nested" : {
                    "field" : "contents",
                    "offset" : 0
                  },
                  "_score" : 1.4723401,
                  "_source" : {
                    "header" : "something special",
                    "body" : "I am John."
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

There are two hits in the search response, and their matched content is the same; only the "_id" differs.

I would like to remove one of the hits because it duplicates the other.

If anyone has a good solution, please help me!


"Field Collapsing", the usual way to eliminate duplicates, doesn't seem to fit when using a nested query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-collapse.html

You should probably do that at index time, and basically index only one document.
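
One possible reading of this suggestion, sketched in Python under stated assumptions: treat each entry of "contents" as its own unit and derive its document ID from a hash of "header" and "body", so that identical sections end up as a single document. The helper names `section_id` and `unique_sections` are hypothetical, and the choice of SHA-1 over canonical JSON is an assumption, not something from the thread.

```python
import hashlib
import json


def section_id(section):
    """Derive a stable ID from a section's header and body (assumed dedup key)."""
    canonical = json.dumps(
        {"header": section["header"], "body": section["body"]},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def unique_sections(docs):
    """Map content-derived ID -> section, keeping the first copy of each."""
    sections = {}
    for doc in docs:
        for section in doc["contents"]:
            sections.setdefault(section_id(section), section)
    return sections


# The two documents registered in the question.
docs = [
    {
        "title": "Test title",
        "contents": [
            {"header": "something special", "body": "I am John."},
            {"header": "anything hot", "body": "This is a cup."},
        ],
    },
    {
        "title": "Test title",
        "contents": [
            {"header": "something special", "body": "I am John."},
            {"header": "anything hot", "body": "That is a glass."},
        ],
    },
]

sections = unique_sections(docs)
```

Each entry of `sections` could then be indexed with its key as the document ID. Because indexing with an existing ID overwrites the earlier copy, the "I am John." section would be stored once, and a search for "I am" would return a single hit. The trade-off is that the link back to the parent document has to be kept separately if it is still needed.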

Thank you for your reply.

I'll try changing the indexing so that only one document is indexed.