Is it possible to eliminate duplicates from the search response when dealing with very long text?

mapping

PUT /my_index/_mapping

{
  "properties": {
    "contents": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword",
          "ignore_above": 32766
        }
      }
    }
  }
}

register

POST /my_index/_doc/1

{
  "contents": "a very long text that exceeds 32766 bytes"
}

search

GET /my_index/_search

{
  "query": {
    "match": {
      "contents": "something"
    }
  },
  "collapse": {
    "field": "contents.raw"
  }
}

With this mapping, I can eliminate duplicates from the search response as long as the text is smaller than 32766 bytes.

However, when the text is larger than 32766 bytes, I can't: ignore_above causes values over the limit to be dropped from the raw keyword field, so collapsing on contents.raw no longer de-duplicates those documents.

Is there another way to meet this requirement?

A hash of the content would be a much shorter value to de-duplicate on. While it has no false negatives (it will always recognise a duplicate text), it can have a small number of false positives (declaring non-duplicate texts identical) when two different texts happen to produce the same hash.
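
For example, here is a minimal sketch of that approach using an ingest pipeline with the fingerprint processor (available in Elasticsearch 7.12 and later; on older versions you can compute the hash client-side before indexing). The field name contents_hash and the pipeline name hash_contents are just illustrative:

PUT /my_index/_mapping

{
  "properties": {
    "contents_hash": {
      "type": "keyword"
    }
  }
}

PUT /_ingest/pipeline/hash_contents

{
  "processors": [
    {
      "fingerprint": {
        "fields": ["contents"],
        "target_field": "contents_hash",
        "method": "SHA-256"
      }
    }
  ]
}

POST /my_index/_doc/1?pipeline=hash_contents

{
  "contents": "a very long text that exceeds 32766 bytes"
}

The hash is always short enough to index as a keyword, no matter how long the original text is.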

Thank you very much.

By using a hash of the content, I was able to solve the problem!
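
For reference, assuming the contents_hash field sketched above, the collapse query would then look something like this:

GET /my_index/_search

{
  "query": {
    "match": {
      "contents": "something"
    }
  },
  "collapse": {
    "field": "contents_hash"
  }
}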
