How to softly exclude ambiguous words?


The document should still match because "apple" itself is hit, not just "apple care". How can I implement this with query_string?

Hello @S-Dragon0302

As per the table:

POST /fruits/_doc
{
  "message": ["usa", "apple"]
}

POST /fruits/_doc
{
  "message": ["usa", "apple", "banana"]
}

POST /fruits/_doc
{
  "message": ["usa", "apple", "apple care"]
}

POST /fruits/_doc
{
  "message": ["usa", "apple care"]
}

POST /fruits/_doc
{
  "message": ["usa", "apple", "banana", "apple care"]
}

# ambiguous word = None
GET /fruits/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "usa" }},
        { "match": { "message": "apple" }}
      ],
      "must_not": [
        { "match": { "message": "banana" }}
      ]
    }
  }
}

# ambiguous word = apple care
GET /fruits/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "usa" }},
        { "match": { "message": "apple" }}
      ],
      "must_not": [
        { "match": { "message": "banana" }}
      ],
      "filter": {
        "script": {
          "script": {
            "source": """
              def kws = doc['message.keyword'].size() == 0 ? [] : doc['message.keyword'];
              return !kws.contains('apple care') || kws.contains('apple');
            """,
            "lang": "painless"
          }
        }
      }
    }
  }
}
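
To make the intended results easier to check, here is a minimal Python sketch that replays both queries against the five sample documents in memory. It is only an approximation: `tokens` imitates the analyzer by lowercasing and splitting on whitespace, and `DOCS`, `matches`, and `search` are hypothetical helpers, not part of any Elasticsearch client.

```python
# Hypothetical in-memory copy of the five sample documents indexed above.
DOCS = [
    ["usa", "apple"],
    ["usa", "apple", "banana"],
    ["usa", "apple", "apple care"],
    ["usa", "apple care"],
    ["usa", "apple", "banana", "apple care"],
]


def tokens(values):
    """Rough imitation of text analysis: lowercase and split on whitespace."""
    return {tok for value in values for tok in value.lower().split()}


def matches(values, must, must_not, ambiguous=None, base=None):
    """Apply the must / must_not clauses, then the soft-exclusion rule."""
    toks = tokens(values)
    if not all(term in toks for term in must):
        return False
    if any(term in toks for term in must_not):
        return False
    if ambiguous is not None:
        # Same predicate as the Painless script: keep the document unless
        # the ambiguous phrase is present without the standalone base word.
        return ambiguous not in values or base in values
    return True


def search(ambiguous=None, base=None):
    """Return the 1-based positions of the sample documents that match."""
    return [
        position
        for position, values in enumerate(DOCS, start=1)
        if matches(values, ["usa", "apple"], ["banana"], ambiguous, base)
    ]


print(search())                       # first query, no ambiguous word
print(search("apple care", "apple"))  # second query, "apple care" softly excluded
```

With no ambiguous word, documents 1, 3 and 4 match, because the token "apple" inside "apple care" satisfies the match clause. With "apple care" softly excluded, document 4 drops out and only documents 1 and 3 remain.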

Thanks!!

Scripts can achieve this, but the performance is too poor: I have hundreds of billions of entries, and the storage is at the PB level.
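
One script-free possibility is to express the same condition with plain `term` clauses, which are resolved from the index rather than evaluated per document. The sketch below builds such a request body in Python. It assumes the same `message.keyword` sub-field the script already relies on, and `soft_exclude_filter` is a hypothetical helper name. Whether this is fast enough at PB scale is untested.

```python
import json


def soft_exclude_filter(ambiguous, base, field="message.keyword"):
    """Build a bool filter equivalent to the script's predicate:
    the document has `base` as an exact value, OR it lacks `ambiguous`."""
    return {
        "bool": {
            "should": [
                {"term": {field: base}},
                {"bool": {"must_not": [{"term": {field: ambiguous}}]}},
            ],
            "minimum_should_match": 1,
        }
    }


# Same must / must_not clauses as the query above; only the filter changes.
query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"message": "usa"}},
                {"match": {"message": "apple"}},
            ],
            "must_not": [{"match": {"message": "banana"}}],
            "filter": [soft_exclude_filter("apple care", "apple")],
        }
    }
}

# Print the body so it can be pasted after GET /fruits/_search.
print(json.dumps(query, indent=2))
```

The `should` pair mirrors `!kws.contains('apple care') || kws.contains('apple')`, so a document holding only "apple care" fails both branches and is filtered out.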

Maybe of interest: in this demo users can choose which interpretation of the ambiguous search term “ice” they want and search for only that. This uses clustering and binary vectors so may require changes to both indexing and user behaviours but does solve this sort of problem.