Using a must_not filter on 'drone' returns a document containing 'drone'

Cluster analyzer settings:

"analysis": {
  "filter": {
    "english_stemmer": {
      "type": "stemmer",
      "language": "english"
    },
    "english_possessive_stemmer": {
      "type": "stemmer",
      "language": "possessive_english"
    }
  },
  "analyzer": {
    "eng_stemmer": {
      "tokenizer": "whitespace",
      "char_filter": [ "html_strip", "custom_char_filter_start_of_word", "custom_char_filter_end_of_word" ],
      "filter": [
        "english_possessive_stemmer",
        "lowercase",
        "english_stemmer"
      ]
    }
  },
  "char_filter": {
    "custom_char_filter_end_of_word": {
      "type": "pattern_replace",
      "pattern": "[\\W]+(\\s|$)",
      "replacement": " "
    },
    "custom_char_filter_start_of_word": {
      "type": "pattern_replace",
      "pattern": "(\\s|^)[\\W]+",
      "replacement": " "
    }
  }
}

When using this query:

{
  "query": {
    "bool":{
      "must_not": [
        {
          "match": {
            "title": "drone"
          }
        }
      ],
      "must": [
        {
          "match": {
            "title": "wing"
          }
        }
      ]
    }
  }
}

I get a document with the following title in my results:

"title": "Hydrodynamic or streamlined profile for forming e.g. drone`s wing, has core comprising active section deformed under effect of variation of temperature of active layer inducing amplitude and direction deformation in zones of envelope",

I ran /_analyze on both the search term and the title, and both appeared to be stemmed to 'drone', yet the document is not filtered out of the results.

Am I missing something or is this a bug?

What is the mapping for the field title?

Is "drone`s" really analyzed as "drone"? Could you share the output of:

GET /INDEX/_analyze
{
  "field" : "title",
  "text" : "Hydrodynamic or streamlined profile for forming e.g. drone`s wing, has core comprising active section deformed under effect of variation of temperature of active layer inducing amplitude and direction deformation in zones of envelope"
}

Thank you for the reply.

The title field uses the 'eng_stemmer' analyzer.

I just redid the _analyze and found the issue (I must have overlooked it earlier):

drone's

is reduced to

drone'

(notice the single quote not being removed).

I expected the 'english_stemmer' or 'english_possessive_stemmer' to remove the quote as well, since this is a regular occurrence in the English language?

Edit: perhaps it's because it doesn't appear to be a standard single quote?

So that's why a search for "drone" cannot match "drone'"...
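A quick way to confirm this is to run both quote variants through the same tokenizer and token filters without going through the index. The request below is only a sketch: it inlines the components from the settings above, leaves the char filters out for brevity, and assumes an Elasticsearch version that accepts inline filter definitions in _analyze. Going by the result reported above, the backtick variant is expected to keep its quote character, while the variant with a standard apostrophe should lose the possessive ending.

```console
# Sketch: compare a standard apostrophe with a backtick.
# The sample text is made up for illustration.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    { "type": "stemmer", "language": "possessive_english" },
    "lowercase",
    { "type": "stemmer", "language": "english" }
  ],
  "text": "drone's drone`s"
}
```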

Instead of using a whitespace tokenizer, I'd use the standard one. Could you check that?

We can't use the standard tokenizer. We need to match terms such as "C/C", but the standard tokenizer drops the forward slash, so documents containing "c c" end up in the results.

I checked the documentation, and I don't think the standard tokenizer would work anyway.
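For anyone who wants to check the slash behaviour described above, the two tokenizers can be compared with requests along these lines. This is a sketch only; the text value is the example term from this thread, and the difference to look for is whether "C/C" comes back as a single token or is split.

```console
# Standard tokenizer: the thread reports that the slash is dropped here.
GET /_analyze
{
  "tokenizer": "standard",
  "text": "C/C"
}

# Whitespace tokenizer: splits on whitespace only.
GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "C/C"
}
```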

Maybe look at

Replacing every ` with a standard single quote (using a mapping char filter) seems to work. "drone`s" is now correctly stemmed to "drone".
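For reference, a mapping char filter of this kind might look roughly like the sketch below. The index name my-index and the filter name backtick_to_apostrophe are hypothetical, and the two custom pattern_replace char filters from the original settings are left out to keep the example short; they would stay in the char_filter list in a real setup.

```console
# Sketch only: "my-index" and "backtick_to_apostrophe" are made-up names.
PUT /my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "backtick_to_apostrophe": {
          "type": "mapping",
          "mappings": [ "` => '" ]
        }
      },
      "filter": {
        "english_stemmer": { "type": "stemmer", "language": "english" },
        "english_possessive_stemmer": { "type": "stemmer", "language": "possessive_english" }
      },
      "analyzer": {
        "eng_stemmer": {
          "tokenizer": "whitespace",
          "char_filter": [ "html_strip", "backtick_to_apostrophe" ],
          "filter": [ "english_possessive_stemmer", "lowercase", "english_stemmer" ]
        }
      }
    }
  }
}

# Re-check the problem term with the new analyzer.
GET /my-index/_analyze
{
  "analyzer": "eng_stemmer",
  "text": "drone`s wing"
}
```

Because char filters run before the tokenizer, the possessive stemmer then sees a standard apostrophe and can strip the ending. Existing documents would need to be reindexed for the change to affect them.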