Problem understanding phrase matching with stop words


(Janaka Bandara) #1

Hi,
I have created an index with the following settings and mapping:

PUT test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "analysis": {
        "analyzer": {
          "english": {
            "tokenizer": "standard",
            "filter": ["trim", "lowercase", "english_possessive_stemmer", "my_stop", "light_english_stemmer","shingle"]
          }
        },
        "filter": {
          "english_possessive_stemmer":{
            "type" : "stemmer",
            "language" : "possessive_english"
          },
          "light_english_stemmer":{
            "type" : "stemmer",
            "language" : "light_english"
          },
          "my_stop":{
            "type" : "stop",
            "stopwords" : ["what", "is"],
            "remove_trailing" : "false"
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}

And inserted the following documents:

POST test/test?refresh=true
{"title": "hilton colombo"}
POST test/test?refresh=true
{"title": "colombo hilton"}

Now when I search for "what is hilton colombo" I do not get any results.

GET test/test/_search
{
  "query": {
    "match": {
      "title": {
        "query": "what is hilton colombo",
        "analyzer": "english",
        "operator": "and"
      }
    }
  }
}

But when I analyze the query, I can see "hilton colombo" as a token:

GET test/_analyze
{
  "analyzer": "english",
  "text": "what is hilton colombo"
}

Shouldn't the search analyzer remove any stop words and match the rest of the words against the document titles?

What am I missing here?

Can anyone please help me to figure this one out? Thank you :slightly_smiling_face:


(Abdon Pijpelink) #2

This behavior is caused by combining the stop and shingle filters. Take a look at the first token in the response of your _analyze request:

  "token": "_ hilton",
  "start_offset": 8,
  "end_offset": 14,
  "type": "shingle",
  "position": 0,
  "positionLength": 2

The underscore in that token is the "filler token": the string that the shingle filter inserts wherever a stop word was removed. Because your documents do not contain those stop words, no "_ hilton" token was indexed for them. And because your query uses the and operator, every query token has to match, so no documents are returned.
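
To make that concrete, here is a rough Python sketch of how the two filters interact. It is only a simplified model, not Lucene's implementation: it skips the lowercase and stemmer details, assumes shingles of size 2 with unigrams kept, and the function names (stop_filter, shingle_filter, analyze) are made up for illustration.

```python
# Simplified model of the stop + shingle filters (NOT Lucene's actual code).
# None marks a position where a stop word was removed.

def stop_filter(tokens, stopwords):
    """Replace stop words with None so the position gap stays visible."""
    return [None if t in stopwords else t for t in tokens]

def shingle_filter(tokens, filler="_", size=2):
    """Emit unigrams plus shingles of `size` tokens, filling gaps with `filler`."""
    out = []
    for i, tok in enumerate(tokens):
        if tok is not None:
            out.append(tok)  # unigram
        window = tokens[i:i + size]
        # Skip incomplete windows and windows made only of gaps.
        if len(window) == size and any(t is not None for t in window):
            out.append(" ".join(filler if t is None else t for t in window))
    return out

def analyze(text, stopwords=("what", "is")):
    """Hypothetical stand-in for the 'english' analyzer, stemmers omitted."""
    return shingle_filter(stop_filter(text.lower().split(), stopwords))

query_tokens = analyze("what is hilton colombo")
doc_tokens = analyze("hilton colombo")
# Query tokens that were never indexed for the document.
missing = [t for t in query_tokens if t not in doc_tokens]

print(query_tokens)
print(doc_tokens)
print(missing)
```

In this model the only query token missing from the document is "_ hilton", which is the token that blocks the match.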

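One way to resolve this is to set filler_token to "" in the shingle filter and move the trim filter to the end of the chain. With an empty filler the shingle becomes " hilton" with a leading space, which the trim filter then reduces to "hilton":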

PUT test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "analysis": {
        "analyzer": {
          "english": {
            "tokenizer": "standard",
            "filter": ["lowercase", "english_possessive_stemmer", "my_stop", "light_english_stemmer","my_shingle", "trim"]
          }
        },
        "filter": {
          "english_possessive_stemmer":{
            "type" : "stemmer",
            "language" : "possessive_english"
          },
          "light_english_stemmer":{
            "type" : "stemmer",
            "language" : "light_english"
          },
          "my_stop":{
            "type" : "stop",
            "stopwords" : ["what", "is"],
            "remove_trailing" : "false"
          },
          "my_shingle":{
            "type": "shingle",
            "filler_token": ""
          }
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}
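
As a sanity check of the idea, the revised chain can be modelled in a few lines of Python. Again, this is a sketch under assumptions rather than what Elasticsearch actually runs: stemming is left out, shingle size is fixed at 2, and analyze_fixed is a hypothetical helper name.

```python
# Simplified model of the revised chain: stop -> shingle(filler_token="") -> trim.
# This is an illustration only, not Lucene's implementation.
STOPWORDS = {"what", "is"}

def analyze_fixed(text, filler="", size=2):
    # None marks the position of a removed stop word.
    tokens = [None if t in STOPWORDS else t for t in text.lower().split()]
    out = []
    for i, tok in enumerate(tokens):
        if tok is not None:
            out.append(tok)  # unigram
        window = tokens[i:i + size]
        # Skip incomplete windows and windows made only of gaps.
        if len(window) == size and any(t is not None for t in window):
            out.append(" ".join(filler if t is None else t for t in window))
    # trim runs last, so the leading space left by the empty filler is removed.
    return [t.strip() for t in out]

fixed_query_tokens = analyze_fixed("what is hilton colombo")
fixed_doc_tokens = analyze_fixed("hilton colombo")
# Query tokens that were never indexed for the document.
still_missing = [t for t in fixed_query_tokens if t not in fixed_doc_tokens]

print(fixed_query_tokens)
print(still_missing)
```

With the empty filler and the trailing trim, every query token in this model also exists for the "hilton colombo" document, so nothing is left to block the match.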

(Janaka Bandara) #3

Thank you very much @abdon. I didn't know about the filler_token. Cheers :slightly_smiling_face:

