Custom stopwords not working with custom tokenizer

alex.bour · April 4, 2017, 2:14pm

Hello ES users,

I am having trouble getting my custom stopwords to work with my custom ngram tokenizer.

Here is my current mapping:

analysis: {
      analyzer: {
        custom_index_analyzer: {
          type: "custom",
          filter: ["lowercase", "asciifolding", "custom_word_delimiter", "custom_unique_token", "custom_en_stopwords", "custom_fr_stopwords", "custom_de_stopwords", "custom_es_stopwords", "custom_it_stopwords", "custom_pt_stopwords"],                           
          tokenizer: "ngram_tokenizer",  
        },
        custom_search_analyzer: {
          type: "custom",
          filter: ["lowercase", "asciifolding", "custom_word_delimiter", "custom_unique_token", "custom_en_stopwords", "custom_fr_stopwords", "custom_de_stopwords", "custom_es_stopwords", "custom_it_stopwords", "custom_pt_stopwords"],          
          tokenizer: "ngram_tokenizer",
        }
      },
      tokenizer: {
        ngram_tokenizer: {
          type: "nGram",
          min_gram: "3",
          max_gram: "3",
          token_chars: [ "letter", "digit" ]          
        }
      },      
      filter: {
        custom_word_delimiter: {
          type: "word_delimiter"
        },
        custom_unique_token: {
          type: "unique",
          only_on_same_position: "false"
        },
        custom_en_stopwords: {
          type: "stop",
          stopwords: ["winery", "wineries", "cellar", "cellars", "vineyard", "vineyards", "wine", "wines", "estate", "estates", "family", "families", "winegrower", "winegrowers", "company"],
          ignore_case: "true"
        },
        custom_fr_stopwords: {
          type: "stop",
          stopwords: ["chateau", "chateaux", "domaine", "domaines", "cave", "caves", "vignoble", "vignobles", "vin", "vins", "vigneron", "vignerons", "maison", "maisons", "ch", "ch."],
          ignore_case: "true"
        },        
      }
    }

A simple test with the _analyze API returns every token of the sentence, including the ones that come from the stopword "chateau":

curl -XPOST 'localhost:9200/wines/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_search_analyzer",
  "text": "Château Cheval Blanc"
}
'
{
  "tokens" : [
    {
      "token" : "cha",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hat",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ate",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "tea",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "eau",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "che",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "hev",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "eva",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "val",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "bla",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "lan",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "anc",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "word",
      "position" : 11
    }
  ]
}
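To make the output above easier to reason about, here is a minimal Python sketch of the same chain. It is only an approximation, not Elasticsearch itself: the helper names (`trigram_tokenize`, `fold`, `stop_filter`) are made up for illustration, `fold` is a rough stand-in for the lowercase and asciifolding filters, and the word_delimiter and unique filters are left out. It shows that once the tokenizer has emitted 3-character grams, a stop filter that compares whole tokens never sees "chateau".

```python
import re
import unicodedata

# Stopword list copied from custom_fr_stopwords in the mapping above.
FR_STOPWORDS = {
    "chateau", "chateaux", "domaine", "domaines", "cave", "caves",
    "vignoble", "vignobles", "vin", "vins", "vigneron", "vignerons",
    "maison", "maisons", "ch", "ch.",
}


def trigram_tokenize(text, n=3):
    """Approximate an ngram tokenizer with min_gram = max_gram = n
    and token_chars = [letter, digit]."""
    tokens = []
    # Split on anything that is not a letter or digit, then slide a window.
    for run in re.findall(r"[^\W_]+", text):
        tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens


def fold(token):
    """Rough stand-in for the lowercase + asciifolding filters."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def stop_filter(tokens, stopwords):
    """Drop tokens that exactly match a stopword, as a stop filter does."""
    return [t for t in tokens if t not in stopwords]


# Tokenizer first, then the token filters, in the order the analyzer applies them.
tokens = [fold(t) for t in trigram_tokenize("Château Cheval Blanc")]
filtered = stop_filter(tokens, FR_STOPWORDS)

print(tokens)
print(filtered == tokens)  # nothing was removed
```

In this sketch the stop filter only ever receives grams such as "cha" and "eau", so the full word "chateau" is never available for it to match.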

When I look at real content in my app, "Château Cheval Blanc" and "Cheval Blanc" do not get the same score, although I expect them to, since "Château" is a stopword once it is lowercased and ASCII-folded.
Currently:

Cheval Blanc 35.72655
Château Cheval Blanc 32.094658

They both should have the same score.
What did I miss in my mapping? Thanks.

system · May 2, 2017, 2:14pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
