Hello ES users,
I'm having trouble getting my custom stopwords to work with my custom nGram tokenizer.
Here is my current mapping:
analysis: {
  analyzer: {
    custom_index_analyzer: {
      type: "custom",
      filter: ["lowercase", "asciifolding", "custom_word_delimiter", "custom_unique_token", "custom_en_stopwords", "custom_fr_stopwords", "custom_de_stopwords", "custom_es_stopwords", "custom_it_stopwords", "custom_pt_stopwords"],
      tokenizer: "ngram_tokenizer",
    },
    custom_search_analyzer: {
      type: "custom",
      filter: ["lowercase", "asciifolding", "custom_word_delimiter", "custom_unique_token", "custom_en_stopwords", "custom_fr_stopwords", "custom_de_stopwords", "custom_es_stopwords", "custom_it_stopwords", "custom_pt_stopwords"],
      tokenizer: "ngram_tokenizer",
    }
  },
  tokenizer: {
    ngram_tokenizer: {
      type: "nGram",
      min_gram: "3",
      max_gram: "3",
      token_chars: [ "letter", "digit" ]
    }
  },
  filter: {
    custom_word_delimiter: {
      type: "word_delimiter"
    },
    custom_unique_token: {
      type: "unique",
      only_on_same_position: "false"
    },
    custom_en_stopwords: {
      type: "stop",
      stopwords: ["winery", "wineries", "cellar", "cellars", "vineyard", "vineyards", "wine", "wines", "estate", "estates", "family", "families", "winegrower", "winegrowers", "company"],
      ignore_case: "true"
    },
    custom_fr_stopwords: {
      type: "stop",
      stopwords: ["chateau", "chateaux", "domaine", "domaines", "cave", "caves", "vignoble", "vignobles", "vin", "vins", "vigneron", "vignerons", "maison", "maisons", "ch", "ch."],
      ignore_case: "true"
    },
  }
}
A simple test still returns every token of the sentence, including the ones that come from the stopword "chateau":
curl -XPOST 'localhost:9200/wines/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
"analyzer": "custom_search_analyzer",
"text": "Château Cheval Blanc"
}
'
{
  "tokens" : [
    {
      "token" : "cha",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "hat",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ate",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "tea",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "eau",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "che",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "hev",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "eva",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "val",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "bla",
      "start_offset" : 15,
      "end_offset" : 18,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "lan",
      "start_offset" : 16,
      "end_offset" : 19,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "anc",
      "start_offset" : 17,
      "end_offset" : 20,
      "type" : "word",
      "position" : 11
    }
  ]
}
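For reference, here is a rough Python sketch of how I understand the chain to run: the tokenizer goes first, then each token filter sees the individual tokens, and the stop filter compares whole tokens against the list. This is only a simulation, not Elasticsearch code. The function names (ngram_tokenize, ascii_fold, analyze) are made up for the illustration, the word_delimiter and unique filters are left out, and only the French stopword list is included.

```python
import unicodedata

# French stopword list copied from custom_fr_stopwords in the mapping above.
FR_STOPWORDS = {
    "chateau", "chateaux", "domaine", "domaines", "cave", "caves",
    "vignoble", "vignobles", "vin", "vins", "vigneron", "vignerons",
    "maison", "maisons", "ch", "ch.",
}


def ngram_tokenize(text, n=3):
    """Hypothetical stand-in for the nGram tokenizer (min_gram = max_gram = n).

    Splits on anything that is not a letter or digit, then emits every
    n-character window inside each run.
    """
    tokens = []
    run = []
    for ch in text + " ":  # trailing space flushes the last run
        if ch.isalnum():
            run.append(ch)
        else:
            word = "".join(run)
            tokens.extend(word[i:i + n] for i in range(len(word) - n + 1))
            run = []
    return tokens


def ascii_fold(token):
    """Rough approximation of asciifolding: drop combining accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def analyze(text):
    # 1. The tokenizer runs first and produces trigrams.
    tokens = ngram_tokenize(text)
    # 2. lowercase + asciifolding are applied to each trigram.
    tokens = [ascii_fold(t.lower()) for t in tokens]
    # 3. The stop filter only drops tokens that equal a stopword exactly.
    return [t for t in tokens if t not in FR_STOPWORDS]


print(analyze("Château Cheval Blanc"))
```

Under these assumptions the sketch returns the same twelve trigrams as the _analyze call above: by the time the stop filter runs, it only sees tokens like "cha" or "eau", and none of them is equal to "chateau".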
Looking at my app with real content, "Château Cheval Blanc" and "Cheval Blanc" don't get the same score, although they should, since "Château" is a stopword (once lowercased and ASCII-folded).
Currently:
Cheval Blanc 35.72655
Château Cheval Blanc 32.094658
They both should have the same score.
What did I miss in my mapping? Thanks.