I'm trying to add some stopwords to the default settings that haystack is
using, and the settings look like this (I added "esto", "que" and "de" just
for testing purposes):
'settings': {
    "analysis": {
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "lowercase",
                "filter": ["ramon_stopwords", "haystack_ngram"]
            },
            "edgengram_analyzer": {
                "type": "custom",
                "tokenizer": "lowercase",
                "filter": ["ramon_stopwords", "haystack_edgengram"]
            }
        },
        "tokenizer": {
            "haystack_ngram_tokenizer": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 15
            },
            "haystack_edgengram_tokenizer": {
                "type": "edgeNGram",
                "min_gram": 2,
                "max_gram": 15,
                "side": "front"
            }
        },
        "filter": {
            "haystack_ngram": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 15
            },
            "haystack_edgengram": {
                "type": "edgeNGram",
                "min_gram": 2,
                "max_gram": 15
            },
            "ramon_stopwords": {
                "type": "stop",
                "stopwords": ["esto", "de", "que"]
            }
        }
    }
}
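Just to be explicit about what I expect ngram_analyzer to do, here is a rough
Python simulation of the chain (not Elasticsearch code, just my understanding
of the lowercase tokenizer → stop filter → nGram filter pipeline):

```python
import re

STOPWORDS = {"esto", "de", "que"}  # same words as in "ramon_stopwords" above

def lowercase_tokenizer(text):
    # Roughly mimics the "lowercase" tokenizer: split on non-letters, lowercase
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def stop_filter(tokens, stopwords=STOPWORDS):
    # Mimics the "stop" token filter: drop stopwords
    return [t for t in tokens if t not in stopwords]

def ngram_filter(tokens, min_gram=3, max_gram=15):
    # Mimics the "nGram" token filter: emit all substrings of allowed lengths;
    # tokens shorter than min_gram produce no grams at all
    grams = []
    for t in tokens:
        for n in range(min_gram, min(max_gram, len(t)) + 1):
            for i in range(len(t) - n + 1):
                grams.append(t[i:i + n])
    return grams

tokens = ngram_filter(stop_filter(lowercase_tokenizer("esto is a test que")))
print(tokens)  # ['tes', 'est', 'test'] — "esto" and "que" removed before n-gramming
```

So for "esto is a test que" I would expect "esto" and "que" to be gone before
the n-gram stage even runs.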
The settings look like this for the haystack index:
$ curl -XGET 'http://localhost:9200/haystack/_settings?pretty=true'
{
  "haystack" : {
    "settings" : {
      "index.analysis.filter.haystack_edgengram.min_gram" : "2",
      "index.analysis.filter.haystack_ngram.max_gram" : "15",
      "index.analysis.tokenizer.haystack_ngram_tokenizer.max_gram" : "15",
      "index.analysis.analyzer.edgengram_analyzer.type" : "custom",
      "index.analysis.tokenizer.haystack_edgengram_tokenizer.min_gram" : "2",
      "index.analysis.filter.ramon_stopwords.stopwords.2" : "que",
      "index.analysis.filter.ramon_stopwords.stopwords.1" : "de",
      "index.analysis.filter.ramon_stopwords.stopwords.0" : "esto",
      "index.analysis.tokenizer.haystack_ngram_tokenizer.min_gram" : "3",
      "index.analysis.analyzer.ngram_analyzer.tokenizer" : "lowercase",
      "index.analysis.filter.haystack_ngram.min_gram" : "3",
      "index.analysis.analyzer.edgengram_analyzer.tokenizer" : "lowercase",
      "index.analysis.filter.haystack_edgengram.max_gram" : "15",
      "index.analysis.filter.haystack_ngram.type" : "nGram",
      "index.analysis.analyzer.edgengram_analyzer.filter.1" : "haystack_edgengram",
      "index.analysis.tokenizer.haystack_edgengram_tokenizer.type" : "edgeNGram",
      "index.analysis.analyzer.edgengram_analyzer.filter.0" : "ramon_stopwords",
      "index.analysis.tokenizer.haystack_edgengram_tokenizer.side" : "front",
      "index.analysis.filter.ramon_stopwords.type" : "stop",
      "index.analysis.filter.haystack_edgengram.type" : "edgeNGram",
      "index.analysis.tokenizer.haystack_ngram_tokenizer.type" : "nGram",
      "index.analysis.tokenizer.haystack_edgengram_tokenizer.max_gram" : "15",
      "index.analysis.analyzer.ngram_analyzer.filter.1" : "haystack_ngram",
      "index.analysis.analyzer.ngram_analyzer.filter.0" : "ramon_stopwords",
      "index.analysis.analyzer.ngram_analyzer.type" : "custom",
      "index.number_of_shards" : "5",
      "index.number_of_replicas" : "1",
      "index.version.created" : "191199"
    }
  }
}
Which looks right to me. But when I test it, only the English stopwords are
applied, and the ones I added are ignored. See how "is" is filtered out here
while "esto" and "que" come through:
$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&pretty=true'
{
  "tokens" : [ {
    "token" : "esto",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "que",
    "start_offset" : 15,
    "end_offset" : 18,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}
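For what it's worth, those tokens look like what the standard analyzer with
the default English stopwords would produce, not what my custom analyzer
should produce. A quick Python sketch (using a hand-picked subset of the
default English stopword list, which I'm assuming is the relevant part)
reproduces exactly that output:

```python
# Small subset of the default English stopword list (enough for this example)
ENGLISH_STOPWORDS = {"a", "an", "and", "are", "as", "at", "is", "it", "the", "to"}

def standard_like_analyze(text):
    # Rough sketch of the default analyzer: lowercase, split on whitespace,
    # drop English stopwords; no n-gramming, no custom stopwords
    return [t for t in text.lower().split() if t not in ENGLISH_STOPWORDS]

print(standard_like_analyze("esto is a test que"))  # ['esto', 'test', 'que']
```

That matches the _analyze output above exactly, which makes me think my
custom analyzer is simply not being used for this request.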
The only way I've managed to change the stopwords is by specifying the
analyzer in the query; I have tried setting it in the settings too, but that
doesn't work either. For example, this query with the Spanish analyzer works:
$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&analyzer=spanish&pretty=true'
{
  "tokens" : [ {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
Any hint as to where this might be failing?
Many thanks.