Can't get stop words working

I'm trying to add some stopwords to the default settings that haystack is
using, and the settings look like this (I added "esto", "que" and "de" just
for testing purposes):

    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["ramon_stopwords", "haystack_ngram"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["ramon_stopwords", "haystack_edgengram"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 2,
                    "max_gram": 15
                },
                "ramon_stopwords": {
                    "type": "stop",
                    "stopwords": ["esto","de","que"]
                }
            }
        }
    }
}

The settings look like this for the haystack index:

$ curl -XGET 'http://localhost:9200/haystack/_settings?pretty=true'
{
"haystack" : {
"settings" : {
"index.analysis.filter.haystack_edgengram.min_gram" : "2",
"index.analysis.filter.haystack_ngram.max_gram" : "15",
"index.analysis.tokenizer.haystack_ngram_tokenizer.max_gram" : "15",
"index.analysis.analyzer.edgengram_analyzer.type" : "custom",
"index.analysis.tokenizer.haystack_edgengram_tokenizer.min_gram" : "2",
"index.analysis.filter.ramon_stopwords.stopwords.2" : "que",
"index.analysis.filter.ramon_stopwords.stopwords.1" : "de",
"index.analysis.filter.ramon_stopwords.stopwords.0" : "esto",
"index.analysis.tokenizer.haystack_ngram_tokenizer.min_gram" : "3",
"index.analysis.analyzer.ngram_analyzer.tokenizer" : "lowercase",
"index.analysis.filter.haystack_ngram.min_gram" : "3",
"index.analysis.analyzer.edgengram_analyzer.tokenizer" : "lowercase",
"index.analysis.filter.haystack_edgengram.max_gram" : "15",
"index.analysis.filter.haystack_ngram.type" : "nGram",
"index.analysis.analyzer.edgengram_analyzer.filter.1" : "haystack_edgengram",
"index.analysis.tokenizer.haystack_edgengram_tokenizer.type" : "edgeNGram",
"index.analysis.analyzer.edgengram_analyzer.filter.0" : "ramon_stopwords",
"index.analysis.tokenizer.haystack_edgengram_tokenizer.side" : "front",
"index.analysis.filter.ramon_stopwords.type" : "stop",
"index.analysis.filter.haystack_edgengram.type" : "edgeNGram",
"index.analysis.tokenizer.haystack_ngram_tokenizer.type" : "nGram",
"index.analysis.tokenizer.haystack_edgengram_tokenizer.max_gram" : "15",
"index.analysis.analyzer.ngram_analyzer.filter.1" : "haystack_ngram",
"index.analysis.analyzer.ngram_analyzer.filter.0" : "ramon_stopwords",
"index.analysis.analyzer.ngram_analyzer.type" : "custom",
"index.number_of_shards" : "5",
"index.number_of_replicas" : "1",
"index.version.created" : "191199"
}
}

That looks right to me. But when I test it, only the English stopwords are
applied and the ones I added are ignored. See how "is" and "a" are filtered
here, while "esto" and "que" are kept:

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&pretty=true'
{
"tokens" : [ {
"token" : "esto",
"start_offset" : 0,
"end_offset" : 4,
"type" : "",
"position" : 1
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "",
"position" : 4
}, {
"token" : "que",
"start_offset" : 15,
"end_offset" : 18,
"type" : "",
"position" : 5
} ]

The only way I have managed to change the stopwords is by changing the
analyzer in the query. I have tried changing it in the settings too, but that
doesn't work either. This example with the Spanish analyzer works:

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&analyzer=spanish&pretty=true'
{
"tokens" : [ {
"token" : "is",
"start_offset" : 5,
"end_offset" : 7,
"type" : "",
"position" : 2
}, {
"token" : "test",
"start_offset" : 10,
"end_offset" : 14,
"type" : "",
"position" : 4
} ]
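
The `analyzer` query parameter used above can also name an analyzer defined in the index settings, which gives a direct way to check whether `ngram_analyzer` itself behaves as intended. The following is only a sketch that builds such a URL; the helper name `analyze_url` and its `host` default are assumptions made for illustration, and nothing here contacts a server.

```python
from urllib.parse import urlencode


def analyze_url(index, text, analyzer=None, host="localhost:9200"):
    """Build an _analyze URL shaped like the curl calls in this post.

    `analyze_url` is a hypothetical helper, not part of any client library.
    """
    params = {"text": text}
    if analyzer is not None:
        # Name the analyzer explicitly instead of relying on the index default.
        params["analyzer"] = analyzer
    params["pretty"] = "true"
    # urlencode turns spaces into '+', matching the hand-written URLs above.
    return "http://{}/{}/_analyze?{}".format(host, index, urlencode(params))


url = analyze_url("haystack", "esto is a test que", analyzer="ngram_analyzer")
print(url)
```

Passing the resulting URL to `curl -XGET` should show the tokens produced by the custom analyzer rather than by the index default.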

Any hint to where this might be failing?

Many thanks.

--

Hi,

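You have just defined an analyzer. Fine.
Now you have to apply it in your mapping [1].
By default, ES uses the standard analyzer. You can change the default analyzer: [2]

[1] http://www.elasticsearch.org/guide/reference/api/admin-indices-put-mapping.html
[2] http://www.elasticsearch.org/guide/reference/index-modules/analysis/index.html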

HTH
David.
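
To illustrate why the first `_analyze` call kept "esto" and "que" but dropped "is" and "a", here is a toy model of a stop filter. It is not how Elasticsearch is implemented: the whitespace tokenizer, the function name `stop_filter` and the abbreviated English word list are all assumptions for the sake of the example. It only shows that the observed tokens are consistent with an English stop list being applied instead of `ramon_stopwords`.

```python
# Abbreviated, illustrative subset of an English stop word list (assumption).
ENGLISH_STOPWORDS = {"a", "an", "and", "is", "of", "the", "this", "to"}
# The custom list from the `ramon_stopwords` filter in the original post.
CUSTOM_STOPWORDS = {"esto", "de", "que"}


def stop_filter(text, stopwords):
    """Lowercase, split on whitespace and drop stop words.

    Positions are 1-based and keep their gaps, as in the _analyze output.
    """
    tokens = text.lower().split()
    return [(tok, pos) for pos, tok in enumerate(tokens, start=1)
            if tok not in stopwords]


text = "esto is a test que"
# What the index returned: the default analyzer with English stop words.
print(stop_filter(text, ENGLISH_STOPWORDS))  # [('esto', 1), ('test', 4), ('que', 5)]
# What the custom stop filter would keep if the analyzer were actually applied.
print(stop_filter(text, CUSTOM_STOPWORDS))   # [('is', 2), ('a', 3), ('test', 4)]
```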

On 26 November 2012 at 20:50, racedo ramon@linux-labs.net wrote:

--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

Hi David,

Your feedback really helps me to understand it better (I only started with
ES last weekend!). So, after defining an analyzer, is a mapping mandatory?
As per the ES help page [1] I understood that it was needed when a
different analyzer would be applied to different document fields.

So far I've managed to get it working by using "default" on my index and
without a mapping:

    'settings': {
       "analysis": {
          "analyzer": {
             "default": {
                 "type": "spanish",
                 "stopwords": ["_spanish_","quot"]
             }
          },
       }
    }
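
As a sketch of how the fragment above might look as a complete request body: the snippet is a Python dict, where trailing commas are allowed, but the body sent to Elasticsearch has to be strict JSON. Serialising with `json.dumps` takes care of that. Wrapping the fragment in a top-level `"settings"` key and sending it when the index is created (for example with a PUT to `http://localhost:9200/haystack`) are assumptions here, not something confirmed in this thread.

```python
import json

# The "default" analyzer from the snippet above, wrapped as an index body.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "spanish",
                    # "_spanish_" is the built-in list; "quot" is an extra word.
                    "stopwords": ["_spanish_", "quot"],
                },  # trailing commas are fine in Python, but not in JSON
            },
        },
    },
}

# json.dumps emits strict JSON, so the trailing commas disappear.
payload = json.dumps(index_body, indent=2)
print(payload)
```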

Should I rather use a mapping like the one in [1] then? Apologies if this
is a basic question, I must be misinterpreting the documentation.

Thanks again.

[1] Elasticsearch Platform — Find real-time answers at scale | Elastic

On Tuesday, November 27, 2012 10:06:18 AM UTC, David Pilato wrote:


--

When you have defined a custom analyzer, you have to specify where you want to
apply it.
When you send your first document, ES computes a mapping automagically.

You can retrieve this mapping with: curl localhost:9200/index/type/_mapping

Then you can adapt it (add your analyzer to one or more fields, as needed) and
send it back to ES.

First, delete all documents:
curl -XDELETE localhost:9200/index/type

Then send the updated mapping:

curl -XPUT 'http://localhost:9200/twitter/tweet/_mapping' -d '
{
"tweet" : {
"properties" : {
"message" : {"type" : "string", "analyzer" : "youranalyzername"}
}
}
}
'

Then send your first document.
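
The mapping body from the curl example can also be generated rather than written by hand, which helps when several fields need the same analyzer. This is a minimal sketch: the helper names `field_mapping` and `mapping_url` are hypothetical, the `"string"` field type is copied from the example above, and `ngram_analyzer` is the analyzer name from the original post. It only builds the URL and the JSON body and does not send anything.

```python
import json


def field_mapping(doc_type, field, analyzer):
    """Return a mapping body that applies `analyzer` to a single field."""
    return {
        doc_type: {
            "properties": {
                field: {"type": "string", "analyzer": analyzer},
            }
        }
    }


def mapping_url(index, doc_type, host="localhost:9200"):
    """Build the _mapping URL for a type, as in the curl example."""
    return "http://{}/{}/{}/_mapping".format(host, index, doc_type)


body = field_mapping("tweet", "message", "ngram_analyzer")
print(mapping_url("twitter", "tweet"))
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to `curl -XPUT ... -d` in place of the hand-written body.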

Does it help?

David

On 27 November 2012 at 19:13, racedo ramon@linux-labs.net wrote:


--

It does indeed help. Actually my problem was a bug in django haystack
(https://github.com/toastdriven/django-haystack/issues/686) which manages
the mappings through pyelasticsearch.

Your suggestions are great. Thanks to the above bug (which had me
researching mappings non-stop) and your feedback, I got up to speed with
Elasticsearch in three intensive days. Really appreciated.

Ramon

On Tuesday, 27 November 2012 20:44:33 UTC, David Pilato wrote:
