I'm trying to add some stopwords to the default settings that haystack is
using, and the settings look like this (I added "esto", "que" and "de" just
for testing purposes):
'settings': {
    "analysis": {
        "analyzer": {
            "ngram_analyzer": {
                "type": "custom",
                "tokenizer": "lowercase",
                "filter": ["ramon_stopwords", "haystack_ngram"]
            },
            "edgengram_analyzer": {
                "type": "custom",
                "tokenizer": "lowercase",
                "filter": ["ramon_stopwords", "haystack_edgengram"]
            }
        },
        "tokenizer": {
            "haystack_ngram_tokenizer": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 15
            },
            "haystack_edgengram_tokenizer": {
                "type": "edgeNGram",
                "min_gram": 2,
                "max_gram": 15,
                "side": "front"
            }
        },
        "filter": {
            "haystack_ngram": {
                "type": "nGram",
                "min_gram": 3,
                "max_gram": 15
            },
            "haystack_edgengram": {
                "type": "edgeNGram",
                "min_gram": 2,
                "max_gram": 15
            },
            "ramon_stopwords": {
                "type": "stop",
                "stopwords": ["esto", "de", "que"]
            }
        }
    }
}
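Just to be explicit about what I expect ngram_analyzer to do, here is a rough
Python simulation of the chain (not Elasticsearch code, just my understanding
of the lowercase tokenizer → stop filter → nGram filter pipeline):

```python
import re

STOPWORDS = {"esto", "de", "que"}  # same words as in "ramon_stopwords" above

def lowercase_tokenizer(text):
    # Roughly mimics the "lowercase" tokenizer: split on non-letters, lowercase
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def stop_filter(tokens, stopwords=STOPWORDS):
    # Mimics the "stop" token filter: drop stopwords
    return [t for t in tokens if t not in stopwords]

def ngram_filter(tokens, min_gram=3, max_gram=15):
    # Mimics the "nGram" token filter: emit all substrings of allowed lengths;
    # tokens shorter than min_gram produce no grams at all
    grams = []
    for t in tokens:
        for n in range(min_gram, min(max_gram, len(t)) + 1):
            for i in range(len(t) - n + 1):
                grams.append(t[i:i + n])
    return grams

tokens = ngram_filter(stop_filter(lowercase_tokenizer("esto is a test que")))
print(tokens)  # ['tes', 'est', 'test'] — "esto" and "que" removed before n-gramming
```

So for "esto is a test que" I would expect "esto" and "que" to be gone before
the n-gram stage even runs.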
The settings look like this for the haystack index:
$ curl -XGET 'http://localhost:9200/haystack/_settings?pretty=true'
{
  "haystack" : {
    "settings" : {
      "index.analysis.filter.haystack_edgengram.min_gram" : "2",
      "index.analysis.filter.haystack_ngram.max_gram" : "15",
      "index.analysis.tokenizer.haystack_ngram_tokenizer.max_gram" : "15",
      "index.analysis.analyzer.edgengram_analyzer.type" : "custom",
      "index.analysis.tokenizer.haystack_edgengram_tokenizer.min_gram" : "2",
      "index.analysis.filter.ramon_stopwords.stopwords.2" : "que",
      "index.analysis.filter.ramon_stopwords.stopwords.1" : "de",
      "index.analysis.filter.ramon_stopwords.stopwords.0" : "esto",
      "index.analysis.tokenizer.haystack_ngram_tokenizer.min_gram" : "3",
      "index.analysis.analyzer.ngram_analyzer.tokenizer" : "lowercase",
      "index.analysis.filter.haystack_ngram.min_gram" : "3",
      "index.analysis.analyzer.edgengram_analyzer.tokenizer" : "lowercase",
      "index.analysis.filter.haystack_edgengram.max_gram" : "15",
      "index.analysis.filter.haystack_ngram.type" : "nGram",
      "index.analysis.analyzer.edgengram_analyzer.filter.1" : "haystack_edgengram",
      "index.analysis.tokenizer.haystack_edgengram_tokenizer.type" : "edgeNGram",
      "index.analysis.analyzer.edgengram_analyzer.filter.0" : "ramon_stopwords",
      "index.analysis.tokenizer.haystack_edgengram_tokenizer.side" : "front",
      "index.analysis.filter.ramon_stopwords.type" : "stop",
      "index.analysis.filter.haystack_edgengram.type" : "edgeNGram",
      "index.analysis.tokenizer.haystack_ngram_tokenizer.type" : "nGram",
      "index.analysis.tokenizer.haystack_edgengram_tokenizer.max_gram" : "15",
      "index.analysis.analyzer.ngram_analyzer.filter.1" : "haystack_ngram",
      "index.analysis.analyzer.ngram_analyzer.filter.0" : "ramon_stopwords",
      "index.analysis.analyzer.ngram_analyzer.type" : "custom",
      "index.number_of_shards" : "5",
      "index.number_of_replicas" : "1",
      "index.version.created" : "191199"
    }
  }
}
Which looks right to me. But when I test it, only the English stopwords are
applied, and the ones I added are ignored. See how "is" is filtered out here
while "esto" and "que" come through:
$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&pretty=true'
{
  "tokens" : [ {
    "token" : "esto",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "que",
    "start_offset" : 15,
    "end_offset" : 18,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}
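For what it's worth, those tokens look like what the standard analyzer with
the default English stopwords would produce, not what my custom analyzer
should produce. A quick Python sketch (using a hand-picked subset of the
default English stopword list, which I'm assuming is the relevant part)
reproduces exactly that output:

```python
# Small subset of the default English stopword list (enough for this example)
ENGLISH_STOPWORDS = {"a", "an", "and", "are", "as", "at", "is", "it", "the", "to"}

def standard_like_analyze(text):
    # Rough sketch of the default analyzer: lowercase, split on whitespace,
    # drop English stopwords; no n-gramming, no custom stopwords
    return [t for t in text.lower().split() if t not in ENGLISH_STOPWORDS]

print(standard_like_analyze("esto is a test que"))  # ['esto', 'test', 'que']
```

That matches the _analyze output above exactly, which makes me think my
custom analyzer is simply not being used for this request.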
The only way I've managed to change the stopwords is by specifying the
analyzer in the query; I have tried setting it in the settings too, but that
doesn't work either. For example, this query with the Spanish analyzer works:
$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test+que&analyzer=spanish&pretty=true'
{
  "tokens" : [ {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
Any hint as to where this might be failing?
Many thanks.