Mapping ignoring field analyzer setting

Hi all,

I have the following mapping where text has the spanish analyzer:

$ curl -XGET 'localhost:9200/haystack/_mapping?pretty=true'
{
  "haystack" : {
    "modelresult" : {
      "properties" : {
        "content_auto" : {
          "type" : "string",
          "analyzer" : "edgengram_analyzer",
          "store" : "yes",
          "term_vector" : "with_positions_offsets"
        },
        "django_ct" : {
          "type" : "string"
        },
        "django_id" : {
          "type" : "string"
        },
        "id" : {
          "type" : "string"
        },
        "pub_date" : {
          "type" : "date",
          "index" : "analyzed",
          "store" : "yes",
          "format" : "dateOptionalTime"
        },
        "text" : {
          "type" : "string",
          "analyzer" : "spanish",
          "store" : "yes",
          "term_vector" : "with_positions_offsets"
        },
        "title" : {
          "type" : "string",
          "boost" : 1.5,
          "analyzer" : "spanish",
          "store" : "yes",
          "term_vector" : "with_positions_offsets"
        }
      }
    }
  }
}

But it seems to be ignored ("esto" is a stopword in the spanish analyzer):

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&pretty=true&explain=true'
{
  "tokens" : [ {
    "token" : "esto",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "",
    "position" : 1
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "",
    "position" : 4
  } ]
}

But if I specify the analyzer directly then it works:

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&pretty=true&explain=true&analyzer=spanish'
{
  "tokens" : [ {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "",
    "position" : 2
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "",
    "position" : 4
  } ]
}

Any idea what I am doing wrong? If I specify my own analyzers in the
settings, it doesn't work either.

Many thanks.

--

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&pretty=true&explain=true'

Using 'text=' doesn't imply the 'text' field. It's just the name of the
parameter that you use to pass in the text to be analyzed. You also have to
specify the analyzer (as you do in your second example).

clint

--

Hi, after my tests I can see that it's using the default analyzer for any query. If I use the more_like_this query, it doesn't apply the Spanish stop words either, which indicates that it is using the default analyzer.

I thought that if I specify a mapping with an analyzer for a field it should use it for searching and indexing as per [1]:

"The analysis module allows one to register TokenFilters, Tokenizers and Analyzers under logical names which can then be referenced either in mapping definitions or in certain APIs"

The only way I have managed to use a particular analyzer so far is by changing the default analyzer, but this won't let me have different analyzers for different fields.

[1] http://www.elasticsearch.org/guide/reference/index-modules/analysis/

--

Hi Ramon

It depends on which field you are searching. If you search the _all field,
it uses its own analyzer. If you search the 'text' field, it will use the
spanish analyzer.

clint

--

That's right, it works with queries as opposed to analyses (as you
suggested). If I search for "de" (a stop word), I can see it is not indexed:

$ curl -XGET 'http://localhost:9200/haystack/modelresult/_search?q=text:de&pretty=true'
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

The same does not happen here because, as you explained in your previous
reply, "text" in this "analyze" request is only the parameter name, not the
field text:

$ curl -XGET 'localhost:9200/haystack/_analyze?text=de&pretty=true'
{
  "tokens" : [ {
    "token" : "de",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

The big problem for me is that more_like_this is not working with the
spanish analyzer. In this example I can see that it uses Spanish stop words
as query terms and gets all the results wrong because of it. Here "de" is
used for scoring, and my goal is to run more_like_this with an analyzer of
my choice:

curl -XGET 'http://localhost:9200/haystack/modelresult/_search?analyzer=spanish&pretty=true' -d '
{ "explain": true,
  "query" : {
    "more_like_this" : {
      "like_text" : "De Guindos anuncia que los bancos nacionalizados recibirán 37.000 millones del préstamo europeo\nLa inyección de capital que los bancos españoles nacionalizados (Bankia, Novacaixagalicia, Caixa de Catalunya y Banco de Valencia)"
    }
  }
}
'

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 335,
    "max_score" : 0.45862257,
    "hits" : [ {
      "_shard" : 2,
      "_node" : "m13ZuAepT2C94F95zw9kgQ",
      "_index" : "haystack",
      "_type" : "modelresult",
      "_id" : "frontpage.article.266",
      "_score" : 0.45862257, "_source" : {"django_id": "266", "title": "Los sindicatos de Iberia estudian un calendario de huelga contra el ERE", "text": "Los sindicatos de Iberia estudian un calendario de huelga contra el ERE\nFuentes sindicales aseguran que los posibles paros en diciembre no afectar\u00edan a los puentes ni a las vacaciones de Navidad.&#160; Leer .&#160; Escuchar\n", "django_ct": "frontpage.article", "content_auto": "Fuentes sindicales aseguran que los posibles paros en diciembre no afectar\u00edan a los puentes ni a las vacaciones de Navidad.  Leer .  Escuchar", "pub_date": "2012-11-26T22:40:01+00:00", "id": "frontpage.article.266"},
      "_explanation" : {
        "value" : 0.45862257,
        "description" : "sum of:",
        "details" : [ {
          "value" : 0.13871443,
          "description" : "weight(_all:de in 52), product of:",
          "details" : [ {
            "value" : 0.5100887,
            "description" : "queryWeight(_all:de), product of:",
            "details" : [ {
              "value" : 1.0150379,
              "description" : "idf(docFreq=65, maxDocs=67)"
            }, {
              "value" : 0.50253165,
              "description" : "queryNorm"
            } ]
          },

How can I force the more_like_this results to search using an analyzer of
my choice? Or does more_like_this only use the default analyzer?

Thanks again.

--

After some more trial and error, I think the only way is to define the
default analyzer. It looks like more_like_this queries ignore the per-field
analyzers and just use the default one. After setting the default analyzer,
it behaves as I want.
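
For anyone following along, a minimal sketch of what defining the default analyzer can look like. It assumes the index is created from scratch with these settings and that the built-in spanish analyzer is the one wanted as "default". The names SETTINGS and create_index are only illustrative, and nothing is sent until create_index is called:

```shell
# Index settings that register the built-in "spanish" analyzer under the
# special name "default", so fields without an explicit analyzer (and _all)
# are analyzed with it. Assumption: the index does not exist yet.
SETTINGS='{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": { "type": "spanish" }
      }
    }
  }
}'

# Hypothetical helper: create the "haystack" index with the settings above.
create_index() {
  curl -XPUT 'localhost:9200/haystack' -d "$SETTINGS"
}
```

Changing analysis settings on an index that already holds documents generally means re-applying the mapping and reindexing, since text that is already indexed keeps the tokens it was indexed with.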

Thanks

--

By default more_like_this is searching the special "_all" field, which is
indexed and searched by the default analyzer, so what you did
works. Alternatively, you could just specify "text" in the "fields"
parameter of the more_like_this request, and search the "text" field
instead.

Similarly, you can specify a field on the Analyze API query to see how it
will be analyzed:

curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&field=text&pretty=true'
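
A minimal sketch of the "fields" suggestion for more_like_this, reusing the index and type names from this thread and a shortened like_text. MLT_QUERY and search_similar are illustrative names only, and the request is not sent until search_similar is called:

```shell
# more_like_this restricted to the "text" field, so that field's own
# analyzer (spanish in the mapping above) applies instead of the one
# used for _all.
MLT_QUERY='{
  "query": {
    "more_like_this": {
      "fields": ["text"],
      "like_text": "De Guindos anuncia que los bancos nacionalizados recibirán 37.000 millones del préstamo europeo"
    }
  }
}'

# Hypothetical helper: run the query against the index from this thread.
search_similar() {
  curl -XGET 'http://localhost:9200/haystack/modelresult/_search?pretty=true' -d "$MLT_QUERY"
}
```

Adding "explain": true to the body, as in the earlier request, should then show weights on text:... terms rather than _all:... terms if the field restriction is taking effect.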

On Wednesday, November 28, 2012 7:30:22 PM UTC-5, racedo wrote:

How can I force the more_like_this results to search using an analyzer of
my choice? Or does more_like_this only use the default analyzer?

After some more trial and error I think the only way is to define the
default analyzer. It looks like the more_like_this queries ignore the per
field set analyzers and just use the default one. After setting the default
one it behaves as I want.

Thanks

--

Thanks for your insight Igor, this is exactly what I was looking for!

Regards.

Ramon

--