Mapping ignoring field analyzer setting

racedo · November 28, 2012, 3:17pm

Hi all,

I have the following mapping where text has the spanish analyzer:

$ curl -XGET 'localhost:9200/haystack/_mapping?pretty=true'
{
  "haystack" : {
    "modelresult" : {
      "properties" : {
        "content_auto" : {
          "type" : "string",
          "analyzer" : "edgengram_analyzer",
          "store" : "yes",
          "term_vector" : "with_positions_offsets"
        },
        "django_ct" : {
          "type" : "string"
        },
        "django_id" : {
          "type" : "string"
        },
        "id" : {
          "type" : "string"
        },
        "pub_date" : {
          "type" : "date",
          "index" : "analyzed",
          "store" : "yes",
          "format" : "dateOptionalTime"
        },
        "text" : {
          "type" : "string",
          "analyzer" : "spanish",
          "store" : "yes",
          "term_vector" : "with_positions_offsets"
        },
        "title" : {
          "type" : "string",
          "boost" : 1.5,
          "analyzer" : "spanish",
          "store" : "yes",
          "term_vector" : "with_positions_offsets"
        }
      }
    }
  }
}

But it seems to be ignored ("esto" is a stopword in the spanish analyzer):

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&pretty=true&explain=true'
{
  "tokens" : [ {
    "token" : "esto",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]

But if I specify the analyzer directly then it works:

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&pretty=true&explain=true&analyzer=spanish'
{
  "tokens" : [ {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "test",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]

Any idea what I am doing wrong? If I specify my own analyzers in the
settings, it doesn't work either.

Many thanks.

--

Clinton_Gormley · November 28, 2012, 3:21pm

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&pretty=true&explain=true'

using 'text=' doesn't imply the 'text' field. it's just the name of the
parameter that you use to pass it text to be analyzed. you also have to
specify the analyzer (as you do in your second example)

clint

--

racedo · November 28, 2012, 3:33pm

$ curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&pretty=true&explain=true'

using 'text=' doesn't imply the 'text' field. it's just the name of the
parameter that you use to pass it text to be analyzed. you also have to
specify the analyzer (as you do in your second example)

Hi, I can see that it's using the default analyzer for any query. If I use the more_like_this query, I can see that it's not using the Spanish stop words either. From my tests, this indicates that it is using the default analyzer.

I thought that if I specify a mapping with an analyzer for a field, it should use it for searching and indexing, as per [1]:

"The analysis module allows one to register TokenFilters, Tokenizers and Analyzers under logical names which can then be referenced either in mapping definitions or in certain APIs"

The only way I have managed so far to use a particular analyzer is by changing the default analyzer, but this won't allow me to have different analyzers for different fields.

[1] Elasticsearch Platform — Find real-time answers at scale | Elastic

--

Clinton_Gormley · November 28, 2012, 3:36pm

Hi Ramon

Hi, I can see that it's using the default analyzer for any query, if I
use the more_like_this query I see how it's not using the stop words
for Spanish either. This indicates that it is using the default
analyzer after my tests.

I thought that if I specify a mapping with an analyzer for a field it
should use it for searching and indexing as per [1]:

It depends what field you are searching on. If you search the _all field
then it uses its own analyzer. If you search the 'text' field, then it
will use the spanish analyzer

clint
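
To make the field distinction concrete, here is a small sketch that builds the two query-string search URLs being contrasted: one scoped to the 'text' field and one scoped to '_all'. The helper name search_url is hypothetical; the host, index and type names are the ones used elsewhere in this thread, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def search_url(base, index, doc_type, field, term):
    """Build a query-string search URL scoped to one field (hypothetical helper)."""
    # "field:term" is the query-string syntax used later in this thread (q=text:de).
    query = urlencode({"q": f"{field}:{term}", "pretty": "true"})
    return f"{base}/{index}/{doc_type}/_search?{query}"


base = "http://localhost:9200"
text_url = search_url(base, "haystack", "modelresult", "text", "de")
all_url = search_url(base, "haystack", "modelresult", "_all", "de")
print(text_url)
print(all_url)
```

Per the explanation above, the first URL would be analyzed with the analyzer mapped to 'text' (spanish here), the second with whatever analyzer '_all' uses. The ':' is percent-encoded by urlencode, which the server decodes back to 'text:de'.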

--

racedo · November 28, 2012, 4:03pm

It depends what field you are searching on. If you search the _all field
then it uses its own analyzer. If you search the 'text' field, then it
will use the spanish analyzer

That's right, it works with queries as opposed to analyses (as you
suggested). If I search for "de" (a stop word), I can see it is not indexed:

$ curl -XGET 'http://localhost:9200/haystack/modelresult/_search?q=text:de&pretty=true'

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }

The "analyze" query below still returns it because, as you said in your
previous reply, 'text' there is the request parameter and not the text field:

$ curl -XGET 'localhost:9200/haystack/_analyze?text=de&pretty=true'

{
  "tokens" : [ {
    "token" : "de",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]

The big problem for me is that more_like_this is not working with the
spanish analyzer. In this example I can see that it uses Spanish stop words,
and all the results come out wrong because of this. In the example "de" is
used for scoring, and my goal is to get more_like_this to use an analyzer of
my choice:

curl -XGET 'http://localhost:9200/haystack/modelresult/_search?analyzer=spanish&pretty=true' -d '
{
  "explain" : true,
  "query" : {
    "more_like_this" : {
      "like_text" : "De Guindos anuncia que los bancos nacionalizados recibirán 37.000 millones del préstamo europeo\nLa inyección de capital que los bancos españoles nacionalizados (Bankia, Novacaixagalicia, Caixa de Catalunya y Banco de Valencia)"
    }
  }
}
'

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 335,
    "max_score" : 0.45862257,
    "hits" : [ {
      "_shard" : 2,
      "_node" : "m13ZuAepT2C94F95zw9kgQ",
      "_index" : "haystack",
      "_type" : "modelresult",
      "_id" : "frontpage.article.266",
      "_score" : 0.45862257, "_source" : {"django_id": "266", "title": "Los
sindicatos de Iberia estudian un calendario de huelga contra el ERE",
"text": "Los sindicatos de Iberia estudian un calendario de huelga contra
el ERE\nFuentes sindicales aseguran que los posibles paros en diciembre no
afectar\u00edan a los puentes ni a las vacaciones de Navidad. 
Leer .  Escuchar\n", "django_ct": "frontpage.article",
"content_auto": "Fuentes sindicales aseguran que los posibles paros en
diciembre no afectar\u00edan a los puentes ni a las vacaciones de
Navidad. Leer . Escuchar", "pub_date":
"2012-11-26T22:40:01+00:00", "id": "frontpage.article.266"},
      "_explanation" : {
        "value" : 0.45862257,
        "description" : "sum of:",
        "details" : [ {
          "value" : 0.13871443,
          "description" : "weight(_all:de in 52), product of:",
          "details" : [ {
            "value" : 0.5100887,
            "description" : "queryWeight(_all:de), product of:",
            "details" : [ {
              "value" : 1.0150379,
              "description" : "idf(docFreq=65, maxDocs=67)"
            }, {
              "value" : 0.50253165,
              "description" : "queryNorm"
            } ]
          },


How can I force the more_like_this results to search using an analyzer of
my choice? Or does more_like_this only use the default analyzer?

Thanks again.

--

racedo · November 29, 2012, 12:30am

How can I force the more_like_this results to search using an analyzer of
my choice? Or does more_like_this only use the default analyzer?

After some more trial and error I think the only way is to define the
default analyzer. It looks like the more_like_this queries ignore the per
field set analyzers and just use the default one. After setting the default
one it behaves as I want.

Thanks
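
For readers following along, a sketch of what a default-analyzer setting might look like when creating the index. The exact settings used here were not posted, so this is an assumption: it relies on the convention that an analyzer registered under the name "default" is used for fields with no analyzer of their own (including _all). The helper name default_analyzer_settings is hypothetical.

```python
import json


def default_analyzer_settings(analyzer_type):
    """Index-creation body registering `analyzer_type` as the "default" analyzer.

    Assumption: this mirrors the workaround described above; the original
    settings were not shown in the thread.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {"type": analyzer_type}
                }
            }
        }
    }


settings = default_analyzer_settings("spanish")
# The JSON printed here could be passed as the body when creating the index.
print(json.dumps(settings, indent=2))
```

The trade-off is the one already mentioned: every field without an explicit analyzer now gets the same analysis, so this does not help when different fields need different languages.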

--

Igor_Motov · November 29, 2012, 12:07pm

By default more_like_this is searching the special "_all" field, which is
indexed and searched by the default analyzer, so what you did
works. Alternatively, you could just specify "text" in the "fields"
parameter of the more_like_this request, and search the "text" field
instead.
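
A minimal sketch of that second option: the more_like_this body with an explicit "fields" list, built in Python so it can be dumped to JSON and passed to curl with -d. The helper name build_mlt_query is hypothetical; "like_text" and "fields" are the parameter names used in this thread, and the sample text is shortened from the request quoted earlier.

```python
import json


def build_mlt_query(like_text, fields=None):
    """Build a more_like_this search body (hypothetical helper).

    When `fields` is omitted the query falls back to its default field,
    which per the explanation above is "_all".
    """
    mlt = {"like_text": like_text}
    if fields:
        mlt["fields"] = list(fields)
    return {"query": {"more_like_this": mlt}}


body = build_mlt_query(
    "De Guindos anuncia que los bancos nacionalizados",
    fields=["text"],
)
# ensure_ascii=False keeps accented characters readable in the output.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

With "fields" set to ["text"], the comparison runs against the 'text' field and so should pick up the spanish analyzer from the mapping instead of the default one.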

Similarly, you can specify a field on the Analyze API query to see how it
will be analyzed:

curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&field=text&pretty=true'
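
The three Analyze API variants that came up in this thread (no analyzer, explicit analyzer, explicit field) differ only in their query parameters. A small sketch that builds each URL, assuming the parameter names shown in the curl commands here; the helper name analyze_url is hypothetical and no request is actually sent.

```python
from urllib.parse import urlencode


def analyze_url(base, index, text, field=None, analyzer=None):
    """Build an _analyze URL; `text` is the input, not a field name."""
    params = {"text": text}
    if field:
        params["field"] = field        # use the analyzer mapped to this field
    if analyzer:
        params["analyzer"] = analyzer  # or name an analyzer directly
    params["pretty"] = "true"
    return f"{base}/{index}/_analyze?{urlencode(params)}"


base = "http://localhost:9200"
default_url = analyze_url(base, "haystack", "esto is a test")
field_url = analyze_url(base, "haystack", "esto is a test", field="text")
named_url = analyze_url(base, "haystack", "esto is a test", analyzer="spanish")
for url in (default_url, field_url, named_url):
    print(url)
```

Only the last two should drop "esto" as a stop word; the first one goes through the index's default analyzer, which is what caused the confusion at the start of the thread.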

--

racedo · November 29, 2012, 12:17pm

By default more_like_this is searching the special "_all" field, which is indexed and searched by the default analyzer, so what you did works. Alternatively, you could just specify "text" in the "fields" parameter of the more_like_this request, and search the "text" field instead.

Similarly, you can specify a field on the Analyze API query to see how it will be analyzed:

curl -XGET 'localhost:9200/haystack/_analyze?text=esto+is+a+test&field=text&pretty=true'

Thanks for your insight Igor, this is exactly what I was looking for!

Regards.

Ramon

--
