Terms Aggregation excluding last vowels using Spanish analyzer - Elasticsearch 6.4

I am trying to extract keywords from a set of tweets in Spanish. The problem is that the last vowel of most words in the response is removed. Any idea why this is happening?

Here is the query:

{
  "query": {
    "bool": {
      "must": {
        "terms": {
          "full_text_sentiment": ["positive"]
        }
      },
      "filter": {
        "range": {
          "created_at": {
            "gte": greaterThanTime,
            "lte": lessThanTime
          }
        }
      }
    }
  },
  "aggs": {
    "keywords": {
      "terms": { "field": "full_text_clean", "size": 10 }
    }
  }
}

The mapping is the following for the field:

"full_text_clean": {
  "type": "text",
  "analyzer": "spanish",
  "fielddata": true,
  "fielddata_frequency_filter": {
    "min": 0.1,
    "max": 1.0,
    "min_segment_size": 10
  },
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 512
    }
  }
}

And this is the buckets in the response:

[ { key: 'aquí', doc_count: 3 },
  { key: 'deport', doc_count: 3 },
  { key: 'informacion', doc_count: 3 },
  { key: '23', doc_count: 2 },
  { key: 'corazon', doc_count: 2 },
  { key: 'dios', doc_count: 2 },
  { key: 'mexic', doc_count: 2 },
  { key: 'mujer', doc_count: 2 },
  { key: 'quier', doc_count: 2 },
  { key: 'siempr', doc_count: 2 }]

where "deport" should be "deporte", "mexic" should be "mexico", "quier" should be "quiero", etc.

Any idea what is happening?

Thank you!

Hey,

Try the analyze API with the spanish analyzer on some terms. It runs the same steps that are applied when a document field is split into terms, and it shows which terms end up in the inverted index. Those stored terms are the ones returned in the aggregation response.

I suppose the spanish analyzer is playing a role here. You might want to test with the standard analyzer for comparison.
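
For the comparison, a request body along these lines could be sent to GET _analyze. The sample sentence is only an illustration. Since the standard analyzer tokenizes and lowercases but does not stem, words such as "deportes" should come back whole.

```json
{
  "analyzer": "standard",
  "text": "pero no puedo decir nada de deportes. Quiero que esto suceda en México"
}
```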

--Alex

Hello Alexander,

Thank you very much for your response.
I already tried that and yes, it is caused by the analyzer, as I show below:

This is the test request:

GET _analyze
{
  "analyzer" : "spanish",
  "text" : "pero no puedo decir nada de deportes. Quiero que esto suceda en México"
}

And this is the response:

"tokens": [
    {
      "token": "pued",
      "start_offset": 8,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "decir",
      "start_offset": 14,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "deport",
      "start_offset": 28,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "quier",
      "start_offset": 38,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "suced",
      "start_offset": 54,
      "end_offset": 60,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "mexic",
      "start_offset": 64,
      "end_offset": 70,
      "type": "<ALPHANUM>",
      "position": 12
    }
  ]

It shows the same issue. Any idea how to solve it? I would like to keep using the analyzer in order to remove stop words, but with this issue I cannot release any functionality.

-- Horacio

You could use a stop analyzer with your own custom set of stopwords, or take a look at the stop token filter, which can be configured to use Spanish stopwords.
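
One way to try the stop token filter before touching any index is to pass an ad hoc filter chain to GET _analyze. The body below is a sketch: it combines the standard tokenizer, the lowercase filter, and a stop filter using the predefined "_spanish_" list, with no stemmer. With this chain the stop words should still be dropped while the remaining tokens keep their endings.

```json
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "stop", "stopwords": "_spanish_" }
  ],
  "text": "pero no puedo decir nada de deportes. Quiero que esto suceda en México"
}
```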

Hope that helps.

--Alex

It turns out that the stemmer in the Spanish analyzer is the cause. I redefined the analyzer and left out the stemmer section.
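
The exact settings used were not posted, so the following is only a sketch of what such a redefinition might look like, based on the documented building blocks of the spanish analyzer (standard tokenizer, lowercase, Spanish stop words, stemmer) with the stemmer left out. The analyzer name "spanish_no_stem" and the filter name "spanish_stop" are hypothetical, and the body is meant for an index-creation request such as PUT on a new index.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        }
      },
      "analyzer": {
        "spanish_no_stem": {
          "tokenizer": "standard",
          "filter": ["lowercase", "spanish_stop"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "full_text_clean": {
          "type": "text",
          "analyzer": "spanish_no_stem",
          "fielddata": true
        }
      }
    }
  }
}
```

The analyzer of an existing field cannot be changed in place, so existing documents would need to be reindexed into an index created with these settings. Judging by the earlier output ("mexic", "corazon"), the stemmer also seems to strip accents, so without it tokens would presumably keep them; an asciifolding filter could be added to the chain if unaccented keys are preferred.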
