Terms Aggregation excluding last vowels using Spanish analyzer - Elasticsearch 6.4

hcanales · August 7, 2019, 4:04am

I am trying to get keywords from a bunch of tweets in Spanish. The problem is that the last vowel of most words in the response is removed. Any idea why this is happening?

Here is the query:

{
    "query": {
        "bool": {
            "must": {
                "term": {
                    "full_text_sentiment": "positive"
                }
            },
            "filter": {
                "range": {
                    "created_at": {
                        "gte": greaterThanTime,
                        "lte": lessThanTime
                    }
                }
            }
        }
    },
    "aggs": {
        "keywords": {
            "terms": { "field": "full_text_clean", "size": 10 }
        }
    }
}

The mapping for the field is the following:

"full_text_clean": {
    "type": "text",
    "analyzer": "spanish",
    "fielddata": true,
    "fielddata_frequency_filter": {
        "min": 0.1,
        "max": 1.0,
        "min_segment_size": 10
    },
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 512
        }
    }
}

And these are the buckets in the response:

[ { key: 'aquí', doc_count: 3 },
  { key: 'deport', doc_count: 3 },
  { key: 'informacion', doc_count: 3 },
  { key: '23', doc_count: 2 },
  { key: 'corazon', doc_count: 2 },
  { key: 'dios', doc_count: 2 },
  { key: 'mexic', doc_count: 2 },
  { key: 'mujer', doc_count: 2 },
  { key: 'quier', doc_count: 2 },
  { key: 'siempr', doc_count: 2 }]

Here "deport" should be "deporte", "mexic" should be "mexico", "quier" should be "quiero", etc.

Any idea what is happening?

Thank you!

spinscale · August 7, 2019, 7:13am

Hey,

Try using the analyze API with the spanish analyzer on some terms. The analyze API performs the same steps that are applied when a field of a document is split into terms, and it shows which terms are stored in the inverted index. Those are the terms that are also used in the aggregation response.

I suppose the spanish analyzer is playing a role here. You might want to test with the standard analyzer for comparison.

--Alex
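
The comparison suggested above could be set up as two `_analyze` request bodies that differ only in the analyzer name. The sketch below only builds and prints those bodies. The sample text and the helper name `analyze_body` are assumptions made for illustration, and sending the requests to a cluster is left out.

```python
import json


def analyze_body(analyzer: str, text: str) -> dict:
    """Build the JSON body for an _analyze request (hypothetical helper)."""
    return {"analyzer": analyzer, "text": text}


# Sample text is an assumption chosen for illustration.
sample = "Quiero deportes en México"

# One body per analyzer, so the token lists can be compared side by side.
bodies = {name: analyze_body(name, sample) for name in ("spanish", "standard")}

for name, body in bodies.items():
    # Each printed body can be pasted after "GET _analyze" in the console.
    print(name, json.dumps(body, ensure_ascii=False))
```

If the `standard` output keeps whole words while the `spanish` output returns shortened forms, the difference most likely comes from the extra token filters in the language analyzer.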

hcanales · August 7, 2019, 8:24pm

Hello Alexander,

Thank you very much for your response.
I already tried that and yes, it is because of the analyzer as I show below:

This is the test request:

GET _analyze
{
  "analyzer" : "spanish",
  "text" : "pero no puedo decir nada de deportes. Quiero que esto suceda en México"
}

And this is the response:

"tokens": [
    {
      "token": "pued",
      "start_offset": 8,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "decir",
      "start_offset": 14,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "deport",
      "start_offset": 28,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "quier",
      "start_offset": 38,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "suced",
      "start_offset": 54,
      "end_offset": 60,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "mexic",
      "start_offset": 64,
      "end_offset": 70,
      "type": "<ALPHANUM>",
      "position": 12
    }
  ]

The analyze API shows the same issue. Any idea how to solve it? I would like to keep using the analyzer to remove stop words, but with the words cut off like this I cannot build any functionality on the results.

-- Horacio

spinscale · August 8, 2019, 9:39am

You could use a stop analyzer with your own custom set of stopwords, or take a look at the stop token filter, which can be configured to use Spanish stopwords.

Hope that helps.

--Alex
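
As a rough sketch of the first option, the built-in `stop` analyzer accepts a `stopwords` parameter that can name a predefined list such as `_spanish_`. The analyzer name `spanish_stop_only` is a hypothetical label, and the dict below is only the index settings body, not something taken from this thread.

```python
import json

# Index settings defining a configured "stop" analyzer.
# "spanish_stop_only" is a hypothetical name used for illustration.
stop_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "spanish_stop_only": {
                    "type": "stop",
                    # Predefined Spanish stop word list shipped with Elasticsearch.
                    "stopwords": "_spanish_",
                }
            }
        }
    }
}

# Print the body that would be sent when creating the index.
print(json.dumps(stop_settings, indent=2))
```

One thing to keep in mind: the `stop` analyzer splits text on non-letter characters, so purely numeric terms such as `23` would probably not appear in the aggregation with this setup.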

hcanales · August 12, 2019, 3:31pm

It turns out that it is the stemmer in the Spanish analyzer. I just redefined the analyzer and skipped the stemmer section.
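
The redefined analyzer is not shown in the post. A minimal sketch of what it might look like, assuming the documented layout of the built-in `spanish` analyzer (standard tokenizer, lowercase, Spanish stop words, keyword marker, stemmer) with the stemmer left out, is below. The names `spanish_no_stem` and `spanish_stop` are hypothetical.

```python
import json

# Custom analyzer modelled on the built-in "spanish" analyzer,
# but without the stemmer token filter. Names are hypothetical.
analysis = {
    "filter": {
        # Stop filter using the predefined Spanish stop word list.
        "spanish_stop": {"type": "stop", "stopwords": "_spanish_"}
    },
    "analyzer": {
        "spanish_no_stem": {
            "type": "custom",
            "tokenizer": "standard",
            # No stemmer here, so terms keep their final vowels.
            "filter": ["lowercase", "spanish_stop"],
        }
    },
}

index_body = {"settings": {"analysis": analysis}}

# Print the body that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

The `full_text_clean` mapping would then reference `"analyzer": "spanish_no_stem"` instead of `"spanish"`. Existing documents would need to be reindexed, since terms are produced at index time.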

system · September 9, 2019, 3:31pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
