How to search with correct stemming?

Marco_Solari · February 15, 2024, 3:04pm

Sorry, I'm really new to Elasticsearch...
I'm trying out the basic functionality...

I created my first index (to be used for the Italian language):

PUT index_it
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
            "c",
            "l",
            "all",
            "dall",
            "dell",
            "nell",
            "sull",
            "coll",
            "pell",
            "gl",
            "agl",
            "dagl",
            "degl",
            "negl",
            "sugl",
            "un",
            "m",
            "t",
            "s",
            "v",
            "d"
          ],
          "articles_case": true
        },
        "italian_stop": {
          "type": "stop",
          "stopwords": "_italian_"
        },
        "italian_keywords": {
          "type": "keyword_marker",
          "keywords": [
            "esempio"
          ]
        },
        "italian_stemmer": {
          "type": "stemmer",
          "language": "italian"
        }
      },
      "analyzer": {
        "italian_full": {
          "tokenizer": "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "index_it"
}

and indexed one word ("torta", Italian for "cake"):

PUT index_it/_doc/1
{
  "title": "torta"
}

{
  "_index": "index_it",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

If I analyze that word, I see it's correctly "stemmed" as "tort":

POST index_it/_analyze
{
  "analyzer": "italian_full",
  "text": "torta"
}

{
  "tokens": [
    {
      "token": "tort",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

I'd expect to be able to search for both "torta" and "torte" (same word, plural).
I can find the singular form, but not the plural... :-/

GET index_it/_search
{
  "query": {
    "match": {
      "title": "torta"
    }
  }
}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index_it",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "title": "torta"
        }
      }
    ]
  }
}

But not "torte":

GET index_it/_search
{
  "query": {
    "simple_query_string": {
      "fields": [ "title" ],
      "query": "torte"
    }
  }
}

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

What am I missing? Should I specify the analyzer in the queries too?

demjened · February 15, 2024, 5:42pm

Hi Marco and welcome to the Elastic community!

You are on the right track, but in order for the title field to be analyzed with the italian_full analyzer (and support stemming in Italian), you need to explicitly specify that in the field's mapping when you create the index:

PUT index_it
{
  "settings": { ... }, // your settings as above
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "italian_full"
      }
    }
  }
}

This configures two things: 1. title values will be analyzed with italian_full at index time, rather than with the default standard analyzer, and 2. queries against title will also be analyzed with italian_full, so a search for "torte" is stemmed to the same token as the indexed "torta".
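One practical note, assuming index_it still exists from the first attempt: the analyzer of an existing field cannot be changed in place, and creating an index whose name is already taken fails. A minimal sketch of the reset step, assuming the index only holds test data that can be thrown away:

```console
# Assumption: index_it exists from the earlier attempt and holds only test data.
# This permanently deletes the index and every document in it.
DELETE index_it
```

After that, re-run the PUT index_it request with both the settings and the mappings, then index the document again.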

PUT index_it/_doc/1
{
  "title": "torta"
}

GET index_it/_search
{
  "query": {
    "match": {
      "title": "torte"
    }
  }
}

{
...
    "hits": [
      {
        "_index": "index_it",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "title": "torta"
        }
      }
    ]
}
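If a search still comes back empty, it may help to check which analyzer the field is really using. The sketch below assumes the index_it index and title field from this thread; both requests are standard Elasticsearch APIs, but the exact response shape depends on your version.

```console
# Show how the title field is mapped; an explicitly configured analyzer is listed here
GET index_it/_mapping/field/title

# Analyze text with whatever analyzer the title field is mapped to,
# instead of naming the analyzer directly
POST index_it/_analyze
{
  "field": "title",
  "text": "torte"
}
```

If the mapping was applied, the second request should return the same stemmed token that "torta" produces. If it returns "torte" unchanged, the field is still using the default analyzer.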

For more info, please check out this guide on specifying an analyzer.

If you want the Italian analyzer to apply to multiple fields in the index by default (e.g. to all text fields), consider using an index template.
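As a rough sketch of that idea: the template name italian_text, the index pattern index_it* and the dynamic template name below are all made up for illustration. To keep the example short it uses the built-in italian analyzer rather than the custom italian_full one; to use italian_full, the analysis settings from the question would need to go into the template's "settings" as well.

```console
# Hypothetical template: applies to new indices whose name matches index_it*
PUT _index_template/italian_text
{
  "index_patterns": ["index_it*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "strings_as_italian_text": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "analyzer": "italian"
            }
          }
        }
      ]
    }
  }
}
```

Index templates are only applied when an index is created, so an existing index is not affected. The dynamic template only covers string fields that are not explicitly mapped.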

Marco_Solari · February 15, 2024, 8:27pm

Thank you so much! It works like a charm... Wonderful!
It's also so nice to join such a lively and responsive community!

system · March 14, 2024, 8:27pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
