Multiple analyzers with stemmed synonyms

I want to create an index whose synonyms are matched on stemmed terms, and then reuse those synonyms in other analyzers.
A simplified example: I want the synonyms [beautiful, pretty, beauteous, gorgeous] to apply in multiple analyzers when searching for "beauty", since "beauty" and "beautiful" share the same stem:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stemmer" ],
  "text": "beautiful beauty"
}
{
  "tokens": [
    {
      "token": "beauti",  ...
    },
    {
      "token": "beauti", ...
    }
  ]
}

What I have so far is

PUT /test_synonyms
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms":  ["beautiful, pretty, beauteous, gorgeous"]
          },
          "my_metaphone": {
            "type": "phonetic",
            "encoder": "metaphone",
            "replace": true
          }
        }
      }
    }
  }
}
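
For reference, here is a sketch of how those filters could be wired into named analyzers, so the chains can be reused instead of being repeated inline in each request. The index name and the analyzer names are made up for illustration, and the phonetic filter assumes the analysis-phonetic plugin is installed:

```
PUT /test_synonyms_sketch
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonyms": {
            "type": "synonym",
            "synonyms": ["beautiful, pretty, beauteous, gorgeous"]
          },
          "my_metaphone": {
            "type": "phonetic",
            "encoder": "metaphone",
            "replace": true
          }
        },
        "analyzer": {
          "my_stemmed_synonyms": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["stemmer", "my_synonyms"]
          },
          "my_phonetic_synonyms": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["my_synonyms", "my_metaphone"]
          }
        }
      }
    }
  }
}
```

A named analyzer can then be tested with "analyzer": "my_stemmed_synonyms" in the _analyze request body instead of listing the tokenizer and filters each time.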

Stemmer analyzer gives me:

GET /test_synonyms/_analyze
{
  "tokenizer": "standard",
  "filter": ["stemmer", "my_synonyms"],
  "text": "beauty"
}
{
  "tokens": [
    {
      "token": "beauti",  ...
    },
    {
      "token": "pretti", ...
    },
    {
      "token": "beauteou", ...
    },
    {
      "token": "gorgeou", ...
    }
  ]
}

Phonetic analyzer gives me:

GET /test_synonyms/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_synonyms", "my_metaphone"],
  "text": "beauty"
}
{
  "tokens": [
    {
      "token": "BT",  ...
    }
  ]
}

But "BT" doesn't match any of the tokens produced for "beautiful":

GET /test_synonyms/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_synonyms", "my_metaphone"],
  "text": "beautiful"
}
{
  "tokens": [
    {
      "token": "BTFL",  ... /*beautiful*/
    },
    {
      "token": "PRT", ... /*pretty*/
    },
    {
      "token": "BTS", ... /*beauteous*/
    },
    {
      "token": "KRJS", ... /*gorgeous*/
    }
  ]
}

I was wondering if there is a way to return the exact synonym words (not their stems) while still using the stemmer to find them, and then use the result with other analyzers. In other words, something that gives me the response above when searching for "beauty".

I tried to use the stemmer and phonetic filters together, but it gives me:

GET /test_synonyms/_analyze
{
  "tokenizer": "standard",
  "filter": ["stemmer", "my_synonyms", "my_metaphone"],
  "text": "beauty" /*or beautiful (equal responses)*/
}
{
  "tokens": [
    {
      "token": "BT", ... /*beauti*/
    },
    {
      "token": "PRT", ... /*pretti*/
    },
    {
      "token": "BT", ... /*beauteou*/
    },
    {
      "token": "KRJ", ... /*gorgeou*/
    }
  ]
}

This isn't what I really want either, because when I search for "beautiful" and for "beauty", the number of documents returned is different ("beautiful" also scores the phonetic matches), and I want the two searches to return the same results.

I don't understand why you would want to do that. If you index your documents using the stemmer, documents with "beauteous" in the input will have the stemmed version written to the index. When you search for them later, e.g. via synonym expansion, you want the same stemmer applied to the query terms; otherwise you will not match the intended documents.

Specifically:

"a gorgeous boat" will index "gorgeou" when using a stemmer.
"beauty" at search time has to expand to "gorgeou"; otherwise it wouldn't match the document.
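
As a rough sketch of that setup (not a tested configuration): the stemmer runs at index time, and the search analyzer applies the same stemmer before expanding synonyms. The index name, the analyzer names and the "description" field are hypothetical, and the typeless "mappings" body assumes Elasticsearch 7 or later:

```
PUT /stemmed_synonyms_sketch
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["beautiful, pretty, beauteous, gorgeous"]
        }
      },
      "analyzer": {
        "index_stemmed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["stemmer"]
        },
        "search_stemmed_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["stemmer", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "index_stemmed",
        "search_analyzer": "search_stemmed_synonyms"
      }
    }
  }
}
```

Because the synonym filter sits after the stemmer, the synonym rules appear to be stemmed as well (the _analyze output earlier in the thread shows "pretti" and "gorgeou"), so the expanded query terms line up with what the index-time stemmer wrote.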

Am I missing something?

Hi @cbuescher, thank you for your reply. I'm sorry, my final goal was not as simple as I made it look. I'm new to Elasticsearch, and my problem is related to specific Portuguese cases. I updated my question! Please let me know if it makes a bit more sense now, or if I'm going in the wrong direction.
