Beider_morse phonetic encoder silently fails when languageset not specified

Hi, I'm considering using Elasticsearch for a project and ran into an issue with beider_morse phonetic encoding. I need Beider-Morse automatic language detection, so based on the docs at https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic-token-filter.html I set up the following index without a languageset:

curl -XDELETE 'http://localhost:9200/phonetictest?pretty'
curl -XPUT 'http://localhost:9200/phonetictest?pretty' -d'{
  "settings": {
    "analysis": {
      "filter": {
        "beider_morse_filter": { 
          "type":    "phonetic",
          "encoder": "beider_morse",
          "name_type": "generic"
        }
      },
      "analyzer": {
        "my_beider_morse": {
          "tokenizer": "standard",
          "filter":    "beider_morse_filter" 
        }
      }
    }
  }
}'


curl -XGET 'http://localhost:9200/phonetictest/_analyze?pretty&analyzer=my_beider_morse' -d'ABADIAS'

This incorrectly returns the input token unchanged, with no phonetic encoding applied:

{
  "tokens" : [
    {
      "token" : "ABADIAS",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

Expected token list based on the current BMPM PHP code at http://stevemorse.org/phoneticinfo.htm :

abadias abadia abadios abadio abodias abodia abodios abodio abYdias abYdios avadias avadios avodias avodios obadias obadia obadios obadio obodias obodia obodios obodio obYdias obYdios ovadias ovadios ovodias ovodios Ybadias Ybadios Ybodias Ybodios YbYdias YbYdios abadiaS abadioS abodiaS abodioS obadiaS obadioS obodiaS obodioS
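
As the title implies, the problem seems tied to omitting languageset. For comparison, here is a minimal sketch of the same setup with languageset given explicitly. The value "spanish" is only an illustrative assumption (I have not confirmed which values a given plugin version accepts), and the Content-Type header and JSON body for _analyze are there so the requests also work on versions that require them.

```shell
# Sketch only: same filter as above, plus an explicit languageset.
# "spanish" is an illustrative assumption, not a recommendation.
ES='http://localhost:9200'

BODY='{
  "settings": {
    "analysis": {
      "filter": {
        "beider_morse_filter": {
          "type": "phonetic",
          "encoder": "beider_morse",
          "name_type": "generic",
          "languageset": ["spanish"]
        }
      },
      "analyzer": {
        "my_beider_morse": {
          "tokenizer": "standard",
          "filter": ["beider_morse_filter"]
        }
      }
    }
  }
}'

# Recreate the test index with the modified filter.
curl -s --max-time 5 -XDELETE "$ES/phonetictest?pretty" || echo "DELETE request failed"
curl -s --max-time 5 -XPUT "$ES/phonetictest?pretty" \
  -H 'Content-Type: application/json' -d "$BODY" || echo "PUT request failed"

# Analyze the same input as before.
curl -s --max-time 5 -XGET "$ES/phonetictest/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"analyzer": "my_beider_morse", "text": "ABADIAS"}' || echo "analyze request failed"
```

If this variant returns phonetic tokens while the original returns "ABADIAS" unchanged, that would narrow the problem down to the missing languageset rather than the rest of the configuration.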

Questions:

  1. How can I encode with automatic Beider Morse language detection?
  2. For verification before moving forward with the project, which version of BMPM is the implementation based on?

Thanks,
Ben

P.S. The documentation at https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-phonetic-token-filter.html has a mistake: "Comomon" is not a valid languageset value, and the corrected spelling "common" is not accepted either.

Thanks for opening an issue in our GitHub repo (https://github.com/elastic/elasticsearch/issues/26771). We will certainly have a look at it.

Thanks. I'm surprised there would be such a fundamental bug in a major phonetic analyser, so I wonder whether I am doing something wrong, or whether older versions exhibited this bug too. I would have expected someone to catch it by now, which is why I suspect the problem is on my end.
