Compound words handling

How can I get ES to search for both "data base" and "database" in the index when the query is either "database" or "data base"?
Another example: when the user queries "clean up" or "cleanup", ES should search for both "clean up" and "cleanup".

I tried decompounding using the following code:

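from elasticsearch import Elasticsearch

# Assumed local cluster; adjust the URL for your setup.
es = Elasticsearch("http://localhost:9200")

es.indices.create(
  index="zentest",
  body={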
    "settings": {
      "analysis": {
        "analyzer": {
          "standard_dictionary_decompound": {
            "tokenizer": "standard",
            "filter": [ "dictionary_decompound" ]
          }
        },
        "filter": {
          "dictionary_decompound": {
            "type": "dictionary_decompounder",
            "word_list_path": "decompound_words.txt",
            "max_subword_size": 22
          }
        }
      }
    }
  }
)

The .txt file has the following words, one per line:
database
cleanup

It didn't work. The text contains "clean up". When I query "clean up" I get a hit, but I get nothing when I query "cleanup".

Do I need to filter using synonyms?
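A note on the attempt above, offered as a sketch rather than a confirmed diagnosis. The dictionary decompounder looks for entries of its word list inside each token, so the list is expected to hold the parts ("clean", "up", "data", "base") rather than the whole compounds. It has no built-in idea of what an English compound is; it only knows the list it is given. The snippet in the post also defines the analyzer without a mapping that assigns it to a field, so it may never have been applied. The function below is a rough pure-Python approximation of the filter (it is not the Lucene implementation) using the documented defaults for the size limits. The field name `content` and the inline `word_list` are assumptions for illustration.

```python
def decompound(token, subwords, min_word_size=5, min_subword_size=2,
               max_subword_size=15):
    """Approximate a dictionary decompounder: keep the original token and
    append every word-list entry found inside it."""
    out = [token]
    if len(token) < min_word_size:
        return out  # short tokens are passed through untouched
    for start in range(len(token) - min_subword_size + 1):
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(token):
                break
            piece = token[start:start + size]
            if piece in subwords:
                out.append(piece)
    return out


# Word list from the post (whole compounds): no parts are emitted.
print(decompound("cleanup", {"database", "cleanup"}))
# Word list of parts: the compound also yields "clean" and "up".
print(decompound("cleanup", {"clean", "up", "data", "base"}))

# Hypothetical index body: parts listed inline, analyzer attached to an
# assumed text field named "content" (typeless mappings, ES 7 or later).
decompound_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "standard_dictionary_decompound": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "dictionary_decompound"],
                }
            },
            "filter": {
                "dictionary_decompound": {
                    "type": "dictionary_decompounder",
                    "word_list": ["clean", "up", "data", "base"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "standard_dictionary_decompound",
            }
        }
    },
}
```

With parts in the list, a query for "cleanup" is analyzed into "cleanup", "clean" and "up", which can match a document containing "clean up". Very short parts such as "up" will also match many unrelated documents, so this trades precision for recall.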

Synonyms would indeed be an alternative here.

Maybe you can explain why you went with the decompounder in the first place, to avoid any confusion.

--Alex

"Cleanup" and "database" could be considered closed compound words, just like "basketball". I also wanted to know what ES considers compound words in English.

I suppose I could use (cleanup, clean) and (database, data) as synonyms, though it seems hacky.
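A possibly less hacky variant of the synonym idea, sketched under assumptions: map the spaced and closed forms to each other directly with a `synonym_graph` filter applied at search time, which is the filter type meant for multi-word synonyms. The field name `content`, the filter name `compound_synonyms` and the analyzer name `compound_synonyms_search` are made up for illustration; only the index name `zentest` comes from the post. Typeless mappings (Elasticsearch 7 or later) are assumed.

```python
# Sketch only: every name except "zentest" is hypothetical.
synonym_index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "compound_synonyms": {
                    "type": "synonym_graph",
                    "synonyms": [
                        "clean up, cleanup",
                        "data base, database",
                    ],
                }
            },
            "analyzer": {
                "compound_synonyms_search": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "compound_synonyms"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                # Index with the plain analyzer, expand synonyms at query time.
                "analyzer": "standard",
                "search_analyzer": "compound_synonyms_search",
            }
        }
    },
}

# With a client like the one in the post (requires a running cluster):
# es.indices.create(index="zentest", body=synonym_index_body)
# es.search(index="zentest", body={"query": {"match": {"content": "cleanup"}}})
```

Expanding at search time means the synonym list can be changed without reindexing documents, at the cost of maintaining the list by hand.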

Using cleanup and clean as synonyms didn't work.
I get a hit when I query "clean" but none when I query "cleanup".
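When a filter seems to have no effect, one way to narrow it down is to ask the index which tokens an analyzer actually produces. The helper below is a minimal sketch around the client's `indices.analyze` call; the function name `analyzed_tokens` is invented here, and the `body=` style simply mirrors the call in the post (newer client versions may prefer keyword arguments). If the synonym or decompounder tokens show up here but searches still miss, one possible cause is that the field is not mapped to that analyzer.

```python
def analyzed_tokens(es, index, analyzer, text):
    """Return the token strings that `analyzer` on `index` emits for `text`."""
    resp = es.indices.analyze(
        index=index,
        body={"analyzer": analyzer, "text": text},
    )
    return [t["token"] for t in resp["tokens"]]


# Usage against a running cluster, with `es` as created in the post:
# print(analyzed_tokens(es, "zentest", "standard_dictionary_decompound", "cleanup"))
```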
