I want to do partial matching on text fields without using wildcards. (These fields also include Japanese text data, but more on that later; I want to nail the core functionality before getting into Japanese.)
Here are some example matches I need (search term --> [results]):
- "beatl" --> ["beatles", "beatl", ...]
- "ance" --> ["dance", "chance", ...]
- "bieb" --> ["Bieber", "Doobie Brothers", ...]
My experiments with wildcards:
- I made the fields non-analyzed and used wildcard queries.
- The performance was abysmal, since the queries I wrote included a leading asterisk (my data set is 70+ million documents).
- On top of that, I need to use multi_match over a set of text fields, so all in all wildcards did not work. (Roughly what I tried is sketched below.)
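For reference, this is roughly what the wildcard attempt looked like; artist_name and track_name are placeholder names for my actual fields, and since wildcard is a single-field query I had to spell out one clause per field:

"query": {
    "bool": {
        "should": [
            { "wildcard": { "artist_name": { "value": "*bieb*" } } },
            { "wildcard": { "track_name": { "value": "*bieb*" } } }
        ]
    }
}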
My experiments with n-grams:
- I set min_gram and max_gram to 3 and tested. Below is an example of the bad results I don't want to show up.
- "Beatl" --> ["atlas beaz feat", ...] (I can see why this showed up: it contains all the 3-grams from "Beatl". But I don't want such results. The query I'm running is sketched right after this list.)
Some thoughts on improving the n-gram search situation:
- I wonder if I can apply a post filter after receiving the results from the n-gram query; using wildcards there may not be as expensive.
- Can I give a higher score (or filter) to matches whose n-grams are overlapping/contiguous, so that results like "atlas beaz feat" do not show up? (A sketch of what I have in mind follows this list.)
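Putting both thoughts together, something like the request below is what I have in mind. The match_phrase clause only matches when the query's n-grams occur consecutively (i.e. the query is a real substring), so it should push contiguous matches up, and the wildcard sits in a post_filter in the hope that it only has to run over the n-gram candidates. Field names are placeholders, artist_name.raw stands for a not-analyzed sub-field I would have to add, and I have not verified that this is actually any cheaper:

"query": {
    "bool": {
        "must": [
            { "match": { "artist_name": { "query": "Beatl", "operator": "and" } } }
        ],
        "should": [
            { "match_phrase": { "artist_name": "Beatl" } }
        ]
    }
},
"post_filter": {
    "wildcard": { "artist_name.raw": { "value": "*beatl*" } }
}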
Is there any other way to make partial matching possible?
Here are my index settings with n-grams:
"settings":{
"analysis":{
"char_filter":{
"remove_spaces": {
"type": "pattern_replace",
"pattern": "(\\s)",
"replacement": ""
}
},
"tokenizer":{
"eng_ngram_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 3,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
],
"custom_token_chars": ["~","[","]","・"]
}
},
"filter":{
"custom_asciifolding":{
"type":"asciifolding",
"preserve_original":true
}
},
"analyzer":{
"eng_ngram_analyzer": {
"type": "custom",
"char_filter": [
"remove_spaces"
],
"tokenizer": "eng_ngram_tokenizer",
"filter": [
"lowercase",
"custom_asciifolding"
]
}
}
}
},
...
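For what it's worth, testing the analyzer with the _analyze API (my_index is a placeholder for my index name) shows what the searches actually work with; "Beatles" comes out as something like the 3-grams bea, eat, atl, tle, les, so any document sharing enough of those grams can come back:

GET my_index/_analyze
{
    "analyzer": "eng_ngram_analyzer",
    "text": "Beatles"
}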