I need "wifi" to find "Wi-Fi"

I've really been scratching my head on this, but nothing seems to be
working. Everything else behaves as expected, but what I want now is
for a search for "wifi" to match results containing "Wi-Fi".

I have the word_delimiter filter enabled and have made all the
recommended changes, but I'm still not seeing the expected results.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "translation_index_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "lowercase", "translation_ngram", "my_word_delimiter" ]
        },
        ...
      },
      "filter": {
        "translation_ngram": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 10
        },
        "my_word_delimiter": {
          "type": "word_delimiter",
          "catenate_words": true,
          "split_on_case_change": false,
          "preserve_original": true
        }
      }
    }
  }
}

From what I've read, "catenate_words" should turn "wi-fi" into "wifi" when
analyzed, and "lowercase" should take care of the capital "W" and "F"
in "Wi-Fi". I thought "split_on_case_change" might be the issue,
but I've set that to false too.
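
To spell out my expectation (this is just how I understand the docs, not
verified output): on a single lowercased token, word_delimiter with these
settings should expand

    "wi-fi"  ->  "wi-fi" (preserve_original), "wi", "fi", "wifi" (catenate_words)

so a search for "wifi" ought to match.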

What am I missing here? Any help would be appreciated. Thanks.


Hi Raja,

You have a couple of issues. First, you use the standard tokenizer, which
splits words on a "-", so you already lose the "Wi-Fi" combination there.
Next, the tokens are passed through an ngram filter, which chops them up
even more (which may or may not be what you want). By the time
word_delimiter runs, there is no single "wi-fi" token left for
catenate_words to join back together.

You can see the result using the _analyze API:

curl -XGET "http://localhost:9200/test/_analyze/?text=Wi-Fi&analyzer=translation_index_analyzer"

With these settings (using a whitespace tokenizer, removing the ngram
filter from the analyzer chain and tweaking the word_delimiter a bit),
you'd see the difference:

curl -XPUT "http://localhost:9200/test" -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "translation_index_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "standard",
            "lowercase",
            "my_word_delimiter"
          ]
        }
      },
      "filter": {
        "translation_ngram": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 10
        },
        "my_word_delimiter": {
          "type": "word_delimiter",
          "catenate_words": true,
          "split_on_case_change": false
        }
      }
    }
  }
}'
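
Then run the same _analyze call again:

curl -XGET "http://localhost:9200/test/_analyze/?text=Wi-Fi&analyzer=translation_index_analyzer"

With these settings I'd expect roughly (again a sketch, the exact output
depends on your version):

    wi, fi, wifi

The whitespace tokenizer keeps "Wi-Fi" as one token, lowercase turns it
into "wi-fi", and word_delimiter splits it into "wi" and "fi" while
catenate_words also emits the joined "wifi", which is exactly what a
search for "wifi" needs to match.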

Cheers,
Boaz


Hi Boaz,

Thank you so much. That worked like a charm! I can't believe I missed that.
I really appreciate the quick response :)

Cheers,
Raja
