Hello,
I have a problem searching for partial URLs in a text field. I'm using a word_delimiter filter to split possible URLs. Here are the settings and mappings:
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "catenate_words": false,
          "catenate_numbers": false,
          "split_on_numerics": false,
          "split_on_case_change": false
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "name": "possessive_english"
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_plural_stemmer": {
          "type": "stemmer",
          "name": "minimal_english"
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "my_word_delimiter",
            "lowercase",
            "english_possessive_stemmer",
            "english_plural_stemmer",
            "english_stop"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}
When running a search or aggregating data, this leads to unexpected results.
Let's consider the following document:
{
  "body": "www.google.co.uk hello com"
}
It generates six tokens: www, google, co, uk, hello, com.
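This is easy to check with the _analyze API (assuming the index is created as test, as in the mapping above):

GET test/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "www.google.co.uk hello com"
}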
If a user searches for "google.com", ES returns the document above: the query is analyzed into the tokens google and com, and both appear in the document. Even if this is technically correct, it is not what the user expects.
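For instance, a query along these lines matches the document, because match defaults to OR across the analyzed tokens:

GET test/_search
{
  "query": {
    "match": {
      "body": "google.com"
    }
  }
}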
So I was thinking of implementing a token filter that parses URLs.
The filter is supposed to generate the following tokens for the token "www.google.co.uk": www.google.co.uk (original), google.co.uk, google.
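One way to approximate that without writing a custom plugin might be the built-in pattern_capture token filter. This is only a sketch, and the regex is my own assumption, tuned to this example rather than to real-world URLs:

"url_parts": {
  "type": "pattern_capture",
  "preserve_original": true,
  "patterns": [
    "^www\\.(([a-z0-9-]+)\\..+)$"
  ]
}

For "www.google.co.uk" this would emit www.google.co.uk (the preserved original), google.co.uk (group 1) and google (group 2), while tokens that don't match any pattern pass through unchanged. One caveat: it would have to be coordinated with my_word_delimiter, which otherwise splits the URL on the dots before this filter ever sees it.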
Then at query time, I would use a simple analyser that doesn't tokenise the data. So if the user searches for "google" or "google.co.uk", they will get proper results.
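Concretely, something like this is what I have in mind (again just a sketch): a search analyser built from the keyword tokenizer, so the whole query string becomes a single lowercased term, attached to the field through search_analyzer. In the analysis settings:

"custom_search_analyzer": {
  "tokenizer": "keyword",
  "filter": ["lowercase"]
}

And in the mapping:

"body": {
  "type": "text",
  "analyzer": "custom_analyzer",
  "search_analyzer": "custom_search_analyzer"
}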
What do you think?