Hey everyone,
does the uax_url_email tokenizer not work with bare URLs that contain digits, like "google24.com"? I'm wondering whether others see the same results, and what workarounds you would recommend.
Goal: tokenize "google24.com" as a single token
Result: I get two tokens: "google24" and "com"
input (fail):
curl -XPOST 'localhost:9200/_analyze?pretty' -d'
{
"tokenizer": "uax_url_email",
"text": "google24.com"
}'
output:
{
"tokens" : [ {
"token" : "google24",
"start_offset" : 0,
"end_offset" : 8,
"type" : "<ALPHANUM>",
"position" : 0
}, {
"token" : "com",
"start_offset" : 9,
"end_offset" : 12,
"type" : "<ALPHANUM>",
"position" : 1
} ]
}
input (pass):
curl -XPOST 'localhost:9200/_analyze?pretty' -d'
{
"tokenizer": "uax_url_email",
"text": "google.com"
}'
output, notice that the token type is <ALPHANUM> rather than <URL>, yet the input is still kept whole:
{
"tokens" : [ {
"token" : "google.com",
"start_offset" : 0,
"end_offset" : 10,
"type" : "<ALPHANUM>",
"position" : 0
} ]
}
input (pass), notice the trailing space character after the ".com":
curl -XPOST 'localhost:9200/_analyze?pretty' -d'
{
"tokenizer": "uax_url_email",
"text": "google24.com "
}'
output
{
"tokens" : [ {
"token" : "google24.com",
"start_offset" : 0,
"end_offset" : 12,
"type" : "<URL>",
"position" : 0
} ]
}
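Since a trailing space is what makes the tokenizer emit a <URL> token, the only workaround I've come up with so far is a custom analyzer with a pattern_replace char filter that appends a space before tokenization. This is just a sketch: the index name "urls-test", the filter name "append_space", and the analyzer name "url_analyzer" are placeholders, and I haven't confirmed this addresses the root cause.

```shell
# Create an index whose analyzer appends a trailing space before the
# uax_url_email tokenizer runs. The pattern_replace char filter matches
# end-of-input ("$") and replaces it with a space.
curl -XPUT 'localhost:9200/urls-test?pretty' -d'
{
  "settings": {
    "analysis": {
      "char_filter": {
        "append_space": {
          "type": "pattern_replace",
          "pattern": "$",
          "replacement": " "
        }
      },
      "analyzer": {
        "url_analyzer": {
          "type": "custom",
          "char_filter": ["append_space"],
          "tokenizer": "uax_url_email"
        }
      }
    }
  }
}'

# Analyze with the custom analyzer instead of the bare tokenizer:
curl -XPOST 'localhost:9200/urls-test/_analyze?pretty' -d'
{
  "analyzer": "url_analyzer",
  "text": "google24.com"
}'
```

With the appended space, "google24.com" comes back as one <URL> token in my testing, but I'd still like to understand why the space is needed at all.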
Questions:
- Why does adding a trailing space after the URL in the third example produce the desired result?
- Why do URLs with numeric characters before the ".com" not get tokenized as a single <URL> token?
Thank you all for your help!