UAX_URL_EMAIL tokenizer problem


#1

Hey guys,

Does the uax_url_email tokenizer not work with short URLs like "google24.com"? I was wondering if you're seeing the same results, and what workarounds you'd recommend.

Goal: correctly tokenize "google24.com" as a single token
Result: I get two tokens: "google24" and "com"

input (fail):

curl -XPOST 'localhost:9200/_analyze?pretty' -d'
{
    "tokenizer": "uax_url_email",
    "text": "google24.com"
}'

output:

{
  "tokens" : [ {
    "token" : "google24",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "com",
    "start_offset" : 9,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

input (pass):

curl -XPOST 'localhost:9200/_analyze?pretty' -d'
{
    "tokenizer": "uax_url_email",
    "text": "google.com"
}'

output (notice that the token type is <ALPHANUM>, not <URL>):

{
  "tokens" : [ {
    "token" : "google.com",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 0
  } ]
}

input (pass); notice the space character after ".com":

curl -XPOST 'localhost:9200/_analyze?pretty' -d'
{
  "tokenizer": "uax_url_email",
  "text": "google24.com "
}'

output

{
  "tokens" : [ {
    "token" : "google24.com",
    "start_offset" : 0,
    "end_offset" : 12,
    "type" : "<URL>",
    "position" : 0
  } ]
}
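Since appending a trailing space makes the tokenizer emit a <URL> token, one possible workaround is to automate that at analysis time with a pattern_replace char filter that appends a space before tokenization. This is an untested sketch, not a confirmed fix; the index name "urls" and analyzer name "url_analyzer" are made up for illustration:

```shell
# Sketch: define a custom analyzer whose char_filter appends a trailing
# space (regex "$" = end of input) before uax_url_email tokenizes.
curl -XPUT 'localhost:9200/urls?pretty' -d'
{
  "settings": {
    "analysis": {
      "char_filter": {
        "append_space": {
          "type": "pattern_replace",
          "pattern": "$",
          "replacement": " "
        }
      },
      "analyzer": {
        "url_analyzer": {
          "type": "custom",
          "char_filter": ["append_space"],
          "tokenizer": "uax_url_email"
        }
      }
    }
  }
}'

# Then check whether "google24.com" now comes back as a single <URL> token:
curl -XPOST 'localhost:9200/urls/_analyze?pretty' -d'
{
  "analyzer": "url_analyzer",
  "text": "google24.com"
}'
```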

Questions:

  1. Why does adding a space after the URL in the third example produce the desired result?
  2. Why do URLs with numeric characters before the ".com" appear not to be tokenized correctly?

Thank you, everyone, for your help!


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.