Inconsistent behaviour of the UAX URL email tokenizer?

jakopako · March 17, 2021, 11:59am

The following two queries extract URLs differently, and I am wondering whether this is a bug. In the first case the ' that follows the URL is part of the extracted token, whereas in the second case it is not. This seems to depend on whether the URL ends with a path or with query parameters, or more specifically, on whether the URL contains a ? or not. I tested this with version 7.11.2 in the Dev Tools console of Kibana.

First case:

GET /_analyze
{
  "tokenizer" : "uax_url_email",
  "text" : "src='http://example.com/?id=100'"
}

Response:

{
  "tokens" : [
    {
      "token" : "src",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "http://example.com/?id=100'",
      "start_offset" : 5,
      "end_offset" : 32,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}

Second case:

GET /_analyze
{
  "tokenizer" : "uax_url_email",
  "text" : "src='http://example.com/path/to/script.js'"
}

Response:

{
  "tokens" : [
    {
      "token" : "src",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "http://example.com/path/to/script.js",
      "start_offset" : 5,
      "end_offset" : 41,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}
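
To make the difference easy to see, here is a small Python sketch that slices the two input strings at the start_offset and end_offset values reported in the responses above. It does not call Elasticsearch; the helper name url_token is only an illustrative name I made up, and the offsets are copied by hand from the two responses.

```python
# Input texts and the <URL> token offsets, copied from the two
# _analyze responses above (start_offset, end_offset).
cases = [
    ("src='http://example.com/?id=100'", 5, 32),
    ("src='http://example.com/path/to/script.js'", 5, 41),
]


def url_token(text, start, end):
    """Return the substring covered by the reported offsets (hypothetical helper)."""
    return text[start:end]


for text, start, end in cases:
    token = url_token(text, start, end)
    # Show whether the closing single quote ended up inside the token.
    print(repr(token), "ends with quote:", token.endswith("'"))
```

In the first case the slice covers the closing ' as well, while in the second case it stops right after script.js, which matches the tokens shown in the responses.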

system · April 14, 2021, 11:59am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
