Inconsistent behaviour of the UAX URL email tokenizer?

The following two requests tokenize URLs differently, and I am wondering whether this is a bug. In the first case the ' that follows the URL is included in the extracted token, whereas in the second case it is not. The difference seems to depend on whether the URL ends with a path or with query parameters, or more specifically on whether the URL contains a ?. I tested this with version 7.11.2 in the Dev Tools section of Kibana.

First case:

GET /_analyze
{
  "tokenizer" : "uax_url_email",
  "text" : "src='http://example.com/?id=100'"
}

Response:

{
  "tokens" : [
    {
      "token" : "src",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "http://example.com/?id=100'",
      "start_offset" : 5,
      "end_offset" : 32,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}

Second case:

GET /_analyze
{
  "tokenizer" : "uax_url_email",
  "text" : "src='http://example.com/path/to/script.js'"
}

Response:

{
  "tokens" : [
    {
      "token" : "src",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "http://example.com/path/to/script.js",
      "start_offset" : 5,
      "end_offset" : 41,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}
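
To make the difference easy to check without reading the offsets by eye, here is a minimal Python sketch that slices each input text with the start_offset and end_offset reported for the <URL> token in the two responses above. It does not call Elasticsearch; the texts and offsets are copied from this post, and the helper names url_token and ends_with_quote are my own, chosen only for illustration.

```python
# Input text plus the <URL> token offsets, copied from the two
# _analyze responses in this post (no Elasticsearch call is made).
CASES = [
    ("src='http://example.com/?id=100'", 5, 32),
    ("src='http://example.com/path/to/script.js'", 5, 41),
]


def url_token(text, start, end):
    """Return the substring the tokenizer reported as the <URL> token."""
    return text[start:end]


def ends_with_quote(token):
    """True if the token still carries the trailing single quote."""
    return token.endswith("'")


for text, start, end in CASES:
    token = url_token(text, start, end)
    print(repr(token), "trailing quote:", ends_with_quote(token))
```

The slices reproduce the tokens shown in the responses, so the reported offsets are consistent with the token text: only the URL containing a ? keeps the closing quote.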
