The following two queries extract URLs in different ways, and I am wondering whether this is a bug. In the first case, the ' following the URL is part of the extracted token, whereas in the second case it is not. This seems to depend on whether the URL ends with a path or with a query string, or more specifically, on whether the URL contains a ? or not. I tested this with version 7.11.2 in the Dev Tools section of Kibana.
First case:
GET /_analyze
{
  "tokenizer" : "uax_url_email",
  "text" : "src='http://example.com/?id=100'"
}
Response:
{
  "tokens" : [
    {
      "token" : "src",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "http://example.com/?id=100'",
      "start_offset" : 5,
      "end_offset" : 32,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}
Second case:
GET /_analyze
{
  "tokenizer" : "uax_url_email",
  "text" : "src='http://example.com/path/to/script.js'"
}
Response:
{
  "tokens" : [
    {
      "token" : "src",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "http://example.com/path/to/script.js",
      "start_offset" : 5,
      "end_offset" : 41,
      "type" : "<URL>",
      "position" : 1
    }
  ]
}
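
To make the difference easy to check against other inputs, here is a minimal Python sketch that inspects an _analyze response and reports any <URL> token that ends with a quote character. It does not call Elasticsearch; it only works on the token lists copied from the two responses above. The function name url_tokens_with_trailing_quote is my own and not part of any Elasticsearch client.

```python
def url_tokens_with_trailing_quote(text, tokens):
    """Return the <URL> tokens that end with a quote character.

    `text` is the string sent to _analyze and `tokens` is the "tokens"
    list from the response. Hypothetical helper for illustration only.
    """
    flagged = []
    for tok in tokens:
        if tok["type"] != "<URL>":
            continue
        # Sanity check: the reported offsets should slice the input
        # text to exactly the token string.
        span = text[tok["start_offset"]:tok["end_offset"]]
        if span != tok["token"]:
            raise ValueError(f"offsets do not match token: {tok!r}")
        if tok["token"].endswith(("'", '"')):
            flagged.append(tok["token"])
    return flagged


# Inputs and tokens copied from the two cases above.
first_text = "src='http://example.com/?id=100'"
first_tokens = [
    {"token": "src", "start_offset": 0, "end_offset": 3,
     "type": "<ALPHANUM>", "position": 0},
    {"token": "http://example.com/?id=100'", "start_offset": 5,
     "end_offset": 32, "type": "<URL>", "position": 1},
]

second_text = "src='http://example.com/path/to/script.js'"
second_tokens = [
    {"token": "src", "start_offset": 0, "end_offset": 3,
     "type": "<ALPHANUM>", "position": 0},
    {"token": "http://example.com/path/to/script.js", "start_offset": 5,
     "end_offset": 41, "type": "<URL>", "position": 1},
]

first_flagged = url_tokens_with_trailing_quote(first_text, first_tokens)
second_flagged = url_tokens_with_trailing_quote(second_text, second_tokens)
print("first case:", first_flagged)
print("second case:", second_flagged)
```

With the responses above, only the URL token from the first case is flagged; the second case yields an empty list.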