UAX29 URL Email Tokenizer


(Andre E) #1

I am wondering if there are plans to implement a more advanced version of the current URL Email Tokenizer, such as the UAX29 one provided by Solr (https://lucene.apache.org/solr/guide/7_3/tokenizers.html#uax29-url-email-tokenizer).

The most useful features for us would be:

  • Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
  • Support for proper tokenization of IP addresses (in particular, IPv6 addresses are not tokenized properly by the ES version of the URL Email Tokenizer).

If not, how could I go about implementing a custom tokenizer?


(Alexander Reelsen) #2

Can you share a concrete example of what is not working? This one, for example, works in 6.6.0:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "http://[2001:4860:0:2001::68]/ test me"
}

Thanks!


(Andre E) #3

I apologize for not being more specific. We would like proper tokenization of IPv6 addresses that are not embedded in a URL or email address. Please see my example below, which demonstrates the two features of the UAX29 URL Email Tokenizer in Solr that we are looking for.

Ideally, we would like two tokens returned: "2001:4860:0:2001::68" and "CVE-123-456". That's currently not the case.

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "2001:4860:0:2001::68 CVE-123-456"
}
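
To make the expected output concrete, here is a rough Python sketch of the behavior described above. It is only an illustration of the two rules (keep bare IP addresses whole; split on hyphens unless the word contains a digit), not how Lucene's UAX29 tokenizer or Elasticsearch actually work internally. The function names `is_ip_address` and `tokenize` are made up for this example, and splitting on whitespace first is a simplifying assumption.

```python
import ipaddress
from typing import List


def is_ip_address(token: str) -> bool:
    """Return True if the token parses as a bare IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def tokenize(text: str) -> List[str]:
    """Hypothetical sketch of the desired rules, not a real analyzer."""
    tokens: List[str] = []
    # Assumption: whitespace is the only top-level separator.
    for chunk in text.split():
        if is_ip_address(chunk):
            # Rule 2: a standalone IP address stays a single token.
            tokens.append(chunk)
        elif "-" in chunk and not any(ch.isdigit() for ch in chunk):
            # Rule 1: split at hyphens only when the word has no digits.
            tokens.extend(part for part in chunk.split("-") if part)
        else:
            # Words with digits keep their hyphens and numbers intact.
            tokens.append(chunk)
    return tokens


print(tokenize("2001:4860:0:2001::68 CVE-123-456"))
# ['2001:4860:0:2001::68', 'CVE-123-456']
```

A word without digits, such as "well-known", would come back as two tokens ("well" and "known") under this sketch.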

Thanks for the help!