UAX29 URL Email Tokenizer

I am wondering whether there are plans to implement a more advanced version of the current URL Email Tokenizer, such as the UAX29 one provided by Solr (https://lucene.apache.org/solr/guide/7_3/tokenizers.html#uax29-url-email-tokenizer).

The most useful features for us would be:

  • Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
  • Support for proper tokenization of IP addresses (in particular, IPv6 addresses are not tokenized properly with the ES version of the url email tokenizer).

If not, how could I go about implementing a custom tokenizer?

Can you share a concrete example of what is not working? This one, for example, works in 6.6.0:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "http://[2001:4860:0:2001::68]/ test me"
}

Thanks!

I apologize for not being more specific. We would like proper tokenization of IPv6 addresses that are not embedded in a URL or email address. Please see my example below, which demonstrates the two features of the UAX29 URL Email Tokenizer in Solr that we are looking for.

Ideally, we would like two tokens returned: "2001:4860:0:2001::68" and "CVE-123-456". That is currently not the case with this request:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "2001:4860:0:2001::68 CVE-123-456"
}
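
To make the expected behavior concrete, here is a rough Python sketch of the two rules described above. The `tokenize` helper is hypothetical and written only for illustration; it is not how Solr or Elasticsearch implement their tokenizers, and it assumes plain whitespace-separated input with no punctuation, URL, or email handling.

```python
def tokenize(text):
    """Illustrative sketch of the desired rules, not a real analyzer."""
    tokens = []
    # Splitting on whitespace only means colons are never split points,
    # so a bare IPv6 address stays together as one token.
    for chunk in text.split():
        if any(ch.isdigit() for ch in chunk):
            # A chunk containing a digit is kept whole, hyphens included.
            tokens.append(chunk)
        else:
            # A chunk without digits is split at hyphens.
            tokens.extend(part for part in chunk.split("-") if part)
    return tokens


if __name__ == "__main__":
    print(tokenize("2001:4860:0:2001::68 CVE-123-456"))
    print(tokenize("well-known issue"))
```

With the text from the request above, this sketch returns the two tokens we are hoping for, while a purely alphabetic hyphenated word such as "well-known" is still split into "well" and "known".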

Thanks for the help!