UAX29 URL Email Tokenizer

avencar · February 13, 2019, 10:52pm

I am wondering whether there are plans to implement a more advanced version of the current URL Email Tokenizer, such as the UAX29 one provided by Solr (https://lucene.apache.org/solr/guide/7_3/tokenizers.html#uax29-url-email-tokenizer).

The most useful features for us would be:

Words are split at hyphens, unless there is a number in the word, in which case the token is not split and the numbers and hyphen(s) are preserved.
Support for proper tokenization of IP addresses (in particular, IPv6 addresses are not tokenized properly with the ES version of the url email tokenizer).

If not, how could I go about implementing a custom tokenizer?
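
On the first feature, a possible starting point before writing a custom tokenizer: Elasticsearch ships a built-in `classic` tokenizer whose documentation describes the same hyphen rule (words are split at hyphens unless the token contains a number). A minimal sketch for checking how it treats such input follows; the sample text is an assumption chosen for illustration, and the output should be verified on the version in use, since the rule's handling of IPv6 addresses is not covered by that description.

```console
POST _analyze
{
  "tokenizer": "classic",
  "text": "wi-fi CVE-123-456"
}
```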

spinscale · February 15, 2019, 9:08am

Can you share a concrete example of what is not working? This one, for example, works in 6.6.0:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "http://[2001:4860:0:2001::68]/ test me"
}

Thanks!

avencar · February 15, 2019, 3:26pm

I apologize for not being more specific. We would like proper tokenization of IPv6 addresses that are not embedded in a URL or email address. Please see my example below, which demonstrates the two features of the UAX29 URL Email Tokenizer in Solr that we are looking for.

Ideally, we would like two tokens returned: "2001:4860:0:2001::68" and "CVE-123-456". That's currently not the case.

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "2001:4860:0:2001::68 CVE-123-456"
}

Thanks for the help!
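
One possible workaround, offered as a sketch under assumptions rather than as an answer given in this thread: if whole whitespace-delimited strings are acceptable as tokens, the built-in `whitespace` tokenizer leaves both values intact, because it splits only on whitespace. The trade-off is that it does not strip punctuation and gives URLs and email addresses no special handling.

```console
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "2001:4860:0:2001::68 CVE-123-456"
}
```

This request should return exactly two tokens, `2001:4860:0:2001::68` and `CVE-123-456`. Whether that trade-off is acceptable depends on what else appears in the indexed text.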
