- Words are split at hyphens, unless the word contains a number, in which case the token is not split and the numbers and hyphen(s) are preserved.
- Proper tokenization of IP addresses (in particular, IPv6 addresses are not tokenized correctly by the Elasticsearch `uax_url_email` tokenizer).
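To make the first behavior concrete, here is a small sketch of the splitting rule in plain Python (a hypothetical illustration of the rule as described, not Elasticsearch or Lucene code):

```python
def split_token(token: str) -> list[str]:
    """Split a token at hyphens, unless it contains a digit,
    in which case the token is kept whole."""
    if any(ch.isdigit() for ch in token):
        # Number present: preserve the hyphens and return the token intact.
        return [token]
    return token.split("-")

print(split_token("wi-fi"))        # split at the hyphen
print(split_token("CVE-123-456"))  # kept whole, digits present
```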
If not, how could I go about implementing a custom tokenizer?
I apologize for not being more specific. We would like proper tokenization of IPv6 addresses that are not embedded in a URL or email address. The example below demonstrates the two features of Solr's UAX29 URL Email tokenizer that we are looking for.
Ideally, we would like two tokens returned: "2001:4860:0:2001::68" and "CVE-123-456". That's currently not the case.
POST _analyze
{
"tokenizer": "uax_url_email",
"text": "2001:4860:0:2001::68 CVE-123-456"
}
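One possible workaround (a sketch under assumptions, not a confirmed solution): since both desired tokens in this example are whitespace-delimited, a custom analyzer built on the `whitespace` tokenizer would keep them intact, at the cost of losing the URL/email recognition that `uax_url_email` provides. The index name `my_index` and analyzer name `ip_preserving` are hypothetical.

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ip_preserving": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "ip_preserving",
  "text": "2001:4860:0:2001::68 CVE-123-456"
}
```

This should emit "2001:4860:0:2001::68" and "cve-123-456" as single tokens, but it will not split ordinary hyphenated words the way the Solr tokenizer does; matching that behavior exactly would likely require a custom tokenizer plugin or a `pattern` tokenizer.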