We have several customer-defined indices on ES 6 with 100+ fields each, where every field has a copy_to mapping into a single catch-all field (our own replacement for the deprecated _all field). This allows us to perform full-text search over all user-defined fields in the index.
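For context, a minimal sketch of such a mapping (the index, type, and field names here are made up; all_fields stands in for the built-in _all):

PUT /customer_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "email":      { "type": "text", "copy_to": "all_fields" },
        "company":    { "type": "text", "copy_to": "all_fields" },
        "all_fields": { "type": "text" }
      }
    }
  }
}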
I have a few specific analysis requirements for this (and every other) field in those indices:
1. Emails should be tokenized as-is and not broken up: the uax_url_email tokenizer (see the _analyze sketch after this list).
2. Non-western languages should be supported: the icu_tokenizer.
3. (Company) domain names should be normalized by stripping the TLD ('Amazon.com' → 'Amazon'), so they also match queries that omit the '.com'.
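A quick way to see why requirements 1 and 2 clash is the _analyze API (the address is made up):

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "jane.doe@example.com"
}

With uax_url_email this yields the single token jane.doe@example.com (token type <EMAIL>), while swapping in icu_tokenizer (or standard) splits it at the '@' into jane.doe and example.com, which is exactly what I want to avoid for emails.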
I currently implement requirements 2 and 3 in a single analyzer, as follows:
"analysis": {
"filter": {
"domain_name": {
"type": "pattern_capture",
"preserve_original": "true",
"patterns": [
"^(?:www\\.)?([^.]{3,})\\.[^.]+"
]
}
},
"analyzer": {
"icu": {
"filter": [
"icu_folding",
"domain_name"
],
"type": "custom",
"tokenizer": "icu_tokenizer"
}
}
}
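As far as I understand the filter chain, testing this analyzer against the index (name hypothetical) shows the preserve_original behavior I'm relying on:

POST /customer_index/_analyze
{
  "analyzer": "icu",
  "text": "Amazon.com"
}

This should produce both amazon.com (the folded original) and amazon (the pattern capture), so queries for 'Amazon' and 'Amazon.com' both match.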
How can I add requirement 1 on top of this, so that email addresses are supported as well? Can the two different tokenizers somehow be combined in one analyzer?
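The only 'combination' I can come up with is to not combine them at all, but to index the catch-all field twice via multi-fields, one sub-field per analyzer, and then search both sub-fields; a sketch (email_analyzer would be a custom analyzer wrapping uax_url_email, and all names are hypothetical):

"all_fields": {
  "type": "text",
  "analyzer": "icu",
  "fields": {
    "email": {
      "type": "text",
      "analyzer": "email_analyzer"
    }
  }
}

with queries going through something like a multi_match over ["all_fields", "all_fields.email"]. But that doubles the indexing work for 100+ copied fields, so I'd like to know whether there is a cleaner way.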
And separately: is there a better way to implement the (company) domain name tokenization?