Combining multiple tokenizer features on a single _all field


#1

We have several customer-defined indices on ES 6 with 100+ fields, where each field has a copy_to mapping to an _all field. This lets us run full-text search across all user-defined fields in an index.

I have several specific tokenizer requirements for this (and any other) field in those indices:

  1. Emails should be tokenized as-is and not broken up: uax_url_email tokenizer
  2. Support non-Western languages: icu_tokenizer
  3. (Company) domain names should be normalized without TLD ('Amazon.com' > 'Amazon'), so they will match queries without the '.com'.

I have currently implemented requirements 2 and 3 in a single analyzer as follows:

"analysis": {
  "filter": {
    "domain_name": {
      "type": "pattern_capture",
      "preserve_original": true,
      "patterns": [
        "^(?:www\\.)?([^.]{3,})\\.[^.]+"
      ]
    }
  },
  "analyzer": {
    "icu": {
      "filter": [
        "icu_folding",
        "domain_name"
      ],
      "type": "custom",
      "tokenizer": "icu_tokenizer"
    }
  }
}
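
To make the behaviour of the domain_name filter above concrete, here is a rough Python sketch of what pattern_capture should emit for this pattern, as far as I understand it. The function name domain_name_filter and the sample tokens are made up for illustration. It only approximates the real filter (single pattern, no token positions) and assumes the tokens have already been case-folded by icu_folding.

```python
import re

# Same regex as in the "domain_name" filter, with the JSON escaping removed.
DOMAIN_PATTERN = re.compile(r"^(?:www\.)?([^.]{3,})\.[^.]+")


def domain_name_filter(token, preserve_original=True):
    """Approximate pattern_capture for a single token (hypothetical helper).

    Emits the original token when preserve_original is set, followed by
    every non-empty capture group that is not already in the output.
    If nothing matches, the original token is kept.
    """
    emitted = [token] if preserve_original else []
    for match in DOMAIN_PATTERN.finditer(token):
        for group in match.groups():
            if group and group not in emitted:
                emitted.append(group)
    return emitted or [token]


if __name__ == "__main__":
    samples = ["amazon.com", "www.amazon.com", "amazon", "ab.com",
               "john@example.com"]
    for sample in samples:
        print(sample, "->", domain_name_filter(sample))
```

One thing this makes visible: if an email address reached the filter as a single token (for example from uax_url_email), the pattern would also capture the part before the last dot, e.g. 'john@example' from 'john@example.com', because [^.] does not exclude '@'.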

How can I also add requirement 1 to this analyzer so that email addresses are kept intact? Is there a way to 'combine' the two different tokenizers?

Is there a better way to implement the (company) domain name tokenization?


#2

Does anyone have an idea how to achieve these three different tokenizations of terms?

