Index text as keyword array leveraging tokenizers and filters

Say I am indexing emails as documents into Elasticsearch, and I'd like to index the email body as an array of keywords, in addition to indexing it as an analyzed text field.
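Concretely, I'm picturing a mapping along these lines, with the body indexed twice (field names here are just placeholders):

{
  "mappings": {
    "properties": {
      "email_body": {
        "type": "text"
      },
      "email_body_words": {
        "type": "keyword"
      }
    }
  }
}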

My goal is to run terms aggregations on the words in the email body, but I don't want to enable fielddata. Indexing the words with the keyword datatype (as an array) lets me have them stored as doc_values on disk.
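This is roughly the kind of query I have in mind (the index name "emails" and field name "email_body_words" are just placeholders):

GET /emails/_search
{
  "size": 0,
  "aggs": {
    "top_words": {
      "terms": {
        "field": "email_body_words",
        "size": 20
      }
    }
  }
}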

For example, if I want to index this text:
"Greetings user: Welcome to our app! We are hopeful that you will have an enjoyable time!"

In my application, I could naively convert this text to an array by breaking on whitespace and lowercasing the elements, so that I index a keyword array that looks like this:
["greetings", "user:", "welcome", "to", "our", "app!", "we", "are", "hopeful", "that", "you", "will", "have", "an", "enjoyable", "time!"]

But it would be nice if I could use Elasticsearch's built-in tokenizers and filters to remove the stop words, more intelligently break the text into words, and do stemming, so that I index a keyword array like this:
["greet", "user", "welcome", "app", "hope", "enjoy", "time']

Is there any way I can set up a keyword mapping that leverages the text analyzers in this way? Here is the pseudo-mapping I'm going for:

{
  "mappings": {
    "properties": {
      "email_body_words": {
        "type":  "keyword",
        "split_text_into_array_of_k_stemmed_tokens_and_remove_stop_words": true
      }
    }
  }
}
