Index text as keyword array leveraging tokenizers and filters

Say I am indexing emails as documents into Elasticsearch, and I'd like to index the email body as an array of keywords, in addition to indexing it as an analyzed text field.
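Concretely, I'm picturing a mapping along these lines, with the body indexed twice (field names here are just placeholders):

{
  "mappings": {
    "properties": {
      "email_body": {
        "type": "text"
      },
      "email_body_words": {
        "type": "keyword"
      }
    }
  }
}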

My goal is to run terms aggregations on the words in the email body, but I don't want to enable fielddata. Indexing the words with the keyword datatype (as an array) lets me have them stored as doc_values on disk.
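This is roughly the kind of query I have in mind (the index name "emails" and field name "email_body_words" are just placeholders):

GET /emails/_search
{
  "size": 0,
  "aggs": {
    "top_words": {
      "terms": {
        "field": "email_body_words",
        "size": 20
      }
    }
  }
}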

For example, if I want to index this text:
"Greetings user: Welcome to our app! We are hopeful that you will have an enjoyable time!"

In my application, I could naively convert this text to an array by breaking on whitespace and lowercasing the elements, so that I index a keyword array that looks like this:
["greetings", "user:", "welcome", "to", "our", "app!", "we", "are", "hopeful", "that", "you", "will", "have", "an", "enjoyable", "time!"]

But it would be nice if I could use Elasticsearch's built-in tokenizers and filters to remove the stop words, more intelligently break the text into words, and do stemming, so that I index a keyword array like this:
["greet", "user", "welcome", "app", "hope", "enjoy", "time']

Is there any way I can set up a keyword mapping that leverages the text analyzers in this way? Here is the pseudo-mapping I'm going for:

{
  "mappings": {
    "properties": {
      "email_body_words": {
        "type":  "keyword",
        "split_text_into_array_of_k_stemmed_tokens_and_remove_stop_words": true
      }
    }
  }
}
