Efficient storage of non-analysed text fields in Elasticsearch

(Matt) #1

We have an Elasticsearch index with over 100 fields. We explicitly define only a small number of the fields in our index mapping and enable dynamic mapping for everything else.

A number of the fields are chunks of markup up to 100 KB in size. We only need to be able to check for their existence, so analysing them is not necessary.

What is the most efficient mapping to use for this use case?

We have so far considered the following settings for these fields (or applied as defaults for all text fields using dynamic templates):

  • Keeping the field type as "text" but setting the index_options parameter to "docs" only
  • Changing the type to "keyword" and defining doc_values as false

However, we are unsure which strategy is the better option.
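
For reference, the two candidates could be written roughly as below. This is only a sketch: the template name `strings_docs_only` is made up for illustration, and depending on the Elasticsearch version the mappings may need to be nested under a mapping type name.

```python
import json

# Option 1: keep "text" but index only document IDs (no frequencies or positions).
text_docs_only = {"type": "text", "index_options": "docs"}

# Option 2: "keyword" with doc_values disabled (no sorting or aggregations on the field).
keyword_no_doc_values = {"type": "keyword", "doc_values": False}


def as_dynamic_template(name, field_mapping):
    """Wrap a field mapping in a dynamic template that applies to every string field."""
    return {name: {"match_mapping_type": "string", "mapping": field_mapping}}


# Hypothetical template name; here option 1 is applied as the default for strings.
mappings = {
    "dynamic_templates": [as_dynamic_template("strings_docs_only", text_docs_only)]
}

print(json.dumps(mappings, indent=2))
```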

(Zachary Tong) #2

So you just need to know that the field exists in the document, and don't care at all about the contents?

There are a couple of options you could play with. I'd definitely avoid the keyword approach though. Each chunk of markup will generate its own unique token, which will be very bad for compression since every one of them has to be represented in the term dictionary.

Note that the first two options are a bit hacky :)

Truncated Analyzer Setup

  • text field with index_options set to "docs"
  • Then create a custom analyzer with:
    • keyword tokenizer to generate a single token
    • truncate filter with length set to 1, followed by a lowercase filter

That will generate a field that has a single, lowercased, one-character token. That means it'll be very low cardinality, which is good for compression, since the term dictionary will be tiny and most docs will share the same tokens.
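
A minimal sketch of what that index definition might look like, written as a Python dict that can be serialised to JSON. The names `first_char_only`, `existence_only` and `markup` are made up for illustration, and the typeless `mappings` layout assumes a recent Elasticsearch version. The `simulate` helper is only a rough local approximation of what the analyzer emits, not a call into Elasticsearch.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Keep only the first character of the single token.
                "first_char_only": {"type": "truncate", "length": 1}
            },
            "analyzer": {
                "existence_only": {
                    "type": "custom",
                    "tokenizer": "keyword",  # whole value becomes one token
                    "filter": ["first_char_only", "lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "markup": {
                "type": "text",
                "index_options": "docs",
                "analyzer": "existence_only",
            }
        }
    },
}


def simulate(value):
    """Approximate the analyzer locally: one token, first character, lowercased."""
    return [value[:1].lower()]


print(json.dumps(index_body, indent=2))
print(simulate("<DIV>Hello</DIV>"))
```

Since most markup starts with the same handful of characters, nearly every document ends up sharing one of a few tokens.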

Pattern Replace Analyzer setup

Sorta like the above: the goal is to reduce every value to the same single token.

  • text field with index_options set to "docs"
  • Then create a custom analyzer with:
    • keyword tokenizer to generate a single token
    • pattern_replace filter with a pattern that matches the whole value (a regex such as (?s).+, since a bare * is not a valid regex) and replaces it with a single token (true)
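
As a sketch, the settings for this variant might look like the following. The names `any_to_true`, `exists_flag` and `markup` are made up for illustration, the typeless `mappings` layout assumes a recent Elasticsearch version, and the pattern `(?s).+` is one possible choice (the `(?s)` flag lets `.` match newlines inside the markup). The `simulate` helper uses Python's `re` module to approximate the filter locally; it does not call Elasticsearch.

```python
import json
import re

MATCH_ALL = "(?s).+"  # assumed pattern: one or more of any character, newlines included

index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "any_to_true": {
                    "type": "pattern_replace",
                    "pattern": MATCH_ALL,
                    "replacement": "true",
                }
            },
            "analyzer": {
                "exists_flag": {
                    "type": "custom",
                    "tokenizer": "keyword",  # whole value becomes one token
                    "filter": ["any_to_true"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "markup": {
                "type": "text",
                "index_options": "docs",
                "analyzer": "exists_flag",
            }
        }
    },
}


def simulate(value):
    """Approximate the analyzer locally: the whole value is replaced by one token."""
    return [re.sub(MATCH_ALL, "true", value)]


print(json.dumps(index_body, indent=2))
print(simulate("<p>first</p>\n<p>second</p>"))
```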

No Indexing approach

A less hacky approach is to set the field to index: false (index: no on older versions) so that nothing is indexed at all. Then at ingest time, your application (or Logstash, or an ingest node, etc.) adds an additional has_field: true/false to the document. You can then use that to check for existence, and still get the actual markup from the _source.

This would probably be semantically a lot cleaner. It won't confuse anyone who accidentally searches the field, because the search will just fail instead of matching weird tokens :) It'd also be the best for compression, since the boolean field is lightweight and nothing from the markup field goes into the inverted index at all.

If possible, I'd try to use this option.
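
A rough sketch of the application-side piece, assuming documents are plain dicts before they are sent to Elasticsearch. The field name `markup` and the helper `add_existence_flag` are hypothetical, and the sketch treats an empty string as "not present", which may or may not be what you want.

```python
import json


def add_existence_flag(doc, field="markup", flag="has_field"):
    """Return a copy of doc with a boolean recording whether `field` is present and non-empty."""
    flagged = dict(doc)
    flagged[flag] = bool(doc.get(field))
    return flagged


# The markup is kept in _source but not indexed; only the boolean is searchable.
mappings = {
    "properties": {
        "markup": {"type": "text", "index": False},
        "has_field": {"type": "boolean"},
    }
}

# Existence check becomes a cheap term query on the boolean.
exists_query = {"query": {"term": {"has_field": True}}}

docs = [{"title": "a", "markup": "<p>hello</p>"}, {"title": "b"}]
for doc in docs:
    print(json.dumps(add_existence_flag(doc)))
```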

(Matt) #3

Thank you for your response. Testing the first option looks to have shaved around 30% off our current index size, which is an excellent result.

(Zachary Tong) #4

Awesome, happy to help! Good luck :)
