Efficient storage of non-analysed text fields in Elasticsearch

i-like-robots · May 3, 2018, 9:34am

We have an Elasticsearch index with over 100 fields. We explicitly define only a small number of the fields in our index mapping and enable dynamic mapping for everything else.

A number of the fields are chunks of markup up to 100kb in size. We only need to be able to check for their existence, so analysing them is not necessary.

What is the most efficient mapping to use for this use case?

We have so far considered the following settings for these fields (or applied as defaults for all text fields using dynamic templates):

- Keeping the field type as "text" but setting the index_options parameter to "docs" only
- Changing the type to "keyword" and setting doc_values to false
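
For reference, the two candidates could be written as dynamic templates roughly like this. It is only a sketch, using Python dicts for the request-body fragments; the template names are made up, and only one of the two would go into a real mapping.

```python
# Sketch of the two candidate dynamic templates as Python dicts.
# The template names ("markup_as_text", "markup_as_keyword") are hypothetical.

# Option 1: keep "text", but store only document IDs in the postings
# (no term frequencies, positions or offsets).
text_docs_only = {
    "markup_as_text": {
        "match_mapping_type": "string",
        "mapping": {"type": "text", "index_options": "docs"},
    }
}

# Option 2: unanalysed "keyword" with doc values switched off.
keyword_no_doc_values = {
    "markup_as_keyword": {
        "match_mapping_type": "string",
        "mapping": {"type": "keyword", "doc_values": False},
    }
}


def mapping_with(template):
    """Wrap a single dynamic template in a mapping body fragment."""
    return {"dynamic_templates": [template]}


candidates = [mapping_with(text_docs_only), mapping_with(keyword_no_doc_values)]
```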

However we are unsure which strategy is the better option.

polyfractal · May 3, 2018, 4:25pm

So you just need to know that the field exists in the document, and don't care at all about the contents?

There are a couple of options you could play with. I'd definitely avoid the keyword approach though. Each chunk of markup will generate its own unique token, which will be very bad for compression since every one of them has to be represented in the term dictionary.

Note that the first two options are a bit hacky.

Truncated Analyzer Setup

text field with index_options set to "docs"
Then create a custom analyzer with:
- keyword tokenizer to generate a single token
- truncate filter with the truncate size (its length parameter) set to 1, then lowercase filter.

That will generate a field that has a single, lowercased, one-character token per document. That means it'll be very low cardinality, which is good for compression, since the dictionary will be tiny and most docs will share the same tokens.
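
A rough sketch of the settings and mapping for this setup, written as Python dicts. The analyzer, filter and field names (existence_only, first_char, body_markup) are made up for illustration, and how the properties block is wrapped depends on your Elasticsearch version. The small function only mimics what the analyzer should emit, so you can see why the term dictionary stays tiny.

```python
# Index settings: keyword tokenizer -> truncate to 1 char -> lowercase.
# "existence_only" and "first_char" are hypothetical names.
index_settings = {
    "analysis": {
        "filter": {
            "first_char": {"type": "truncate", "length": 1},
        },
        "analyzer": {
            "existence_only": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["first_char", "lowercase"],
            }
        },
    }
}

# Field mapping fragment; "body_markup" is a hypothetical field name.
mapping_properties = {
    "body_markup": {
        "type": "text",
        "index_options": "docs",
        "analyzer": "existence_only",
    }
}


def emulate_truncate_analyzer(text):
    """Approximate the analyzer in plain Python: one token, first char, lowercased."""
    if not text:
        return []
    return [text[:1].lower()]


# Very different markup collapses to the same token.
samples = ["<p>Hello</p>", "<DIV class='x'>...</DIV>", "<ul><li>a</li></ul>"]
distinct_tokens = {tok for s in samples for tok in emulate_truncate_analyzer(s)}
```

Since most markup starts with the same character, distinct_tokens ends up with a single entry for these samples.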

Pattern Replace Analyzer setup

Sorta like the above, the goal is to reduce to a single token.

text field with index_options set to "docs"
Then create a custom analyzer with:
- keyword tokenizer to generate a single token
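- pattern_replace filter which matches everything and replaces it with a single token (true). Note the pattern needs to be a regex such as (?s).+ rather than a bare *, which is not a valid regex. The (?s) lets the dot match newlines in the markup, and .+ avoids the extra empty match that .* can produce.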
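
Something along these lines, again as Python dicts. The names (to_true, existence_flag) are made up, and the pattern (?s).+ is my assumption for "match everything"; the function just mimics the result with Python's re module, whose behaviour matches Java's for this pattern as far as I know.

```python
import re

# Hypothetical filter/analyzer names; the pattern is an assumption.
MATCH_EVERYTHING = r"(?s).+"  # (?s) lets "." match newlines too

index_settings = {
    "analysis": {
        "filter": {
            "to_true": {
                "type": "pattern_replace",
                "pattern": MATCH_EVERYTHING,
                "replacement": "true",
            }
        },
        "analyzer": {
            "existence_flag": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["to_true"],
            }
        },
    }
}


def emulate_pattern_replace_analyzer(text):
    """Approximate the analyzer: the whole value becomes the single token 'true'."""
    if not text:
        return []
    return [re.sub(MATCH_EVERYTHING, "true", text)]


multi_line_markup = "<div>\n  <p>Hello</p>\n</div>"
tokens = emulate_pattern_replace_analyzer(multi_line_markup)
```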

No Indexing approach

A less hacky approach is to set the field to index: false (written as index: no in older versions) so that nothing is indexed at all. Then at ingest time, your application (or logstash, or ingest node, etc) adds an additional has_field: true/false to the document. You can then use that to check for existence, and still get the actual markup from the _source.

This would probably be semantically a lot cleaner. It won't confuse anyone accidentally searching the field, because the search will just fail instead of matching weird tokens. It'd also be the best for compression, since the boolean field is lightweight and the markup field won't have any indexed text at all.

If possible, I'd try to use this option.
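
A minimal sketch of the no-indexing approach, assuming the flag is added in your own application code before indexing. The field names body_markup and has_body_markup are hypothetical stand-ins for your markup field and the has_field flag.

```python
# Mapping fragment: markup is kept in _source but not indexed; the flag is a boolean.
# Both field names are hypothetical.
mapping_properties = {
    "body_markup": {"type": "text", "index": False},
    "has_body_markup": {"type": "boolean"},
}


def add_existence_flag(doc, field="body_markup", flag="has_body_markup"):
    """Return a copy of the document with a boolean flag for a non-empty field."""
    enriched = dict(doc)
    enriched[flag] = bool(doc.get(field))
    return enriched


# The existence check then becomes a plain term query on the flag.
exists_query = {"query": {"term": {"has_body_markup": True}}}

with_markup = add_existence_flag({"title": "A", "body_markup": "<p>Hello</p>"})
without_markup = add_existence_flag({"title": "B"})
```

Here an empty string counts as "missing"; if you want empty markup to count as present, test for the key instead of its truthiness.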

i-like-robots · May 8, 2018, 3:56pm

Thank you for your response. In testing, the first option looks to have shaved around 30% off our current index size, which is an excellent result.

polyfractal · May 8, 2018, 3:57pm

Awesome, happy to help! Good luck!

