We have an Elasticsearch index with over 100 fields. We explicitly define only a small number of the fields in our index mapping and enable dynamic mapping for everything else.
A number of the fields are chunks of markup up to 100 KB in size. We only need to be able to check for their existence, so analysing them is not necessary.
What is the most efficient mapping to use for this use case?
We have so far considered the following settings for these fields (or applied as defaults for all text fields using dynamic templates):
Keeping the field type as "text" but setting the index_options parameter to "docs" only
Changing the type to "keyword" and defining doc_values as false
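For reference, the first option applied as a dynamic template would look roughly like this (the index name, template name, and match pattern are placeholders for our actual setup; we'd also disable norms since we never score on these fields):

```json
PUT my-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "markup_docs_only": {
          "match_mapping_type": "string",
          "match": "*_markup",
          "mapping": {
            "type": "text",
            "index_options": "docs",
            "norms": false
          }
        }
      }
    ]
  }
}
```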
However we are unsure which strategy is the better option.
So you just need to know that the field exists in the document, and don't care at all about the contents?
There are a couple of options you could play with. I'd definitely avoid the keyword approach, though. Each of those markup blobs would be indexed as a single unique token, which will be very bad for compression since every one of them has to be represented in the term dictionary.
Note that the first two options are a bit hacky.
Truncated Analyzer Setup
text field with index_options set to "docs"
Then create a custom analyzer with:
keyword tokenizer to generate a single token
truncate filter with its length set to 1, followed by a lowercase filter.
That will generate a field that has a single, lowercased token per document, which means it will have very low cardinality. That's good for compression, since the term dictionary will be tiny and most docs will share the same token.
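A sketch of that setup (analyzer and field names are illustrative, and the mapping syntax may need a type name on older versions):

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "first_char": {
          "type": "truncate",
          "length": 1
        }
      },
      "analyzer": {
        "existence_only": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["first_char", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "markup_field": {
        "type": "text",
        "analyzer": "existence_only",
        "index_options": "docs"
      }
    }
  }
}
```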
Pattern Replace Analyzer Setup
Sorta like the above, the goal is to reduce to a single token.
text field with index_options set to "docs"
Then create a custom analyzer with:
keyword tokenizer to generate a single token
pattern_replace filter whose pattern matches everything (a regex like .*) and replaces it with a single token (e.g. true)
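The filter portion might look like this. Note the (?s) flag so the pattern also matches across newlines in the markup, and the ^...$ anchors so the trailing empty match isn't replaced a second time (names are illustrative):

```json
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "everything_to_true": {
          "type": "pattern_replace",
          "pattern": "(?s)^.*$",
          "replacement": "true"
        }
      },
      "analyzer": {
        "existence_flag": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["everything_to_true"]
        }
      }
    }
  }
}
```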
No Indexing Approach
A less hacky approach is to set the field to index: false (index: no on older versions) so that nothing is indexed at all. Then at ingest time, your application (or Logstash, an ingest node, etc.) adds an additional has_field: true/false field to the document. You can then use that field to check for existence, and still get the actual markup from the _source.
This would probably be a lot cleaner semantically. It won't confuse anyone accidentally searching the field, because the search will simply fail instead of matching weird tokens. It would also be the best for compression, since the boolean field is lightweight and the markup field won't contribute any indexed text at all.
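A minimal sketch, using an ingest pipeline to set the flag (the has_markup and markup field names, and the pipeline name, are just examples):

```json
PUT my-index
{
  "mappings": {
    "properties": {
      "markup":     { "type": "text",    "index": false },
      "has_markup": { "type": "boolean" }
    }
  }
}

PUT _ingest/pipeline/set-has-markup
{
  "processors": [
    {
      "script": {
        "source": "ctx.has_markup = ctx.containsKey('markup') && ctx.markup != null"
      }
    }
  ]
}
```

Existence checks then become a simple term query on the boolean, e.g. { "query": { "term": { "has_markup": true } } }, rather than an exists query on the unindexed markup field.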