For our logs, the average document size is 500KB to 1MB, but most of the time the size on disk in ES is smaller than the raw size. That is likely because of our mappings.
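A quick way to verify that is to compare the on-disk size ES reports against the raw log size with the _cat API (a sketch; the "logs-*" index pattern is an assumption):

GET _cat/indices/logs-*?v&h=index,docs.count,store.size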
I believe that for logs, only about 30% of the fields are used for full text search or aggregations; the rest should be set to either "index": "not_analyzed" or "index": "no".
Before indexing a new log type in ES, I pass the logs through Logstash and review the fields to decide which ones should be indexed; a minimal review pipeline is sketched below.
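This is one way to do that review, assuming a stock Logstash install (the input path is hypothetical): pretty-print every parsed event so each field name and value can be inspected before any mapping exists.

input {
  file {
    # Hypothetical location of the new log type
    path => "/var/log/myapp/*.log"
    start_position => "beginning"
  }
}
output {
  # Print each event with all of its fields for review
  stdout { codec => rubydebug }
}

Below is our default mapping for logs: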
"mappings": {
"_default_": {
"dynamic_templates": [
{
"string_fields": {
"mapping": {
"index": "not_analyzed",
"omit_norms": true,
"type": "string"
},
"match_mapping_type": "string",
"match": "*"
}
}
],
"include_in_all": false,
"properties": {
"@timestamp": {
"type": "date",
"format": "strict_date_optional_time||epoch_millis"
},
"full_text_search_and_aggregation": {
"include_in_all": true,
"type": "string",
"index": "not_analyzed"
},
"full_text_search_by_field_name": {
"type": "string",
"index": "analyzed"
},
"no_search_or_aggregation": {
"type": "string",
"index": "no"
}
}
}
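To have every new daily index pick this up automatically, a _default_ mapping like the one above is typically installed as an index template (a sketch; the template name and "logs-*" pattern are assumptions, and the abbreviated body stands in for the full mapping above):

PUT /_template/logs
{
  "template": "logs-*",
  "mappings": {
    "_default_": {
      "include_in_all": false
    }
  }
}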
For user convenience, I include the fields that need full text search in the _all field so that users can search without entering the field name. The "include_in_all" setting can be changed at any time, which is not the case for the "index" setting.
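For example, pulling a field into _all later is a plain mapping update (a sketch; the index and type names are hypothetical, the field is from the mapping above), whereas changing "index": "no" to "analyzed" would force a full reindex:

PUT /logs-2016.01.01/_mapping/logs
{
  "properties": {
    "full_text_search_and_aggregation": {
      "type": "string",
      "index": "not_analyzed",
      "include_in_all": true
    }
  }
}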
I've seen cases where an index was 3x larger than it should be due to unnecessary mappings (nGram and edge nGram analyzers): the inverted index holding the tokens was 2x larger than the logs themselves, which requires a lot of resources and is very slow.
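For reference, the kind of mapping that produces that blow-up looks like the sketch below (all names are assumed): an nGram tokenizer with a wide gram range emits a token for nearly every substring of every message, so each log line is stored many times over in the inverted index.

PUT /logs-ngram-example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "substring_tokens": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "substring_analyzer": {
          "type": "custom",
          "tokenizer": "substring_tokens"
        }
      }
    }
  },
  "mappings": {
    "logs": {
      "properties": {
        "message": {
          "type": "string",
          "analyzer": "substring_analyzer"
        }
      }
    }
  }
}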