What is taking up disk space if I disable indexing of all properties and _source?

I'm trying to restructure our Elasticsearch settings to be more space efficient, even at the cost of some convenience. I have a set of 150k documents from a live server that I use to measure the impact of different settings. To find the lowest possible space requirement (which, admittedly, makes the stored data basically useless), I disabled indexing of all properties of the message and disabled _source as well. However, every message still takes up about 300 bytes, so 150k "empty" documents consume 40MB (according to _cat/indices). If it makes any difference, I'm using elasticdump to move the documents in 10k batches. What exactly is being stored? Can I remove this overhead somehow?
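For reference, the per-document figure can be checked with a few lines of Python. This is only a rough sketch using the numbers quoted above; it assumes the "40MB" from _cat/indices is in binary units (1 mb = 1024 * 1024 bytes) and is a rounded value. The function name is made up for illustration.

```python
# Figures quoted in the post. Assumption: _cat/indices reports binary
# units, so 1 mb = 1024 * 1024 bytes, and "40MB" is a rounded value.
DOC_COUNT = 150_000
INDEX_SIZE_MB = 40


def bytes_per_document(size_mb: float, doc_count: int) -> float:
    """Average on-disk bytes per document for an index of the given size."""
    return size_mb * 1024 * 1024 / doc_count


per_doc = bytes_per_document(INDEX_SIZE_MB, DOC_COUNT)
print(f"{per_doc:.0f} bytes per document")
```

The result lands a little under the "about 300 bytes" estimate, which is consistent given the rounding.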

If I do a search, every document looks just like this:
{
  "_index" : "index20",
  "_id" : "Pz9NMH8BsRzEHi_3CFJ-",
  "_score" : 1.0
},

This is how I set up the index before pushing in data:
{
  "mappings": {
    "_source": {"enabled": false},
    "dynamic": "false",
    "properties": {
      "@timestamp": {"enabled": false},
      "@version": {"enabled": false},
      "headers": {"enabled": false},
      "host": {"enabled": false},
      "message": {"enabled": false},
      "tags": {"enabled": false}
    }
  }
}
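Since the same {"enabled": false} entry repeats for every property, the request body can also be generated instead of written by hand. A minimal sketch in Python, assuming the field list from the mapping above; `build_minimal_mapping` is a hypothetical helper name, not part of any Elasticsearch client.

```python
import json

# Field names taken from the mapping in the post.
FIELDS = ["@timestamp", "@version", "headers", "host", "message", "tags"]


def build_minimal_mapping(field_names):
    """Build an index-creation body that disables _source and every listed property."""
    return {
        "mappings": {
            "_source": {"enabled": False},
            "dynamic": "false",
            # Each property is declared only so that it can be switched off.
            "properties": {name: {"enabled": False} for name in field_names},
        }
    }


body = build_minimal_mapping(FIELDS)
print(json.dumps(body, indent=2))
```

The printed JSON has the same content as the body shown above and can be sent as the index-creation request.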

According to _stats, these are the largest parts of the index:
"merges" : {
  "current" : 0,
  "current_docs" : 0,
  "current_size_in_bytes" : 0,
  "total" : 1,
  "total_time_in_millis" : 356,
  "total_docs" : 100000,
  "total_size_in_bytes" : 28525466,
  "total_stopped_time_in_millis" : 0,
  "total_throttled_time_in_millis" : 0,
  "total_auto_throttle_in_bytes" : 20971520
},
"bulk" : {
  "total_operations" : 15,
  "total_time_in_millis" : 4886,
  "total_size_in_bytes" : 195480032,
  "avg_time_in_millis" : 261,
  "avg_size_in_bytes" : 10347232
}

_disk_usage reports this as the biggest culprit:
"fields": {
  "_recovery_source": {
    "total": "37mb",
    "total_in_bytes": 38837261,
    "inverted_index": {
      "total": "0b",
      "total_in_bytes": 0
    },
    "stored_fields": "37mb",
    "stored_fields_in_bytes": 38837261,
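
To put that number in proportion, here is a rough Python sketch that relates the _recovery_source stored fields to the whole index. It uses only the figures from this post and assumes the "40MB" from _cat/indices is a rounded value in binary units, so the share is approximate.

```python
# Figures from the _disk_usage excerpt and the post.
RECOVERY_SOURCE_BYTES = 38_837_261
DOC_COUNT = 150_000
# Assumption: "40MB" from _cat/indices is rounded and in binary units.
INDEX_SIZE_BYTES = 40 * 1024 * 1024

# Average stored-field bytes per document attributable to _recovery_source.
recovery_per_doc = RECOVERY_SOURCE_BYTES / DOC_COUNT
# Approximate fraction of the whole index taken by _recovery_source.
recovery_share = RECOVERY_SOURCE_BYTES / INDEX_SIZE_BYTES

print(f"{recovery_per_doc:.0f} bytes per document in _recovery_source")
print(f"{recovery_share:.0%} of the reported index size")
```

By this estimate, _recovery_source accounts for nearly all of the per-document overhead in question.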

What's the point in storing it in Elasticsearch if you want to do this?

I wanted to remove _source from the data completely and store only a few integers plus a timestamp per message, but I didn't understand why the reported size of the stored data was still far larger than expected. After gradually removing indexed data to find where the extra space was coming from, I ended up in the state I'm in now. So this is not a configuration I want to use; it's just a data point in my measurements that I want to understand.
