Storage gains in removing redundant message fields

Consider the following document (no Logstash is involved here):

{
    "_index": "day-wise-2018.11.25",
    "_type": "type1",
    "_id": "11111111111111111111111111111111111111111111111111111111",
    "_score": 1,
    "_source": {
      "id": "012345678910111213",
      "LOAD_AVG_MIN": 1.000001,
      "ts": "2018-11-25T01:41:04.045Z",
      "u": "Load",
      "foo_field": "abcdef-ghi-jk-lmnopq-rstu-v-release_181010_270",
      "@id": "11111111111111111111111111111111111111111111111111111111",
      "@timestamp": "2018-11-25T01:41:04.221Z",
      "@message": "{\"id\":\"012345678910111213\",\"LOAD_AVG_MIN\":1.000001,\"ts\":\"2018-11-	25T01:41:04.045Z\",\"u\":\"Load\",\"foo_field\":\"abcdef-ghi-jk-lmnopq-rstu-v-release_181010_270\"}",
      "@owner": "foo_owner",
      "@log_group": "type1",
    }
  }

As you can see, the @message field looks redundant here, since its contents are indexed as individual key-value pairs. For example, id, LOAD_AVG_MIN, ts, etc. are all indexed as separate fields (and their mappings are defined too).

  1. Will it result in a considerable reduction in size if I do away with the @message field? In the example above the contents of the field are pretty small, but they can be large as well.
  2. I'm also thinking of doing away with the redundant @id and @log_group fields, which carry the same values as _id and _type respectively. Does that yield any benefit? (One way to drop all of these at ingest time is sketched after this list.)
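
If dropping the fields before indexing is an option, an ingest pipeline with remove processors could do it. This is just a sketch; the pipeline name is made up, and it assumes the documents can be routed through the pipeline at index time:

PUT _ingest/pipeline/drop-redundant-fields
{
  "description": "Drop fields whose contents are already indexed elsewhere",
  "processors": [
    { "remove": { "field": "@message" } },
    { "remove": { "field": "@id" } },
    { "remove": { "field": "@log_group" } }
  ]
}

Documents indexed with ?pipeline=drop-redundant-fields would then arrive without the duplicated fields.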

@warkolm - any suggestions here?

@Christian_Dahlqvist any thoughts here?

Yes, that can reduce how much space your data takes up on disk. It is hard to tell how much difference it will make, so I would recommend that you test it.

Thanks @Christian_Dahlqvist. One more question on storage gains: does it save considerable storage if we set enabled: false, i.e. store the field(s) but not index them (and thereby not make them searchable)?

Yes, that is another way to save space in case you need to keep the raw data.

Thank you. Will do a quick POC with enabled: false.
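
A rough sketch of the mapping I have in mind (index name is just a placeholder). One caveat I noticed in the docs: enabled: false is only accepted on object fields and on the top-level mapping, so for a string field like @message the equivalent switch is index: false:

PUT day-wise-test-disabled
{
  "mappings": {
    "type1": {
      "properties": {
        "@message": {
          "type": "text",
          "index": false
        }
      }
    }
  }
}

With this mapping the field is still kept in _source, but it is not searchable.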

As disk usage can vary depending on how much data you have per shard and where in the merging cycle you are, I would recommend indexing a reasonably large amount of data into an index with a single shard and then force merge this down to a single segment at the end. That will ensure you are doing as fair a comparison as possible.
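
In other words (index name illustrative), once the indexing has finished:

POST day-wise-test/_forcemerge?max_num_segments=1

GET _cat/segments/day-wise-test?v

The second call is just to verify that the index really is down to a single segment before you read off the size.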

You mean to say that we first forcemerge and check the disk occupied, and then index with enabled: false, run forcemerge, and again check the disk occupied to get a fair comparison?

Index the data into an index with a single primary shard with the current mappings. Then index the same data into a separate index with a single shard using the updated mappings. Then forcemerge both indices down to a single segment and compare the size.
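
As a sketch, with two illustrative index names (day-wise-baseline using the current mappings, day-wise-trimmed using the updated ones), the whole comparison would look like this (mapping bodies omitted for brevity):

PUT day-wise-baseline
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}

PUT day-wise-trimmed
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}

# ...bulk index the same data set into both indices...

POST day-wise-baseline/_forcemerge?max_num_segments=1
POST day-wise-trimmed/_forcemerge?max_num_segments=1

GET _cat/indices/day-wise-*?v&h=index,docs.count,pri.store.size

Setting number_of_replicas to 0 keeps replica copies from muddying the size comparison.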

Thank you for the excellent clarification.
