Storage gains in removing redundant message fields

Consider the following document (no Logstash is involved here):

{
    "_index": "day-wise-2018.11.25",
    "_type": "type1",
    "_id": "11111111111111111111111111111111111111111111111111111111",
    "_score": 1,
    "_source": {
      "id": "012345678910111213",
      "LOAD_AVG_MIN": 1.000001,
      "ts": "2018-11-25T01:41:04.045Z",
      "u": "Load",
      "foo_field": "abcdef-ghi-jk-lmnopq-rstu-v-release_181010_270",
      "@id": "11111111111111111111111111111111111111111111111111111111",
      "@timestamp": "2018-11-25T01:41:04.221Z",
      "@message": "{\"id\":\"012345678910111213\",\"LOAD_AVG_MIN\":1.000001,\"ts\":\"2018-11-	25T01:41:04.045Z\",\"u\":\"Load\",\"foo_field\":\"abcdef-ghi-jk-lmnopq-rstu-v-release_181010_270\"}",
      "@owner": "foo_owner",
      "@log_group": "type1",
    }
  }

As you can see, the @message field looks redundant here, since its contents are indexed as individual key-value pairs. For example, id, LOAD_AVG_MIN, ts, etc. are all indexed as separate fields (and their mappings are defined too).

  1. Will it result in a considerable reduction in size if I do away with the @message field? In the example above the contents of the field are pretty small, but they can be large as well.
  2. I'm also thinking of doing away with the redundant @id and @log_group fields, which carry the same values as _id and _type respectively. Does that yield any benefit? (One way to drop all of these at ingest time is sketched after this list.)
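
If dropping the fields before indexing is an option, an ingest pipeline with remove processors could do it. This is just a sketch; the pipeline name is made up, and it assumes the documents can be routed through the pipeline at index time:

PUT _ingest/pipeline/drop-redundant-fields
{
  "description": "Drop fields whose contents are already indexed elsewhere",
  "processors": [
    { "remove": { "field": "@message" } },
    { "remove": { "field": "@id" } },
    { "remove": { "field": "@log_group" } }
  ]
}

Documents indexed with ?pipeline=drop-redundant-fields would then arrive without the duplicated fields.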

@warkolm - any suggestions here?

@Christian_Dahlqvist any thoughts here?

Yes, that can reduce how much space your data takes up on disk. It is hard to tell how much difference it will make, so I would recommend that you test it.

Thanks @Christian_Dahlqvist. One more question on storage gains: does it save considerable storage if we set enabled: false, i.e. store the field(s) but not index them (and thereby not make them searchable)?

Yes, that is another way to save space in case you need to keep the raw data.

Thank you. Will do a quick POC with enabled: false.
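
A rough sketch of the mapping I have in mind (index name is just a placeholder). One caveat I noticed in the docs: enabled: false is only accepted on object fields and on the top-level mapping, so for a string field like @message the equivalent switch is index: false:

PUT day-wise-test-disabled
{
  "mappings": {
    "type1": {
      "properties": {
        "@message": {
          "type": "text",
          "index": false
        }
      }
    }
  }
}

With this mapping the field is still kept in _source, but it is not searchable.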

As disk usage can vary depending on how much data you have per shard and where in the merging cycle you are, I would recommend indexing a reasonably large amount of data into an index with a single shard and then force merge this down to a single segment at the end. That will ensure you are doing as fair a comparison as possible.
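
In other words (index name illustrative), once the indexing has finished:

POST day-wise-test/_forcemerge?max_num_segments=1

GET _cat/segments/day-wise-test?v

The second call is just to verify that the index really is down to a single segment before you read off the size.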

You mean to say that we first forcemerge and check the disk occupied, and then index with enabled: false, run forcemerge, and again check the disk occupied to get a fair comparison?

Index the data into an index with a single primary shard with the current mappings. Then index the same data into a separate index with a single shard using the updated mappings. Then forcemerge both indices down to a single segment and compare the size.
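
As a sketch, with two illustrative index names (day-wise-baseline using the current mappings, day-wise-trimmed using the updated ones), the whole comparison would look like this (mapping bodies omitted for brevity):

PUT day-wise-baseline
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}

PUT day-wise-trimmed
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}

# ...bulk index the same data set into both indices...

POST day-wise-baseline/_forcemerge?max_num_segments=1
POST day-wise-trimmed/_forcemerge?max_num_segments=1

GET _cat/indices/day-wise-*?v&h=index,docs.count,pri.store.size

Setting number_of_replicas to 0 keeps replica copies from muddying the size comparison.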

Thank you for the excellent clarification.
