Index Space Utilization On Elastic Nodes

Hi Team,

In one of our Elastic clusters, which has two nodes, we created an index with 6 shards and 0 replicas.
We loaded a file of 1.1 GB into that index.
When checking the index size with an OS command on both nodes, the index occupies approx. 1.5 GB on each node.

> [root@elasticnode3 indices]# du -sh 6hFx5kOeSnme3RE7QDntLw
> 1.5G    6hFx5kOeSnme3RE7QDntLw
> [root@elasticnode1 indices]# du -sh 6hFx5kOeSnme3RE7QDntLw
> 1.5G    6hFx5kOeSnme3RE7QDntLw

Could you please help me understand why the index usage at the OS level is about 3 times higher than the size of the loaded file?

Thanks,
Debasis

It depends... on your mapping.

Basically, a JSON document like { "foo": "bar baz" } might be "stored/indexed" by default as:

  • bar baz (foo.keyword keyword field)
  • [bar, baz] (foo text field)
  • { "foo": "bar baz" } (_source meta field)

Not including some other metadata like the _id...
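
One way to see this on a cluster is to index the sample document into a scratch index that has no explicit mapping and read back the mapping Elasticsearch generated. This is only a sketch: the index name mapping_demo is made up for illustration. With default dynamic mapping, the string field should come back as text with a keyword sub-field, the same shape as the Filebeat fields shown further down this thread.

```console
# mapping_demo is a hypothetical scratch index with no predefined mapping
PUT /mapping_demo/_doc/1
{ "foo": "bar baz" }

# Inspect the mapping that dynamic mapping generated for "foo"
GET /mapping_demo/_mapping

# Clean up the scratch index
DELETE /mapping_demo
```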

Hi @dadoonet, thanks for your quick response.
I am loading CSV files into the index, and the index mapping is as below.

PUT /elastictemp3
{
  "settings": {
    "refresh_interval": "30s",
    "number_of_shards": "6",
    "number_of_replicas": "0"
  },
  "mappings": {
    "properties": {
      "sequence": {
        "type": "keyword"
      },
      "component": {
        "type": "keyword"
      },
      "tenant": {
        "type": "keyword"
      },
      "service_id": {
        "type": "keyword"
      },
      "session_id": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "epoch_millis"
      },
      "edr_version": {
        "type": "keyword"
      },
      "prov_version": {
        "type": "keyword"
      },
      "action_id": {
        "type": "keyword"
      },
      "action_extra_info": {
        "type": "keyword"
      },
      "rule_type": {
        "type": "keyword"
      },
      "rule_id": {
        "type": "keyword"
      },
      "traffic_type": {
        "type": "keyword"
      },
      "ac_group": {
        "type": "keyword"
      },
      "ac_version": {
        "type": "keyword"
      },
      "opcode": {
        "type": "keyword"
      },
      "cg_plan": {
        "type": "keyword"
      },
      "cg_gt": {
        "type": "keyword"
      },
      "cg_ssn": {
        "type": "keyword"
      },
      "cg_operator": {
        "type": "keyword"
      },
      "cd_plan": {
        "type": "keyword"
      },
      "cd_gt": {
        "type": "keyword"
      },
      "cd_ssn": {
        "type": "keyword"
      },
      "cd_operator": {
        "type": "keyword"
      },
      "imsi": {
        "type": "keyword"
      },
      "location_operator": {
        "type": "keyword"
      },
      "location_timestamp": {
        "type": "date",
        "format": "epoch_millis"
      },
      "sms_content": {
        "type": "keyword"
      },
      "message_id": {
        "type": "keyword"
      },
      "fragment_number": {
        "type": "keyword"
      },
      "sms_analytics_case_id": {
        "type": "keyword"
      },
      "msisdn": {
        "type": "keyword"
      },
      "msisdn_ton": {
        "type": "keyword"
      },
      "msisdn_npi": {
        "type": "keyword"
      },
      "tpoa": {
        "type": "keyword"
      },
      "sender_type": {
        "type": "keyword"
      },
      "sms_hub_type": {
        "type": "keyword"
      },
      "embedded_url": {
        "type": "keyword"
      },
      "domain": {
        "type": "keyword"
      },
      "msg_source": {
        "type": "keyword"
      },
      "smsc": {
        "type": "keyword"
      },
      "pid": {
        "type": "keyword"
      },
      "dcs": {
        "type": "keyword"
      },
      "message_type": {
        "type": "keyword"
      }
    }
  }
}

We actually need to load 15 billion records into the index through Filebeat, so we want to estimate the disk space this will require.
Could you please help us find out which parts of the index occupy the space, and how we can use disk space wisely at the OS level?

Thanks,
Debasis
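
For the question of which part of the index occupies the space, one option worth trying is the analyze index disk usage API, which reports a per-field breakdown (inverted index, doc values, stored fields and so on). It is a technical-preview API and the analysis is resource-intensive, so treat this as a sketch to run on a test index rather than a recommendation for a busy production cluster.

```console
# run_expensive_tasks=true must be set explicitly, because the analysis is resource-intensive
POST /elastictemp3/_disk_usage?run_expensive_tasks=true
```

The response lists each field with the bytes it uses per data structure, which should make it easier to spot fields such as message that dominate the total.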

Do you need to search, sort, or aggregate on all the fields? If not, you can set index: false on some fields.
Do you need to get back the whole JSON document? If not, disable the _source.
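
For the first suggestion, a minimal sketch could look like the request below. The index name elastictemp_noindex is hypothetical, and the choice of which fields are never searched (sms_content and pid here) is only an assumption for illustration. The index parameter cannot be changed on an existing field, so this has to go into a new index that the data is then loaded into.

```console
# elastictemp_noindex is a hypothetical new index; only three fields are shown
PUT /elastictemp_noindex
{
  "mappings": {
    "properties": {
      "imsi":        { "type": "keyword" },
      "sms_content": { "type": "keyword", "index": false },
      "pid":         { "type": "keyword", "index": false }
    }
  }
}
```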

You can also think of merging the segments (with forcemerge).
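
A force merge can be requested as sketched below. It is best run once indexing into the index has finished, since it is an expensive operation; max_num_segments=1 asks for a single segment per shard.

```console
# Merge each shard of the index down to one segment
POST /elastictemp3/_forcemerge?max_num_segments=1

# Check segment counts and sizes, before and after the merge
GET /_cat/segments/elastictemp3?v
```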

How many documents have you indexed so far within the 3 GB of disk space?
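
The document count and on-disk size can be read from the cat APIs, for example as below. Dividing the store size by the document count gives a rough bytes-per-document figure that can be extrapolated towards the 15 billion target, with the caveat that a small sample may not scale linearly.

```console
# Document count and store size for the whole index
GET /_cat/indices/elastictemp3?v&h=index,pri,rep,docs.count,store.size,pri.store.size

# The same figures per shard, including the node each shard lives on
GET /_cat/shards/elastictemp3?v&h=index,shard,prirep,docs,store,node
```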

What is the actual mapping of the index? For example, if you have fields that are not listed in this mapping, Elasticsearch may map them as both keyword and text.

For example, what is the response of GET elastictemp3/_mapping?

Please find the output below.

{
  "elastictemp3": {
    "mappings": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "ac_group": {
          "type": "keyword"
        },
        "ac_version": {
          "type": "keyword"
        },
        "action_extra_info": {
          "type": "keyword"
        },
        "action_id": {
          "type": "keyword"
        },
        "agent": {
          "properties": {
            "ephemeral_id": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "id": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "type": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "version": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "cd_gt": {
          "type": "keyword"
        },
        "cd_operator": {
          "type": "keyword"
        },
        "cd_plan": {
          "type": "keyword"
        },
        "cd_ssn": {
          "type": "keyword"
        },
        "cg_gt": {
          "type": "keyword"
        },
        "cg_operator": {
          "type": "keyword"
        },
        "cg_plan": {
          "type": "keyword"
        },
        "cg_ssn": {
          "type": "keyword"
        },
        "component": {
          "type": "keyword"
        },
        "dcs": {
          "type": "keyword"
        },
        "domain": {
          "type": "keyword"
        },
        "ecs": {
          "properties": {
            "version": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "edr_version": {
          "type": "keyword"
        },
        "embedded_url": {
          "type": "keyword"
        },
        "fragment_number": {
          "type": "keyword"
        },
        "host": {
          "properties": {
            "architecture": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "containerized": {
              "type": "boolean"
            },
            "hostname": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "id": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "ip": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "mac": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "name": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            },
            "os": {
              "properties": {
                "codename": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "family": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "kernel": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "name": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "platform": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "type": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                },
                "version": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            }
          }
        },
        "imsi": {
          "type": "keyword"
        },
        "input": {
          "properties": {
            "type": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        },
        "location_operator": {
          "type": "keyword"
        },
        "location_timestamp": {
          "type": "date",
          "format": "epoch_millis"
        },
        "log": {
          "properties": {
            "file": {
              "properties": {
                "path": {
                  "type": "text",
                  "fields": {
                    "keyword": {
                      "type": "keyword",
                      "ignore_above": 256
                    }
                  }
                }
              }
            },
            "offset": {
              "type": "long"
            }
          }
        },
        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "message_id": {
          "type": "keyword"
        },
        "message_type": {
          "type": "keyword"
        },
        "msg_source": {
          "type": "keyword"
        },
        "msisdn": {
          "type": "keyword"
        },
        "msisdn_npi": {
          "type": "keyword"
        },
        "msisdn_ton": {
          "type": "keyword"
        },
        "opcode": {
          "type": "keyword"
        },
        "pid": {
          "type": "keyword"
        },
        "prov_version": {
          "type": "keyword"
        },
        "rule_id": {
          "type": "keyword"
        },
        "rule_type": {
          "type": "keyword"
        },
        "sender_type": {
          "type": "keyword"
        },
        "sequence": {
          "type": "keyword"
        },
        "service_id": {
          "type": "keyword"
        },
        "session_id": {
          "type": "keyword"
        },
        "sms_analytics_case_id": {
          "type": "keyword"
        },
        "sms_content": {
          "type": "keyword"
        },
        "sms_hub_type": {
          "type": "keyword"
        },
        "smsc": {
          "type": "keyword"
        },
        "tenant": {
          "type": "keyword"
        },
        "timestamp": {
          "type": "date",
          "format": "epoch_millis"
        },
        "tpoa": {
          "type": "keyword"
        },
        "traffic_type": {
          "type": "keyword"
        }
      }
    }
  }
}

Thanks,
Debasis

You should remove all the filebeat metadata you don't need.

Hi @dadoonet
Are you referring to setting _all to false, as in the example below?

"_all": {
  "enabled": false
}

If so, I tried that, but it is not supported in Elasticsearch 8.9.

PUT my_index
{
  "mappings": {
    "type_1": { 
      "properties": {...}
    },
    "type_2": { 
      "_all": {
        "enabled": false
      },
      "properties": {...}
    }
  }
}

So could you please tell me how to achieve the same?

Thanks,
Debasis

Have a look at _source field | Elasticsearch Guide [8.12] | Elastic

If disk usage is important to you, then have a look at synthetic _source, which shrinks disk usage at the cost of only supporting a subset of mappings and slower fetches, or (not recommended) at disabling the _source field, which also shrinks disk usage but disables many features.
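
As a sketch of what enabling synthetic _source looks like in the 8.x versions discussed in this thread: the index name elastictemp_synthetic is hypothetical, only two fields are shown, and the way this is configured may differ in later versions. Since only a subset of mappings is supported, it is worth checking each field type and parameter against the documentation before relying on it.

```console
# elastictemp_synthetic is a hypothetical new index; feature was in preview at the time of this thread
PUT /elastictemp_synthetic
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "imsi":      { "type": "keyword" },
      "timestamp": { "type": "date", "format": "epoch_millis" }
    }
  }
}
```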

But more importantly, remove all the filebeat generated fields like agent, ecs, host, input, log, message.

You can do that from the edge with Drop fields from events | Filebeat Reference [8.12] | Elastic

Or in Elasticsearch at index time with an ingest pipeline with: Remove processor | Elasticsearch Guide [8.12] | Elastic
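
A minimal sketch of the ingest pipeline variant is shown below. The pipeline name drop_filebeat_metadata is made up for illustration, and the field list is taken from the fields named above. If another pipeline already parses the message field, the remove processor would need to go at the end of that pipeline instead, so that message is still available while parsing.

```console
# drop_filebeat_metadata is a hypothetical pipeline name
PUT /_ingest/pipeline/drop_filebeat_metadata
{
  "description": "Remove Filebeat-generated fields that are not needed",
  "processors": [
    {
      "remove": {
        "field": ["agent", "ecs", "host", "input", "log", "message"],
        "ignore_missing": true
      }
    }
  ]
}

# Run the pipeline for every document indexed into the existing index
PUT /elastictemp3/_settings
{
  "index.default_pipeline": "drop_filebeat_metadata"
}
```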

As you can see in your mapping, there are a lot of other fields that you didn't map, so Elasticsearch created the mapping when it first received them. This can waste space, because Elasticsearch maps string fields as both text and keyword by default, which uses considerably more space.

One example is the message field, which normally holds your log line before parsing. If you store the original message field, it can use a lot of space; in some cases it can double the space used.

Here the message field is being stored as both text and keyword and since this contains your original log, this uses a lot of space.

        "message": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }

And as mentioned by @dadoonet you need to remove all the fields added by filebeat.
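
To keep unexpected fields from being mapped as text plus keyword in the first place, dynamic mapping can be switched off for the index, as sketched below with a hypothetical index name and only two of the fields. Note that with "dynamic": false, unmapped fields are not indexed but are still kept in _source, so removing them before indexing is still needed to save that space.

```console
# elastictemp_static is a hypothetical new index; unmapped fields are not added to the mapping
PUT /elastictemp_static
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "imsi":      { "type": "keyword" },
      "timestamp": { "type": "date", "format": "epoch_millis" }
    }
  }
}
```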

Thanks a ton @dadoonet & @leandrojmp for your inputs. They helped us reduce space by 45%.

(a) we set index:false on fields that will not be used for search.
(b) we set doc_values:false for all fields except text since there is no requirement of sorting, aggregation or scripting. For text it is anyways disabled by default.
(c) we removed the metadata fields with remove processor in the ingest pipeline. Since that is on the elasticsearch side, we are now planning to do it on the filebeat side. If those are not to be stored, why even send them on the wire!
(d) we removed the message field with remove processor in pipeline. Guess, this cannot be done on the filebeat side. We tried and found that there was only metadata in the index.
(e) There are some fields that are not required to be stored, so we have removed those as well. Again, will move those to filebeat.
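
For readers following along, points (a) and (b) could translate into a mapping along these lines. This is a sketch only: the index name is hypothetical and the split between the two fields is an assumption, not the actual mapping used above. Here imsi stays searchable but cannot be sorted or aggregated on, while sms_content can only be returned from _source.

```console
# elastictemp_compact is a hypothetical new index; only two fields are shown
PUT /elastictemp_compact
{
  "mappings": {
    "properties": {
      "imsi":        { "type": "keyword", "doc_values": false },
      "sms_content": { "type": "keyword", "index": false, "doc_values": false }
    }
  }
}
```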

A few questions, if you can enlighten us:

  1. Is it correct that message cannot be removed in filebeat or are we missing something?
  2. While the space is reduced, will adding these remove processors slow down the ingest?
  3. Removing some fields in filebeat and some others in ingest pipeline will be slower than removing all on any one side?
  4. Our understanding was that message field was clearly a duplicate, because all fields are available in the _source. Is the understanding correct or are we missing something?
  5. It is understood that those fields for which index:false and doc_values:false need to be present in _source to return the values when queried. But data (values) for fields that are index:true are available in the index, can we skip storing those in _source? When queried, can some values be returned from their respective index and others from _source?

Thanks again.

Is it correct that message cannot be removed in filebeat or are we missing something?

Correct. Otherwise you don't have any content in Elasticsearch.

While the space is reduced, will adding these remove processors slow down the ingest?

Probably, but this should not be very noticeable, especially if you already have other processors extracting data from the message field.
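
If you want to measure the cost rather than guess, the node stats API exposes ingest statistics per pipeline and per processor, including document counts and time spent. A possible request:

```console
# Show only the per-pipeline ingest statistics for each node
GET /_nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines
```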

Removing some fields in filebeat and some others in ingest pipeline will be slower than removing all on any one side?

I prefer removing on the edge (Filebeat) as, as you said, it reduces the amount of data going over the wire. BTW I recently published a post about this topic: Enrich your Elasticsearch documents from the edge | Elastic Blog

Our understanding was that message field was clearly a duplicate, because all fields are available in the _source. Is the understanding correct or are we missing something?

Correct. The message and the individual fields don't have the same form, but that's indeed a duplication of the content.

It is understood that those fields for which index:false and doc_values:false need to be present in _source to return the values when queried.

Yes.

You can also have a look at synthetic source. This might also reduce the index size. See _source field | Elasticsearch Guide [8.12] | Elastic. But it's in preview.

Thanks @dadoonet for that superfast reply, much appreciated.

Can you please advise on the second part of point 5: for fields that are indexed, can we choose not to store them in _source, which would further help to reduce the size? Maybe that is the same as synthetic _source, but I need to re-read it.

Yes. You can exclude them from _source if you only use them for searching and don't need to display them to the end user.

But note that this will prevent you from using some features like the _reindex API which reads the _source field.
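
A sketch of excluding fields from _source is shown below. The index name and the choice of imsi and msisdn are assumptions for illustration. The excluded fields remain searchable; if doc values are left enabled for them (unlike in point (b) above), their values can still be returned through docvalue_fields, as in the second request.

```console
# elastictemp_excludes is a hypothetical new index; imsi and msisdn are not stored in _source
PUT /elastictemp_excludes
{
  "mappings": {
    "_source": { "excludes": ["imsi", "msisdn"] },
    "properties": {
      "imsi":   { "type": "keyword" },
      "msisdn": { "type": "keyword" }
    }
  }
}

# Return imsi from doc values instead of _source
GET /elastictemp_excludes/_search
{
  "_source": false,
  "docvalue_fields": ["imsi"]
}
```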