Elasticsearch Data Directory size anomaly

Hi all,

This is a theoretical question.

I have indexed almost 2 GB of data in Elasticsearch with the default shard and replica settings.

The logs were indexed from February 1 to July 15.

The index name is date-based, i.e. in Logstash: index => "xyz-%{+YYYY.MM.dd}"

The total size of these logs is only 2 GB, and I have created 10+ ML jobs.

After the indexing finished I checked the data directory size; it is now 16 GB.

How did this happen?

The indexed logs are only 2 GB in size, but the data directory is now 16 GB. Why is the data directory 16 GB?
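One way to see where the space is actually going is the _cat APIs. The two requests below are a sketch (the column selection is just an example); they list indices and shards by on-disk size, including the hidden indices created by the ML jobs (for example .ml-anomalies-* and .ml-state), which also live in the data directory:

```
# List every index sorted by size on disk, including the hidden ML
# indices (e.g. .ml-anomalies-*, .ml-state) created by the ML jobs
GET _cat/indices?v&s=store.size:desc&h=index,pri,rep,docs.count,pri.store.size,store.size

# Per-shard view, to see how many shards the daily indices add up to
GET _cat/shards?v&s=store:desc
```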

One more question: if I push 10 KB of logs per second to Logstash, and Logstash forwards these logs to Elasticsearch, how much space is required in Elasticsearch for a year of storage?
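As a rough back-of-the-envelope figure (assuming the 10 KB/s is sustained around the clock): 10 KB/s × 86,400 s/day ≈ 864 MB of raw logs per day, or roughly 315 GB per year before any Elasticsearch overhead. The actual on-disk size can end up smaller or larger than the raw size depending on the mapping, compression, and any enrichment, and with the default of one replica (on a multi-node cluster) it roughly doubles.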

And what other specifications are needed for the ELK server: RAM, processor, etc.?

Note: logs are indexed by date (xyz-%{+YYYY.MM.dd}).

My elasticsearch.yml is shown below.

ES Conf

```
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: ES
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: Server
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
path.logs: /home/elastic/elk/elasticsearch/elasticsearch-6.3.1/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 192.168.1.15
#
# Set a custom port for HTTP:
#
http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes:
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
xpack.security.enabled: true
```

It depends on what is indexed, what the mapping is, etc. A lot of things are involved.
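For example, you can inspect the mapping that dynamic mapping generated for one of the daily indices. By default every string field is indexed both as text and as a .keyword sub-field, and the full _source of each document is stored as well, all of which adds to the on-disk size. The request below is a sketch; the index name is just an illustration of the xyz-%{+YYYY.MM.dd} pattern:

```
# Inspect the generated mapping for one daily index
# (index name is only an example following the xyz-YYYY.MM.dd pattern)
GET xyz-2018.07.15/_mapping
```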


Thank you @dadoonet

But the thing is, the total size of the indexed logs is only 2 GB, while the data folder is 16 GB.

And I have created 10+ ML jobs.

What do you mean by mapping?

Thank you @dadoonet.

Now I understand. It depends on the parsing: fields, data types, geolocation, etc.

It also sounds like you have a lot of very small shards, which is inefficient. If you switched to e.g. monthly indices with a single primary shard I would expect the size of data on disk to shrink.
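For context: with the 6.x defaults of 5 primary shards and 1 replica per index, daily indices from February 1 to July 15 come to roughly 165 indices, i.e. around 825 primary shards for only 2 GB of data, and every shard is a separate Lucene index with its own overhead. A minimal sketch of how to change this going forward (the template name xyz_monthly is hypothetical, and number_of_replicas: 0 assumes a single-node cluster where replicas would stay unassigned anyway):

```
PUT _template/xyz_monthly
{
  "index_patterns": ["xyz-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
```

Combined with changing the Logstash output to index => "xyz-%{+YYYY.MM}", new data would then roll over monthly into single-shard indices; the existing daily indices would need to be reindexed (or shrunk) to benefit.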

