Unbalanced disk usage - ES 2.1

On a 5-node cluster the shards are balanced, but the disk space used is not. Checking the data directories, I found that some shards are taking roughly 10 times their reported size on disk.

tickets_v2 2 p STARTED 1688677 30.5gb 10.14.23.210 Occulus
tickets_v2 2 r STARTED 1688677 28.9gb 10.14.66.191 Lila Cheney
tickets_v2 4 p STARTED 1690046 34.6gb 10.14.66.191 Lila Cheney
tickets_v2 4 r STARTED 1690046 30.1gb 10.14.17.10 Equinox
tickets_v2 1 r STARTED 1687292 26.9gb 10.14.23.210 Occulus
tickets_v2 1 p STARTED 1687292 30.1gb 10.14.23.34 Cowgirl
tickets_v2 3 p STARTED 1688535 31.3gb 10.14.65.216 Mastermind
tickets_v2 3 r STARTED 1688535 27.9gb 10.14.17.10 Equinox
tickets_v2 0 r STARTED 1688199 28.6gb 10.14.65.216 Mastermind
tickets_v2 0 p STARTED 1688198 30.2gb 10.14.23.34 Cowgirl
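
(The shard listing above comes from the cat shards API; something along these lines, with host and flags being whatever fits your setup:)

curl -XGET 'localhost:9200/_cat/shards/tickets_v2?v'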

root@elasticsearch2:/data/elasticsearch/touch/nodes/0/indices/tickets_v2# du -sh *
299G 3
300G 4
8.0K _state

On another node it is:

root@elasticsearch1:/data/elasticsearch/touch/nodes/0/indices/tickets_v2# du -sh *
28G 1
302G 2
8.0K _state
Why would this be, and are there any solutions?

Elasticsearch only balances by shard count, not size.
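
(The disk-based allocation watermarks only block allocation to a nearly full node; they do not rebalance by size. If you want to check or adjust them, a rough sketch via the cluster settings API, with the 2.x default values shown purely as placeholders:)

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'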

That looks like a big difference. Are all nodes running exactly the same version?

Yes, all are running the same version: 2.1.1.

Based on the shard listing it looks like data is evenly distributed across the nodes, as each node has 2 shards that are all similar in size.

Yes. Would adding nodes and relocating shards to the new one help, or would that just move the shards (and their disk usage) from one node's disk to another?

Is there any other way you can see to alleviate this?

Could it be something in 2.x? We have not seen this issue with 5.x.

Would the merge API help here?
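
(What I have in mind is something like the following, assuming the 2.1 force merge endpoint; max_num_segments=1 is just an example value:)

# force merge is expensive and is best run against indices that are not being actively written to
curl -XPOST 'localhost:9200/tickets_v2/_forcemerge?max_num_segments=1'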

What I see in the New Relic plugin is that the number of documents is the same on all boxes.

What does GET /_nodes/stats/indices give?
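
(Something like this, redirected to a file since the output is large:)

curl -XGET 'localhost:9200/_nodes/stats/indices?pretty' > nodes_stats.json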

Is there any place where I can paste the output? It's too long.

Put it in a gist and link to it here.

Here it is:

Based on that the data seems reasonably evenly distributed across the nodes.

Any suggestions to alleviate this? Adding a new node? Relocating shards? The merge API?

I am running out of ideas.

I do not see what the problem is. Distribution seems even across the nodes.

Any other clue as to why the disk usage would be bloated for some shards? Do you think it's an issue related to 2.x?

I do not know, as I have not used version 2.x in quite some time. It may help if you can identify which types of files make up the difference.
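
(A rough way to do that from the shell, using the shard path quoted earlier in the thread; adjust for your own layout:)

cd /data/elasticsearch/touch/nodes/0/indices/tickets_v2/2
du -sh */              # compare the index (Lucene segments), translog and _state directories
ls -lhS index | head   # largest segment files first, if the index directory is the culprit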

Hi @Christian_Dahlqvist,

I'm experiencing a similar issue in Elasticsearch 6.2.3. The shard size is not equal across nodes.

The primary shard is using 57.8gb of storage while the replicas are using 264.5gb.

In the graph below you can see the difference between the primary and the replicas. This index contains only one shard.

[graph: primary vs replica shard storage]

elasticsearch.yml:

cluster.name: ${CLUSTER_NAME}
cluster.routing.allocation.awareness.attributes: aws_availability_zone
cloud.node.auto_attributes: true
plugin.mandatory: discovery-ec2,repository-s3
transport.tcp.compress: true
indices.queries.cache.size: 30%
indices.requests.cache.size: 20%
indices.memory.index_buffer_size: 20%
indices.memory.max_index_buffer_size: 512mb
action.auto_create_index: false
action.destructive_requires_name: true
node.master: ${ES_NODE_MASTER}
node.data: ${ES_NODE_DATA}
bootstrap.memory_lock: true
network.host: 0.0.0.0
http.cors.enabled: true
http.cors.allow-origin: '*'
discovery.zen.minimum_master_nodes: ${SPLIT_BRAIN_NODES}
discovery.ec2.tag.cluster: ${CLUSTER_NAME}
discovery.ec2.endpoint: ec2.${AWS_REGION}.amazonaws.com
discovery.zen.ping_timeout: 30s
discovery.zen.hosts_provider: ec2
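
(The primary/replica size gap is also visible directly from the cat shards API; the index name below is a placeholder and the columns are just the ones I find useful:)

curl -XGET 'localhost:9200/_cat/shards/<index-name>?v&h=index,shard,prirep,store,node'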

In my case, even that is the same:

I found the issue affecting me. It's the translog that is not being flushed.

root@elasticsearch1:/data/elasticsearch/touch/nodes/0/indices/tickets_v2/2# du -sh *
33G index
4.0K _state
282G translog

Related to this bug: https://github.com/elastic/elasticsearch/pull/15830
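
(For anyone else checking for this without shell access, translog size per shard also shows up in the indices stats API, and the flush trigger is governed by index.translog.flush_threshold_size, which I believe defaults to 512mb in 2.x:)

curl -XGET 'localhost:9200/tickets_v2/_stats/translog?pretty&level=shards'
curl -XGET 'localhost:9200/tickets_v2/_settings?pretty'   # look for index.translog.flush_threshold_size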

@Christian_Dahlqvist: Shall I flush it with POST /tickets_v2/_flush? What will be its effect on the application? Will the other indices continue to serve requests?
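
(For reference, the exact call I am considering; wait_if_ongoing is my reading of the flush API docs and should make it wait for any flush already running, please correct me if that does not apply to 2.1:)

curl -XPOST 'localhost:9200/tickets_v2/_flush?wait_if_ongoing=true'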