Unbalanced disk usage - ES 2.1

(Varun Arora) #1

On a 5-node cluster, the shards are balanced, but the disk space used is not. Checking the directories, I found that the shard directories take up around 10 times the reported shard size on disk.

tickets_v2 2 p STARTED 1688677 30.5gb Occulus
tickets_v2 2 r STARTED 1688677 28.9gb Lila Cheney
tickets_v2 4 p STARTED 1690046 34.6gb Lila Cheney
tickets_v2 4 r STARTED 1690046 30.1gb Equinox
tickets_v2 1 r STARTED 1687292 26.9gb Occulus
tickets_v2 1 p STARTED 1687292 30.1gb Cowgirl
tickets_v2 3 p STARTED 1688535 31.3gb Mastermind
tickets_v2 3 r STARTED 1688535 27.9gb Equinox
tickets_v2 0 r STARTED 1688199 28.6gb Mastermind
tickets_v2 0 p STARTED 1688198 30.2gb Cowgirl

root@elasticsearch2:/data/elasticsearch/touch/nodes/0/indices/tickets_v2# du -sh *
299G 3
300G 4
8.0K _state

On the other node it is:

root@elasticsearch1:/data/elasticsearch/touch/nodes/0/indices/tickets_v2# du -sh *
28G 1
302G 2
8.0K _state
Why would this be, and are there any solutions to it?

(Mark Walkom) #2

Elasticsearch only balances by shard count, not size.

(Christian Dahlqvist) #3

That looks like a big difference. Are all nodes running exactly the same version?

(Varun Arora) #4

Yes, all nodes are running the same version: 2.1.1.

(Christian Dahlqvist) #5

Based on the shard listing, it looks like data is evenly distributed across the nodes, as each node has 2 shards that are all similar in size.
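To make that check concrete, here is a minimal Python sketch that sums the reported store size per node from the listing in the first post. The column layout (index, shard, primary/replica, state, docs, store, node) is inferred from the pasted output, and `store_gb_per_node` is a hypothetical helper name, not part of any Elasticsearch client.

```python
from collections import defaultdict

# Shard listing copied from the first post. Assumed columns:
# index, shard, prirep, state, docs, store, node.
LISTING = """\
tickets_v2 2 p STARTED 1688677 30.5gb Occulus
tickets_v2 2 r STARTED 1688677 28.9gb Lila Cheney
tickets_v2 4 p STARTED 1690046 34.6gb Lila Cheney
tickets_v2 4 r STARTED 1690046 30.1gb Equinox
tickets_v2 1 r STARTED 1687292 26.9gb Occulus
tickets_v2 1 p STARTED 1687292 30.1gb Cowgirl
tickets_v2 3 p STARTED 1688535 31.3gb Mastermind
tickets_v2 3 r STARTED 1688535 27.9gb Equinox
tickets_v2 0 r STARTED 1688199 28.6gb Mastermind
tickets_v2 0 p STARTED 1688198 30.2gb Cowgirl
"""


def store_gb_per_node(listing):
    """Return {node name: total reported store size in gb}."""
    totals = defaultdict(float)
    for line in listing.splitlines():
        if not line.strip():
            continue
        # Node names may contain spaces ("Lila Cheney"), so split at most
        # 6 times and keep the remainder of the line as the node name.
        _index, _shard, _prirep, _state, _docs, store, node = line.split(None, 6)
        if not store.endswith("gb"):
            raise ValueError(f"unexpected store unit in: {line!r}")
        totals[node.strip()] += float(store[:-2])
    return dict(totals)


for node, gb in sorted(store_gb_per_node(LISTING).items()):
    print(f"{node}: {gb:.1f}gb")
```

For this listing the per-node totals all fall between roughly 57gb and 64gb, which is why the reported store sizes look balanced even though `du` shows about 300G for some shard directories.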

(Varun Arora) #6

Yes. Would adding nodes and relocating the shards to the new ones help? Or would that just move the shards from this disk to another node?

Do you see any other way to alleviate this?

Could it be something in 2.x? We have not seen this issue with 5.x.

(Varun Arora) #7

Would the merge API help here?

What I see in the New Relic plugin is that the number of documents is the same on all boxes.

(Christian Dahlqvist) #8

What does GET /_nodes/stats/indices give?

(Varun Arora) #9

Is there any place where I can paste the output? It's too long.

(Christian Dahlqvist) #10

Put it in a gist and link to it here.

(Varun Arora) #11

Here it is:

(Christian Dahlqvist) #12

Based on that, the data seems reasonably evenly distributed across the nodes.

(Varun Arora) #13

Any suggestions to alleviate this? Adding a new node, relocating shards, or the merge API?

I am running out of ideas.

(Christian Dahlqvist) #14

I do not see what the problem is. Distribution seems even across the nodes.

(Varun Arora) #15

Any other clue as to why the disk usage would be bloated for a shard? Do you think it's an issue related to 2.x?

(Christian Dahlqvist) #16

I do not know, as I have not used version 2.x in quite some time. It may help if you can identify which types of files make up the difference.
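One way to do that is to total file sizes under a single shard directory, grouped by top-level subdirectory and by file extension. The Python sketch below is only an illustration: `usage_by_group` and `report` are hypothetical helper names, and the script sums apparent file sizes (`st_size`), so its numbers can differ somewhat from the block counts that `du` reports.

```python
import os
import sys
from collections import Counter


def usage_by_group(shard_dir):
    """Return (bytes per top-level subdirectory, bytes per file extension)."""
    by_subdir = Counter()
    by_ext = Counter()
    for root, _dirs, files in os.walk(shard_dir):
        rel = os.path.relpath(root, shard_dir)
        # Files directly in shard_dir are grouped under ".".
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in files:
            path = os.path.join(root, name)
            try:
                size = os.lstat(path).st_size  # apparent size, not disk blocks
            except OSError:
                continue  # the file disappeared while we were walking
            by_subdir[top] += size
            by_ext[os.path.splitext(name)[1] or "(none)"] += size
    return by_subdir, by_ext


def report(shard_dir):
    """Print both groupings, largest first."""
    by_subdir, by_ext = usage_by_group(shard_dir)
    for title, counter in (("subdirectory", by_subdir), ("extension", by_ext)):
        print(f"-- bytes by {title} --")
        for key, size in counter.most_common():
            print(f"{size / 2**30:10.2f} GiB  {key}")


# Only runs when a directory is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    report(sys.argv[1])
```

It could be pointed at a shard directory such as `/data/elasticsearch/touch/nodes/0/indices/tickets_v2/2` from the first post, run as a user that can read the data path.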

(Wédney Yuri) #17

Hi @Christian_Dahlqvist,

I'm experiencing a similar issue in Elasticsearch 6.2.3. The shard size is not equal across nodes.

The primary shard is using 57.8gb of storage, while the replicas are using 264.5gb.

(Wédney Yuri) #18

In the graph below you can see the difference between the master and the replicas. This index contains only one shard.



cluster.name: ${CLUSTER_NAME}
cluster.routing.allocation.awareness.attributes: aws_availability_zone
cloud.node.auto_attributes: true
plugin.mandatory: discovery-ec2,repository-s3
transport.tcp.compress: true
indices.queries.cache.size: 30%
indices.requests.cache.size: 20%
indices.memory.index_buffer_size: 20%
indices.memory.max_index_buffer_size: 512mb
action.auto_create_index: false
action.destructive_requires_name: true
node.master: ${ES_NODE_MASTER}
node.data: ${ES_NODE_DATA}
bootstrap.memory_lock: true
http.cors.enabled: true
http.cors.allow-origin: '*'
discovery.zen.minimum_master_nodes: ${SPLIT_BRAIN_NODES}
discovery.ec2.tag.cluster: ${CLUSTER_NAME}
discovery.ec2.endpoint: ec2.${AWS_REGION}.amazonaws.com
discovery.zen.ping_timeout: 30s
discovery.zen.hosts_provider: ec2

(Varun Arora) #19

In my case, even that is the same:

(Varun Arora) #20

I found the issue affecting me. It's the translog, which is not being flushed.

root@elasticsearch1:/data/elasticsearch/touch/nodes/0/indices/tickets_v2/2# du -sh *
33G index
4.0K _state
282G translog

It is related to this bug: https://github.com/elastic/elasticsearch/pull/15830

@Christian_Dahlqvist: Shall I flush it with POST /tickets_v2/_flush? What effect will that have on the application? Would the other indices continue to serve requests?
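Before flushing, the same diagnosis can be cross-checked without shell access by comparing translog and store sizes per shard copy through the indices stats API. The Python sketch below is read-only and rests on assumptions: the base URL passed to `fetch_shard_stats` is whatever your cluster answers on, both function names are hypothetical, and the response shape (`indices.<index>.shards.<id>[]` with `routing`, `store` and `translog` objects) is what the stats API is assumed to return with `level=shards`. Verify it against your version before relying on it.

```python
import json
import urllib.request


def fetch_shard_stats(base_url, index):
    """GET /<index>/_stats/translog,store?level=shards and parse the JSON body."""
    url = f"{base_url}/{index}/_stats/translog,store?level=shards"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def translog_vs_store(stats, index):
    """Return one row per shard copy, largest translog first.

    Assumes the shard-level layout described in the lead-in.
    """
    rows = []
    for shard_id, copies in stats["indices"][index]["shards"].items():
        for copy in copies:
            rows.append({
                "shard": shard_id,
                "primary": copy["routing"]["primary"],
                "store_bytes": copy["store"]["size_in_bytes"],
                "translog_bytes": copy["translog"]["size_in_bytes"],
            })
    return sorted(rows, key=lambda r: r["translog_bytes"], reverse=True)
```

A call such as `translog_vs_store(fetch_shard_stats("http://localhost:9200", "tickets_v2"), "tickets_v2")` (the host and port here are placeholders) would list the shard copies whose translog is largest relative to the Lucene store, which is the pattern the `du` output above shows for shard 2.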