Remedying improper data allocation across some nodes?


(Tony Su) #1

Elasticsearch-HQ screenshot of node analysis
https://github.com/putztzu/Misc_images/blob/master/elasticsearch-hq_Why_ES4.png

Elasticsearch 1.0 RC1

5-Node Cluster information
ES-Marvel-openSUSE
(Runs web, logstash, redis and other apps so given more RAM)
4GB RAM / 20GB ES Storage
Elasticsearch-1
1GB RAM / 20GB ES Storage
Elasticsearch-2
1GB RAM / 20GB ES Storage
Elasticsearch-3
1GB RAM / 20GB ES Storage
Elasticsearch-4
1GB RAM / 20GB ES Storage

Data Description
Apache data, indexed by date

Data Content
Each index should hold more or less the same amount of data, so on average
the shards should therefore be more or less the same size.

"Normal" behavior observed
The same data has been inserted into this cluster 3 times (purging, of
course, between each reload).
The first two times, data was distributed more or less evenly across all
the nodes.

Anomaly observed
The third (current) time the cluster was set up, an anomaly was observed
from the outset: data usage grew unusually fast on the node the data was
being inserted into (ES-Marvel-openSUSE) and on one of the other nodes
(ELASTICSEARCH-4).

Moreover, after 2 cluster shutdowns and recoveries, the problem seems to
have been exacerbated. The unequal data distribution not only persisted, but
it looks like each recovery permanently created additional data across all
nodes.

  1. When a cluster is recovered and no additional raw data is inserted, the
    increase in data storage suggests that additional ES data is being
    created. That may make sense, since it looks like shard re-allocation
    takes place regardless of whether it should have been disabled, and it
    has been posted that it's cheaper to simply copy shards than to do
    integrity checks and re-integrate. I am running Marvel, but according to
    es-head and es-hq the Marvel data is very small compared to the major
    increases I'm seeing, and those shards aren't being allocated to ES-4.
    Does the increase in used disk storage suggest that obsolete data is
    not being purged?

  2. What determines "balance" regarding shard allocation? Using es-head, I
    can see that fewer shards might be allocated to the node with
    fast-shrinking disk space (ELASTICSEARCH-4), but after a while it looks
    like allocation goes back to normal. Note that RAM and CPU capacity for all
    nodes is equal.

  3. In this kind of situation, is there a recommended remedy? Since this
    looks like a "runaway" scenario that keeps feeding a node that will
    shortly have no capacity left, I've been considering simply shutting down
    the problem node, purging its data, re-joining it and then hoping the ES
    cluster will re-balance itself. Would this be a recommended procedure
    after verifying that all shards on the problem node have replicas on other
    nodes? If the situation is "runaway," I don't consider simply adding
    storage to be a viable solution.

  4. The host machine these virtual machines are running on indicates massive
    disk activity, but I am uncertain what to attribute it to. According to
    es-hq, two indices seem to be in the process of being initialized, but
    according to es-head all shards have been allocated and are "green." Since
    no new data is being inserted and all existing shards should be healthy, I
    don't know why there should be any index initialization activity.
    Update: After watching es-hq for a while, I'm noticing that after shard
    initialization there is a re-allocation, which might be related. But there
    is no easy visibility into which shard this is, which node it is on, and
    whether it really is being re-allocated.

  5. Is there a ready tool to display (or return) specifically the ES
    overhead data I suspect is being stored on nodes? So far I've only found
    overall data usage or free space. If none is available, I suspect a
    workaround could be to query for the shard data size(?) and then subtract
    it from the overall storage used. If such a tool exists, and perhaps even
    breaks down how the storage is being used, then maybe I can start to
    understand exactly what may be running differently in this cluster.
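
The workaround described in item 5 (sum the shard store sizes per node and
subtract them from the disk space actually used) could be sketched roughly as
follows. This is only an illustration under assumptions: the input is taken
to be whitespace-separated rows of "index shard prirep state docs store ip
node" with the store size in plain bytes, which is what something like
`_cat/shards?bytes=b` might return (the exact columns and the `bytes`
parameter should be verified against the running version). The index names,
IP addresses and disk-used figures below are made up.

```python
from collections import defaultdict

# Hypothetical sample in the assumed "_cat/shards"-like layout:
# index shard prirep state docs store_bytes ip node
SAMPLE_SHARDS = """
apache-2014.01.20 0 p STARTED 1000 500000 10.0.0.1 ES-Marvel-openSUSE
apache-2014.01.20 0 r STARTED 1000 500000 10.0.0.5 Elasticsearch-4
apache-2014.01.21 0 p STARTED 1200 600000 10.0.0.5 Elasticsearch-4
apache-2014.01.21 0 r UNASSIGNED
"""

# Hypothetical bytes used under each node's ES data path (e.g. from `du`).
SAMPLE_DISK_USED = {"ES-Marvel-openSUSE": 800000, "Elasticsearch-4": 2500000}


def shard_bytes_per_node(cat_shards_text):
    """Sum the store size of all STARTED shard copies on each node."""
    totals = defaultdict(int)
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        # Unassigned or initializing rows carry no usable store size.
        if len(fields) < 8 or fields[3] != "STARTED":
            continue
        node = " ".join(fields[7:])
        totals[node] += int(fields[5])
    return dict(totals)


def overhead_per_node(disk_used_bytes, shard_totals):
    """Disk used minus shard data: a rough estimate of non-shard 'overhead'."""
    return {
        node: used - shard_totals.get(node, 0)
        for node, used in disk_used_bytes.items()
    }


shard_totals = shard_bytes_per_node(SAMPLE_SHARDS)
overhead = overhead_per_node(SAMPLE_DISK_USED, shard_totals)
for node in sorted(overhead):
    print(node, "shard bytes:", shard_totals.get(node, 0),
          "unaccounted bytes:", overhead[node])
```

A large or growing "unaccounted" figure on one node would at least narrow
down where to look, although it cannot by itself say what the extra data is.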

I am speculating that something may not have been set up properly in this
cluster from the beginning, but am uncertain how to analyze exactly what
the problem is. I have posted the elasticsearch-hq screenshot at the top of
this post for reference, but if someone can suggest a command to extract
further useful information, I'm open to it.
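
As one possible way to get more visibility, the shard layout could be
summarized per node, together with the check from item 3 (does every shard on
the problem node have a started copy elsewhere?). The sketch below makes the
same assumptions as any `_cat/shards`-style listing: rows of "index shard
prirep state docs store ip node". The sample rows and helper names are
hypothetical, not output from this cluster.

```python
from collections import defaultdict

# Hypothetical sample rows: index shard prirep state docs store ip node
SAMPLE_SHARDS = """
apache-2014.01.20 0 p STARTED 1000 500000 10.0.0.1 ES-Marvel-openSUSE
apache-2014.01.20 0 r STARTED 1000 500000 10.0.0.5 Elasticsearch-4
apache-2014.01.21 0 p STARTED 1200 600000 10.0.0.5 Elasticsearch-4
apache-2014.01.21 0 r UNASSIGNED
"""


def copies_by_shard(cat_shards_text):
    """Map (index, shard number) to the set of nodes holding a STARTED copy."""
    copies = defaultdict(set)
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[3] != "STARTED":
            continue  # ignore copies that are not fully started
        copies[(fields[0], int(fields[1]))].add(" ".join(fields[7:]))
    return dict(copies)


def shards_per_node(copies):
    """Count started shard copies on each node to expose an imbalance."""
    counts = defaultdict(int)
    for nodes in copies.values():
        for node in nodes:
            counts[node] += 1
    return dict(counts)


def shards_only_on(copies, node):
    """Shards whose only started copy lives on `node` (unsafe to purge)."""
    return sorted(key for key, nodes in copies.items() if nodes == {node})


copies = copies_by_shard(SAMPLE_SHARDS)
print("copies per node:", shards_per_node(copies))
print("only on Elasticsearch-4:", shards_only_on(copies, "Elasticsearch-4"))
```

If the second list is non-empty, purging that node's data would lose those
shards, so the replicas would need to be allocated and started first.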

Thankfully this cluster is a lab, so I'm treating this as a learning
experience, but if this occurred in a larger production cluster I imagine
it would be setting off alarm bells.

Thx
Tony



(system) #2