Elasticsearch-HQ screenshot of node analysis
https://github.com/putztzu/Misc_images/blob/master/elasticsearch-hq_Why_ES4.png
Elasticsearch 1.0 RC1
5-node cluster information:
ES-Marvel-openSUSE   4GB RAM / 20GB ES storage  (runs web, logstash, redis and other apps, so given more RAM)
Elasticsearch-1      1GB RAM / 20GB ES storage
Elasticsearch-2      1GB RAM / 20GB ES storage
Elasticsearch-3      1GB RAM / 20GB ES storage
Elasticsearch-4      1GB RAM / 20GB ES storage
Data Description
Apache data, indexed by date
Data Content
Each index should contain more or less the same amount of data, so the
expectation is that on average the shards should be more or less the same
size.
"Normal" behavior observed
The same data has been inserted into this cluster 3 times (of course purge
between each reload)
The first two times data was distributed across all the nodes more or less
evenly.
Anomaly observed
On this third setup of the cluster, an anomaly was observed from the
outset: data usage grew unusually fast on the node the data was being
inserted into (ES-Marvel-openSUSE) and on one of the other nodes
(ELASTICSEARCH-4).
Moreover, after 2 cluster shutdowns and recoveries the problem seems to
have been exacerbated. The unequal data distribution not only persisted,
but it looks like each recovery permanently created additional data across
all nodes.
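To quantify this I've been polling the per-node figures from the cat
allocation endpoint. A minimal watcher sketch, assuming the HTTP API is on
localhost:9200 and the Python requests library is available (both are
assumptions on my side, adjust as needed):

    # watch_allocation.py -- poll per-node shard counts and disk usage
    import time
    import requests

    CAT_ALLOCATION = "http://localhost:9200/_cat/allocation?v"

    while True:
        # _cat/allocation prints one row per node: shard count, disk used,
        # disk available, host/node name
        print(time.strftime("%Y-%m-%d %H:%M:%S"))
        print(requests.get(CAT_ALLOCATION).text)
        time.sleep(60)  # sample once a minute to see which nodes keep growing

Sampling this around a shutdown/recovery is what makes the jump on
ES-Marvel-openSUSE and ELASTICSEARCH-4 easy to see.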
- When a cluster is recovered and no additional raw data is inserted, the
 increase in data storage suggests that additional ES data is being
 created. That may make sense, since it looks like shard re-allocation
 takes place regardless of whether it should have been disabled, and it has
 been posted that it's cheaper to simply copy shards than to do integrity
 checks and re-integrate. I am running Marvel, but according to es-head and
 es-hq the Marvel data is very small compared to the major increases I'm
 seeing, and those shards aren't being allocated to ES-4 anyway. Does the
 increase in used disk storage suggest that obsolete data is not being
 purged? (The allocation-disable / index-size sketch after this list is
 what I've been trying.)
- What determines "balance" with regard to shard allocation? Using es-head,
 I can see that fewer shards might be allocated to the node with the
 fast-shrinking disk space (ELASTICSEARCH-4), but after a while it looks
 like allocation goes back to normal. Note that RAM and CPU capacity are
 equal for all nodes. (A per-node shard tally sketch is after this list.)
- In this kind of situation, is there a recommended remedy? Since this
 appears to be a "runaway" scenario that keeps feeding a node which will
 shortly have no capacity left, I've been considering simply shutting down
 the problem node, purging its data, re-joining it and then hoping the ES
 cluster will re-balance itself. Would that be a recommended procedure
 after verifying that all shards on the problem node have replicas on other
 nodes? If the situation really is "runaway", I don't consider simply
 adding storage to be a viable solution. (A drain-and-rejoin sketch using
 allocation filtering is after this list.)
- The host machine these virtual machines are running on indicates massive
 disk activity, but I am uncertain what to attribute it to. According to
 es-hq, two indices seem to be in the process of being initialized, but
 according to es-head all shards have been allocated and are "green". Since
 no new data is being inserted and all existing shards should be healthy, I
 don't know why there should be any index initialization activity. Update:
 after sitting on es-hq a while, I'm noticing that after shard
 initialization there is a re-allocation, which might be related, but there
 is no easy visibility into which shard this is, which node it's on, or
 whether it really is being re-allocated. (The recovery-listing sketch
 after this list is my attempt at that visibility.)
- Is there a ready-made tool to display (or return) specifically the ES
 overhead data I suspect is being stored on the nodes? So far I've only
 found overall data usage or free space. If nothing like that exists, I
 suspect a workaround could be to query for the shard data size and
 subtract it from the overall storage used. If such a tool does exist, and
 perhaps even breaks down how the space is being used, then maybe I can
 start to understand exactly what is running differently in this cluster.
 (My rough workaround attempt is sketched after this list.)
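On the first point, this is the kind of thing I've been trying around a
full-cluster restart. It assumes the cluster.routing.allocation.enable
transient setting is honored by 1.0 RC1 (older releases used
cluster.routing.allocation.disable_allocation instead) and the HTTP API on
localhost:9200:

    # Disable shard allocation before a full-cluster restart, re-enable it
    # afterwards, then compare per-index store sizes (Marvel vs. logstash
    # indices) to see where the growth actually is.
    import json
    import requests

    ES = "http://localhost:9200"

    def set_allocation(mode):
        # mode: "none" to stop re-allocation during the restart, "all" to resume
        body = {"transient": {"cluster.routing.allocation.enable": mode}}
        return requests.put(ES + "/_cluster/settings", data=json.dumps(body)).json()

    print(set_allocation("none"))   # before shutting the nodes down
    # ... shut down / restart the nodes here ...
    print(set_allocation("all"))    # once all nodes have rejoined

    # Per-index store sizes, to check how much of the growth is Marvel data
    print(requests.get(ES + "/_cat/indices?v").text)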
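For the balance question, this is the per-node tally I've been using; it
assumes the default _cat/shards column layout (index shard prirep state
docs store ip node) and node names without spaces:

    # Count STARTED shards per node, to see whether ELASTICSEARCH-4 really is
    # being given fewer shards than the others.
    from collections import Counter
    import requests

    ES = "http://localhost:9200"

    per_node = Counter()
    other = 0
    for line in requests.get(ES + "/_cat/shards").text.splitlines():
        cols = line.split()
        if len(cols) >= 8 and cols[3] == "STARTED":
            per_node[cols[-1]] += 1     # last column is the node name
        elif cols:
            other += 1                  # unassigned, relocating, etc.

    for node, count in per_node.most_common():
        print("%-25s %d shards" % (node, count))
    print("not started / in transit: %d" % other)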
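On the remedy question, rather than pulling the node cold I've been
considering draining it with shard allocation filtering first; a sketch,
assuming the node.name really is ELASTICSEARCH-4 and that exclude._name
behaves as described in the 1.x docs:

    # Exclude the problem node so its shards are moved off, wait for it to
    # drain, then it should be safe to stop it, wipe its data path and rejoin.
    import json
    import time
    import requests

    ES = "http://localhost:9200"
    NODE = "ELASTICSEARCH-4"

    body = {"transient": {"cluster.routing.allocation.exclude._name": NODE}}
    print(requests.put(ES + "/_cluster/settings", data=json.dumps(body)).json())

    while True:
        drained = False
        for line in requests.get(ES + "/_cat/allocation").text.splitlines():
            cols = line.split()
            if cols and cols[-1] == NODE:
                print(line)
                drained = (cols[0] == "0")   # first column is the shard count
        if drained:
            break
        time.sleep(30)

    # Afterwards, clear the filter by setting exclude._name back to "".
    print("Node drained; stop it, purge its data directory, then rejoin it.")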
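For visibility into what is actually initializing or relocating, I've been
using the cluster health counters plus _cat/recovery (which I believe is
part of the 1.0 cat API; adjust if not):

    # Show how many shards are initializing/relocating, then the per-shard
    # recovery detail: which shard, which source and target node, which stage.
    import requests

    ES = "http://localhost:9200"

    health = requests.get(ES + "/_cluster/health").json()
    print("initializing: %s  relocating: %s  unassigned: %s"
          % (health["initializing_shards"], health["relocating_shards"],
             health["unassigned_shards"]))

    print(requests.get(ES + "/_cat/recovery?v").text)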
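And this is my rough attempt at the subtraction workaround from the last
point. The field names assume the 1.x node stats format, and the
"difference" column also includes the OS and anything else on the same
disk, so it is only a ballpark figure:

    # Per node: filesystem usage from the node stats API minus the Lucene store
    # size ES reports for the shards on that node.
    import requests

    ES = "http://localhost:9200"

    stats = requests.get(ES + "/_nodes/stats").json()

    for node_id, node in stats["nodes"].items():
        name = node.get("name", node_id)
        store = node["indices"]["store"]["size_in_bytes"]    # shard data ES knows about
        fs_total = node["fs"]["total"]["total_in_bytes"]
        fs_free = node["fs"]["total"]["free_in_bytes"]
        fs_used = fs_total - fs_free                         # everything on the disk
        print("%-25s fs used: %12d  shard store: %12d  difference: %12d"
              % (name, fs_used, store, fs_used - store))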
I'm speculating that something may not have been set up properly in this
cluster from the beginning, but I am uncertain how to analyze exactly what
the problem is. I've posted the elasticsearch-hq screenshot at the top of
this post for reference, but if someone can suggest a command to extract
further useful information, I'm open to it.
Thankfully this cluster is a lab, so I'm treating this as a learning
experience, but if this occurred in a larger production cluster I imagine
it would be setting off alarm bells.
Thx
Tony