I am seeing some erroneous behavior in my ES cluster when performing
aggregations. Originally, I thought this was specific to a histogram as
that is where the error first appeared (in a K3 graph - see my post
https://groups.google.com/forum/#!topic/elasticsearch/iY-lKjtW7PM for
reference), but I have since been able to reproduce the problem with a simple
max aggregation. The details are as follows:
ES Version: 1.4.4
Topology: 5 nodes, 5 shards per index, 2 replicas
OS: Redhat Linux
To reproduce the issue, I execute the following query against the cluster:
{
  "query": {
    "term": {
      "metric": "used"
    }
  },
  "aggs": {
    "max_val": {
      "max": {
        "field": "metric_value"
      }
    }
  }
}
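For completeness, I am submitting it more or less like this (the host and index name here are placeholders for my actual values):

curl -XPOST 'http://localhost:9200/my-index-2015.03.01/_search?pretty' -d '
{
  "query": { "term": { "metric": "used" } },
  "aggs": { "max_val": { "max": { "field": "metric_value" } } }
}'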
Upon executing this query multiple times, I get different responses. One
time I get the expected result:
...
"took": 13,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 11712,
"max_score": 9.361205,
...
"aggregations": { "max_val": { "value": 18096380}}
whereas on another request with the same query I get the following bad
response:
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 11712,
"max_score": 9.361205,
...
"aggregations": { "max_val": { "value": 4697741490703565000}}
Some possibly relevant observations:
- In my first set of tests, I was consistently getting the correct result for
the first 2 requests and the bad result on the 3rd request (with no one else
executing this query at that point in time).
- Flushing the cache did not correct the issue (see the commands after this list).
- I reduced the number of replicas to 0 and consistently got the same result
(which happened to be the correct one; again, see the commands after this list).
- After increasing the replica count back to 2 and waiting until ES reported
that replication was complete, I tried the same experiment. This time, the
1st request returned the correct result and the next 2 requests returned
incorrect results. In this case the incorrect results were not identical, but
both were huge and of the same order of magnitude.
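In case it matters, the cache flush and replica changes mentioned above were done with the standard APIs, roughly like this (index name is a placeholder for my actual index):

curl -XPOST 'http://localhost:9200/my-index-2015.03.01/_cache/clear'
curl -XPUT 'http://localhost:9200/my-index-2015.03.01/_settings' -d '{ "index": { "number_of_replicas": 0 } }'
curl -XPUT 'http://localhost:9200/my-index-2015.03.01/_settings' -d '{ "index": { "number_of_replicas": 2 } }'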
Other info:
- The size of the index was about 3.3 GB with ~50M documents in it.
- This is one of many date-based indices (i.e. similar to the logstash index
setup), but the only one in this installation that exhibited the issue. I
believe we saw something similar in a UAT environment as well, where 1 or 2
of the indices acted in this weird manner.
- ES reported the entire cluster as green.
It seems that some shard copy (primary or replica) is being corrupted during
replication and we are being routed to that copy on every 3rd request. (Is
this somehow correlated to the number of replicas, given that 2 replicas plus
the primary gives 3 copies of each shard?)
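To try to narrow down which copy returns the bad value, my plan is to pin requests to specific shard copies using the preference parameter, along these lines (index name and shard number are just examples):

curl -XPOST 'http://localhost:9200/my-index-2015.03.01/_search?preference=_primary' -d '<same query as above>'
curl -XPOST 'http://localhost:9200/my-index-2015.03.01/_search?preference=_shards:0' -d '<same query as above>'

The idea being that if one copy is corrupt, pinning requests to it should make the bad value reproducible on every request rather than every 3rd one.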
So, my questions are:
- Has anyone seen this type of behavior before?
- Can it somehow be data dependent?
- Is there any way to figure out what happened/what is happening?
- Why does ES report the cluster state as green?
- How can I debug this?
- How can I prevent/correct this?
Any and all help/pointers would be greatly appreciated.
Thanks in advance,
MC