Incorrect Aggregations returned from ES

I am seeing some erroneous behavior in my ES cluster when performing
aggregations. Originally, I thought this was specific to a histogram, as
that is where the error first appeared (in a Kibana 3 (K3) graph - see my post
https://groups.google.com/forum/#!topic/elasticsearch/iY-lKjtW7PM for
reference), but I have since been able to reproduce the problem with a
simple max aggregation. The details are as follows:

ES Version: 1.4.4
Topology: 5 nodes, 5 shards per index, 2 replicas
OS: Red Hat Linux

To reproduce the issue I execute the following query against the cluster:

{
  "query": {
    "term": {
      "metric": "used"
    }
  },
  "aggs": {
    "max_val": {
      "max": {
        "field": "metric_value"
      }
    }
  }
}
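
For reference, the query can be rerun repeatedly from the shell roughly as
below (host and index name are placeholders for my date-based indices; the
loop just prints the max_val value each time):

# Run the same search several times and print only the aggregation value.
# Host and index name are placeholders.
for i in 1 2 3 4 5 6; do
  curl -s -XGET 'http://localhost:9200/metrics-2015.03.25/_search' -d '{
    "query": { "term": { "metric": "used" } },
    "aggs": { "max_val": { "max": { "field": "metric_value" } } }
  }' | python -c 'import json,sys; print(json.load(sys.stdin)["aggregations"]["max_val"]["value"])'
done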

Upon executing this query multiple times, I get different responses. One
time I get the expected result:
...
"took": 13,
"timed_out": false,
"_shards": {
  "total": 5,
  "successful": 5,
  "failed": 0
},
"hits": {
  "total": 11712,
  "max_score": 9.361205,
...
"aggregations": { "max_val": { "value": 18096380 } }

whereas on another request with the same query I get the following bad
response:

"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 11712,
"max_score": 9.361205,
...
"aggregations": { "max_val": { "value": 4697741490703565000}}

Some possibly relevant observations:

  1. In my first set of tests, I was consistently getting the correct
    results for the first 2 requests and the bad result on the 3rd request
    (with no one else executing this query at that point in time)
  2. Flushing the cache did not correct the issue
  3. I reduced the number of replicas to 0 (via the index settings API - see
    the sketch after this list) and then consistently got the same result,
    which happened to be the correct one
  4. After increasing the replica count back to 2 and waiting until ES
    reported that the replication was complete, I tried the same experiment.
    This time, the 1st request returned the correct result and the next 2
    requests returned incorrect results. In this case the incorrect results
    were not identical, but both were huge and of the same order of magnitude.
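
For reference, the replica changes in observations 3 and 4 were made through
the standard index settings and cluster health APIs, roughly as sketched
below (the index name is a placeholder):

# Drop to 0 replicas (primaries only), then restore 2 replicas.
curl -XPUT 'http://localhost:9200/metrics-2015.03.25/_settings' -d '{
  "index": { "number_of_replicas": 0 }
}'

curl -XPUT 'http://localhost:9200/metrics-2015.03.25/_settings' -d '{
  "index": { "number_of_replicas": 2 }
}'

# Wait until the cluster is green again before re-running the query.
curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s'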

Other info:

  • The size of the index was about 3.3 GB with ~50M documents in it
  • This is one of many date-based indices (i.e. similar to the logstash
    index setup), but it is the only one in this installation that exhibited
    the issue. I believe we saw something similar in a UAT environment as
    well, where 1 or 2 of the indices acted in this weird manner
  • ES reported the entire cluster as green

It seems as though some shard or replica copy was corrupted during
replication and we were being routed to that copy on every 3rd request. (Is
this somehow correlated to the number of replicas?)
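
A sketch of one way to check this hypothesis, using the search preference
parameter to pin requests to particular shard copies (index name and node id
are placeholders):

# Hit only the primary shards; repeat a few times.
curl -XGET 'http://localhost:9200/metrics-2015.03.25/_search?preference=_primary' -d '{
  "query": { "term": { "metric": "used" } },
  "aggs": { "max_val": { "max": { "field": "metric_value" } } }
}'

# Pin the request to a single node to narrow down which copy misbehaves.
curl -XGET 'http://localhost:9200/metrics-2015.03.25/_search?preference=_only_node:NODE_ID' -d '{
  "query": { "term": { "metric": "used" } },
  "aggs": { "max_val": { "max": { "field": "metric_value" } } }
}'

If the primaries always agree and unpinned requests sometimes differ, one of
the replica copies is likely the source of the bad values.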

So, my questions are:

  1. Has anyone seen this type of behavior before?
  2. Can it somehow be data dependent?
  3. Is there any way to figure out what happened/what is happening?
  4. Why does ES report the cluster state as green?
  5. How can I debug this?
  6. How can I prevent/correct this?

Any and all help/pointers would be greatly appreciated.

Thanks in advance,
MC


Hi,

To me this sounds a lot like an issue I ran into about a week before the
release of 1.0.0. That issue was related to internal memory being reused
within Elasticsearch before the result was read out. It is documented here:
"Aggregations return different counts when invoked twice in a row"
(elastic/elasticsearch issue #5021): https://github.com/elastic/elasticsearch/issues/5021

What I did back then was create a reproducible test that showed the issue.

I doubt it has to do with your replicas being inconsistent, especially since
you turned replication off and then back on: that copies the files from your
primaries to the new replicas.

-- Nils
Here is the test I created in the past:
https://gist.github.com/thanodnl/8803745


Nils: It looks different from your issue since document counts are correct
here?

MC: I think it is due to "Mapping updates should be synchronous"
(elastic/elasticsearch issue #8688): https://github.com/elastic/elasticsearch/issues/8688.
Has the field you are aggregating on (metric_value) been dynamically mapped?
The only way to prevent this issue is to map fields explicitly instead of
relying on dynamic mappings.
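
For illustration, one way to map that field explicitly for date-based
indices is an index template, sketched below (the template name, index
pattern, type name, and the choice of double are assumptions, not taken from
MC's setup):

# Apply an explicit mapping to every new date-based index so the field type
# never depends on which document happens to be indexed first.
# Template name, index pattern, type name, and field type are placeholders.
curl -XPUT 'http://localhost:9200/_template/metrics_template' -d '{
  "template": "metrics-*",
  "mappings": {
    "metric": {
      "properties": {
        "metric_value": { "type": "double" }
      }
    }
  }
}'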


--
Adrien
