Inconsistent responses from aggregations (ES1.0.0RC1)

Hi,

I have been tinkering with Elasticsearch 1.0.0.RC1 for a bit, especially the
aggregations. Looking more closely at the aggregation responses, I
noticed that the numbers fluctuate all the time.

I have an index:
shards: 10
replicas: 0
documents: ~1M

Currently I'm not ingesting data anymore.

When I tried to recreate the terms facet as an aggregation, I came up with the
following:

{
  "size": 0,
  "facets": {
    "participants": {
      "terms": {
        "field": "actor.displayName",
        "size": 10
      }
    }
  },
  "aggs": {
    "participants": {
      "terms": {
        "field": "actor.displayName",
        "size": 10
      }
    }
  }
}

This should give me roughly the top 10 (*https://github.com/elasticsearch/elasticsearch/issues/1305)
most frequent terms in the 'actor.displayName' field. The terms facet gives the
same counts over and over again, which is what I expect. However, the
counts from the aggregation are different every time I invoke
it. Results of 3 consecutive runs: https://gist.github.com/thanodnl/8733837.

I'm currently reindexing all the documents into an index with only one shard
to see if that makes a difference.
That would only solve the problem in the short term, though, because our
production load is too big to fit in one shard.

-- Nils

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/49fe3127-84a1-43d6-a298-6e70ee9d038e%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I finished indexing the same dataset in an index with only one shard.

$ curl 'http://localhost:9200/52b1e8c1f8b9d73130000004/_search?pretty=true' -d '{
  "size": 0,
  "facets": {
    "participants": {
      "terms": {
        "field": "actor.displayName",
        "size": 10
      }
    }
  },
  "aggs": {
    "participants": {
      "terms": {
        "field": "actor.displayName",
        "size": 10
      }
    }
  }
}'
{
  "took" : 1377,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1060387,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "facets" : {
    "participants" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 1129848,
      "other" : 1111270,
      "terms" : [ {
        "term" : "totaltrafficbos",
        "count" : 3599
      }, {
        "term" : "mai93thm",
        "count" : 2517
      }, {
        "term" : "mai95thm",
        "count" : 2207
      }, {
        "term" : "mai90thm",
        "count" : 2207
      }, {
        "term" : "totaltrafficnyc",
        "count" : 1660
      }, {
        "term" : "confessions",
        "count" : 1534
      }, {
        "term" : "incidentreports",
        "count" : 1468
      }, {
        "term" : "nji80thm",
        "count" : 1180
      }, {
        "term" : "pai76thm",
        "count" : 1142
      }, {
        "term" : "txi35thm",
        "count" : 1064
      } ]
    }
  },
  "aggregations" : {
    "participants" : {
      "buckets" : [ {
        "key" : "totaltrafficbos",
        "doc_count" : 3599
      }, {
        "key" : "mai93thm",
        "doc_count" : 2517
      }, {
        "key" : "mai90thm",
        "doc_count" : 2207
      }, {
        "key" : "mai95thm",
        "doc_count" : 2207
      }, {
        "key" : "totaltrafficnyc",
        "doc_count" : 1660
      }, {
        "key" : "confessions",
        "doc_count" : 1534
      }, {
        "key" : "incidentreports",
        "doc_count" : 1468
      }, {
        "key" : "nji80thm",
        "doc_count" : 1180
      }, {
        "key" : "pai76thm",
        "doc_count" : 1142
      }, {
        "key" : "txi35thm",
        "doc_count" : 1064
      } ]
    }
  }
}

Now the counts are the same as with faceting and, more importantly,
consistent.

It seems the problem lies in aggs across multiple shards. How should I proceed
from here?

-- Nils

Nils,

This is just the nature of splitting data across shards. Actually, the
terms facet has the same limitation (i.e. it will also give "approximate
counts"). Neither the terms facet nor the terms aggregation is better or
worse than the other; they are both approximations (using different
implementations). It is correct that if you put all your data in 1 shard,
then all the counts are exact. If you need to shard, you can increase the
"shard_size" parameter inside the terms aggregation to "improve accuracy".
Play with that number until it suits your purposes, but the important thing
is that the counts are only approximations, more so the more documents you
have in the index, so don't expect exact numbers from them if you have more
than 1 shard.

{
  "size": 0,
  "aggs": {
    "a": {
      "terms": {
        "field": "actor.displayName",
        "shard_size": 10000
      }
    }
  }
}
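
To make the shard_size advice concrete, here is a minimal Python sketch of the merge step. It assumes a simplified model in which every shard reports only its own top shard_size terms and the coordinating node sums whatever it receives. The shard contents and the function names are invented for illustration; this is not Elasticsearch code.

```python
from collections import Counter
from typing import List, Tuple


def shard_top_terms(shard: Counter, shard_size: int) -> List[Tuple[str, int]]:
    """Return the shard's own top `shard_size` terms (ties broken by term)."""
    ranked = sorted(shard.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:shard_size]


def merged_top_terms(shards: List[Counter], size: int, shard_size: int) -> List[Tuple[str, int]]:
    """Sum the partial per-shard lists and keep the overall top `size` terms."""
    merged: Counter = Counter()
    for shard in shards:
        for term, count in shard_top_terms(shard, shard_size):
            merged[term] += count
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]


# Hypothetical per-shard document counts; true totals are a=15, b=28, c=20.
shards = [
    Counter({"a": 10, "b": 9, "c": 1}),
    Counter({"b": 10, "c": 9, "a": 2}),
    Counter({"c": 10, "b": 9, "a": 3}),
]

for shard_size in (1, 2, 3):
    print(shard_size, merged_top_terms(shards, size=2, shard_size=shard_size))
```

In this toy model a small shard_size can drop or undercount terms, and a larger one converges on the exact counts. The model is also deterministic: the same input always produces the same approximate output.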

Hi Binh Ly,

Thanks for the response.

I'm aware that the numbers are not exact (hence the link to issue #1305 in
my initial post), and I have been preparing my colleagues and customers for
slightly incorrect numbers for some time already, ahead of the
moment we provide analytics with ES. But what bothers me is that they are
inconsistent.

If you look at my gist you see that I ran the same aggs 3 times right after
each other. If we just look at the top item we see the following results:

  1. { "key": "totaltrafficbos", "doc_count": 2880 }
  2. { "key": "totaltrafficbos", "doc_count": 2552 }
  3. { "key": "totaltrafficbos", "doc_count": 2179 }

These results were taken within seconds of each other, without any change to the number of documents in the index. If I run the query more often, the count rotates among a handful of numbers. Is this also behavior one would expect from the aggs? And if so, why do the facets show the same number over and over again?
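
A check like this can be automated. The following is a minimal Python sketch, assuming the search responses have already been fetched and parsed into dictionaries; the helper names are hypothetical, and the three sample responses are reduced to the top bucket quoted above.

```python
from typing import List, Optional


def doc_counts_for_key(responses: List[dict], agg_name: str, key: str) -> List[Optional[int]]:
    """Collect the doc_count reported for `key` in each parsed search response."""
    counts: List[Optional[int]] = []
    for response in responses:
        buckets = response["aggregations"][agg_name]["buckets"]
        for bucket in buckets:
            if bucket["key"] == key:
                counts.append(bucket["doc_count"])
                break
        else:
            counts.append(None)  # the key did not make it into this response
    return counts


def is_consistent(counts: List[Optional[int]]) -> bool:
    """True when every run reported the same count."""
    return len(set(counts)) <= 1


def top_bucket_response(doc_count: int) -> dict:
    """Build a stripped-down response holding only the top bucket."""
    bucket = {"key": "totaltrafficbos", "doc_count": doc_count}
    return {"aggregations": {"participants": {"buckets": [bucket]}}}


# The three consecutive runs quoted in this message.
runs = [top_bucket_response(n) for n in (2880, 2552, 2179)]

counts = doc_counts_for_key(runs, "participants", "totaltrafficbos")
print(counts, "consistent" if is_consistent(counts) else "inconsistent")
```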

Anyway, I will try to work my way through the aggs code this weekend to get a better grasp of what we can and cannot do with it.

-- Nils

I've loaded the same dataset into ES 1.0.0.Beta2 with the same index
configuration as at the start of this topic.

Now, however, the numbers are consistent when I call the same aggregation
multiple times in a row AND the numbers match those of the facets.
This leads me to the conclusion that something broke between Beta2 and RC1!

I would like to test this on master, but I could not find any nightly
builds of elasticsearch. Is there a location where they are stored or
should I compile it myself?

To follow up,

I have a self-contained test suite for this problem at https://gist.github.com/thanodnl/8803745. It contains two files:

  1. aggsbug.sh
  2. aggsbug.json

The .json file contains ~1M newline-separated documents to load into the
database. I was not able to create a curl request to load them directly
into the index.
The .sh file (https://gist.github.com/thanodnl/8803745/raw/aggsbug.sh)
contains the instructions for recreating this behavior.

I have run these against the following versions:

  1. 1.0.0.Beta2
  2. 1.0.0.RC1
  3. 1.0.0-SNAPSHOT as compiled from the git 1.0 branch on commit
    0f8b41ffad9b5ecdfd543d7c73edcf404e6fc763

When run on 1.0.0.Beta2, it gives the same output consistently when I run
the _search over and over again.
When run on 1.0.0.RC1, it gives me multiple different outcomes,
comparable to the numbers I posted earlier in the thread.
When run on 1.0.0-SNAPSHOT, it behaves the same as on 1.0.0.RC1.

That it still worked on 1.0.0.Beta2 proves to me that this is a bug that
got into RC1. I could not find any related ticket on the issues page of the
GitHub repository. Hopefully this is enough information to recreate the
problem.

The json file is quite big and could cause trouble when you open the gist in a
browser. Cloning the gist locally will work best:
$ git clone https://gist.github.com/8803745.git

I do not really know how to move on from here. Do you want me to open an
issue for this problem on the elasticsearch GitHub repository? It would
be nice to fix this problem before the release of 1.0.0, since that is the
first release containing the aggregations for analytics.

Sorry, but your file at https://gist.github.com/8803745.git is broken: it
contains invalid JSON, so it cannot be processed.

It would be helpful to provide a script with escaped JSON in bulk format.

I suspect you are not using the keyword analyzer for faceting/agg'ing,
so you will get all kinds of unwanted results. Whether that explains your
fluctuating aggs results, I cannot tell. It is rather uncommon to use
"facets" and "aggs" side by side.

Jörg

Hi,

I have now updated the gist with a file in bulk index format.
I also split the loading from the testing phase, so you can run the test
multiple times in a row.
I also added a README.md with instructions on how to run the test.

I'm also filing a bug report, as described on the Elasticsearch website.
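
For readers who want to build such a bulk file themselves, here is a minimal Python sketch of the conversion. It assumes one JSON document per input line and emits the bulk format's pairs of an action line followed by a source line. The function names are hypothetical, and the index name, type name and sample documents are placeholders chosen for illustration; this is not the script from the gist.

```python
import json
from typing import Iterable, Iterator


def to_bulk_lines(docs: Iterable[str], index: str, doc_type: str) -> Iterator[str]:
    """Yield bulk-format lines: an action line, then the document source."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    for raw in docs:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between documents
        doc = json.loads(raw)  # raises ValueError if the document is invalid JSON
        yield action
        yield json.dumps(doc)  # re-serialized on one line with proper escaping


def to_bulk_body(docs: Iterable[str], index: str, doc_type: str) -> str:
    """Join the lines into a request body; every line ends with a newline."""
    return "".join(line + "\n" for line in to_bulk_lines(docs, index, doc_type))


# Placeholder input: two documents with a blank line in between.
sample_docs = ['{"a": "TotalTrafficBOS"}', "", '{"a": "MAI93thm"}']
body = to_bulk_body(sample_docs, index="aggsbug", doc_type="messages")
print(body, end="")
```

Parsing each document before writing it out means invalid JSON fails early, at conversion time, rather than partway through a bulk request.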

Thanks. I tried to reproduce it on 1.0.0.RC2, but without success.

curl '0:9200/aggsbug/_mapping?pretty'
{
  "aggsbug" : {
    "mappings" : {
      "messages" : {
        "properties" : {
          "a" : {
            "type" : "string",
            "analyzer" : "keyword"
          }
        }
      }
    }
  }
}

With the "keyword" analyzer, the aggregation works flawlessly here,
with constant results.

curl '0:9200/aggsbug/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "a": {
      "terms": {
        "field": "a",
        "size": 10
      }
    }
  }
}
'
{
  "took" : 669,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 1060387,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "a" : {
      "buckets" : [ {
        "key" : "TotalTrafficBOS",
        "doc_count" : 3599
      }, {
        "key" : "MAI93thm",
        "doc_count" : 2517
      }, {
        "key" : "MAI90thm",
        "doc_count" : 2207
      }, {
        "key" : "MAI95thm",
        "doc_count" : 2207
      }, {
        "key" : "TotalTrafficNYC",
        "doc_count" : 1660
      }, {
        "key" : "incidentreports",
        "doc_count" : 1468
      }, {
        "key" : "NJI80thm",
        "doc_count" : 1180
      }, {
        "key" : "PAI76thm",
        "doc_count" : 1142
      }, {
        "key" : "TXI35thm",
        "doc_count" : 1064
      }, {
        "key" : "NYI87thm",
        "doc_count" : 1029
      } ]
    }
  }
}

Jörg

Also the same with shards = 3 and analyzer = standard. Stable results.

{
  "took" : 240,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 1060387,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "a" : {
      "buckets" : [ {
        "key" : "totaltrafficbos",
        "doc_count" : 3599
      }, {
        "key" : "mai93thm",
        "doc_count" : 2517
      }, {
        "key" : "mai90thm",
        "doc_count" : 2207
      }, {
        "key" : "mai95thm",
        "doc_count" : 2207
      }, {
        "key" : "totaltrafficnyc",
        "doc_count" : 1660
      }, {
        "key" : "confessions",
        "doc_count" : 1534
      }, {
        "key" : "incidentreports",
        "doc_count" : 1468
      }, {
        "key" : "nji80thm",
        "doc_count" : 1180
      }, {
        "key" : "pai76thm",
        "doc_count" : 1142
      }, {
        "key" : "txi35thm",
        "doc_count" : 379
      } ]
    }
  }
}

You should check your log files to see whether your ES cluster was able to
process all the docs correctly while indexing or searching; maybe you
encountered OOMs or other subtle issues.

Jörg

I only tested it with 1 and with 10 shards. Indeed, with 1 shard there were
no issues, while with 10 shards there are issues all the time.
I also had a colleague test it with the two scripts in the gist (which
use 10 shards).

Also I do not think the analyzer should have impact, since it would only
index more terms on that field if it tokenizes it. Can you use the
aggsbug.load.sh to load the data? And than use aggsbug.test.sh to run the
test? It should give you a field analyzed with the default analyzer and 10
shards.

I'll try out some different analyzers, and loading the data in 3 shards now
to see if that changes things.
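
For anyone who wants to check run-to-run stability without the gist, here is
a minimal Python sketch of the idea: run the same query a few times and
compare the buckets. It is not the aggsbug.test.sh script; the names
bucket_signature, is_stable and fake_stable_query are made up for this
example, and the fake query stands in for a real HTTP call to _search.

```python
def bucket_signature(response):
    """Reduce a search response to {agg_name: [(key, doc_count), ...]}."""
    return {
        name: [(b["key"], b["doc_count"]) for b in agg["buckets"]]
        for name, agg in response["aggregations"].items()
    }


def is_stable(run_query, runs=3):
    """Run the query `runs` times; True if every run returned the same buckets."""
    signatures = [bucket_signature(run_query()) for _ in range(runs)]
    return all(sig == signatures[0] for sig in signatures)


def fake_stable_query():
    # Hypothetical stand-in for an HTTP call to the _search endpoint.
    return {"aggregations": {"a": {"buckets": [
        {"key": "totaltrafficbos", "doc_count": 3599},
        {"key": "mai93thm", "doc_count": 2517},
    ]}}}


print(is_stable(fake_stable_query))
```

Replacing fake_stable_query with a function that really queries the index
would show whether the bucket lists drift between consecutive runs.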

I didn't manage to reproduce the issue locally either. What JVM / OS are
you using (RC1 introduced Unsafe to perform String comparisons in terms
aggs so I'm wondering if that could be related to your issue)?

--
Adrien Grand

Hi Adrien,

I'm using OSX (Mavericks) and java: (having the issue)

$ java -version
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

My colleague is running OSX (Lion) and java: (having the issue)

$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11D50)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)

A server soon to be used for production, running Ubuntu 12.04 LTS, with
java: (not having the issue)

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

Could this be an issue with Java on OSX then?

I just installed 1.7u25 on a Mac with Mavericks to try to reproduce the
issue, but without success (on 1.0.0-RC2).

--
Adrien Grand

Thanks for the effort.

I tried running on 1.7.0_51, and it gave me the same issue.

I was trying to find out if I could disable these unsafe string comparisons,
but could not really find where that should be disabled. Is there an easy
way for me to switch that change back? Do you know in which commit this was
changed, so I can revert that commit in my local clone of the repo and do a
build to see if the problem is solved that way?

As for reproducing, I do not really see what could impact this besides the
OS and Java version. And the other OSX machine had a different version of
both the OS and Java, and still showed the same results.

I am, however, a bit more relaxed now that the issue is not showing up on
our production machines; that would have killed the ES migration we are
currently doing. It is unfortunate, though, that we cannot test our stuff
on our development machines (all showing the issue here).

Do you have any thoughts on what could be different between our setups,
such that we are having the issue and you are not?

To make sure: did you use my scripts to load the data? Jörg seemed to load
the data in a different way (different shard count and different mapping),
which did not show the issue here.

On Wed, Feb 5, 2014 at 6:01 PM, Nils Dijk me@thanod.nl wrote:

> I was trying to find out if I could disable these unsafe
> string comparisons, but could not really find where that should be
> disabled. Is there an easy way for me to switch that change back? Do you
> know in which commit this was changed, so I can revert that commit in my
> local clone of the repo and do a build to see if the problem is solved
> that way?

Sure, this was changed in 4271d573d60f39564c458e2d3fb7c14afb82d4d8. However,
I also just read that you can't reproduce the issue with one shard, although
this shouldn't be relevant.

> As for reproducing, I do not really see what could impact this besides
> the OS and Java version. And the other OSX machine had a different version
> of both the OS and Java, and still showed the same results.
>
> I am, however, a bit more relaxed now that the issue is not showing up on
> our production machines; that would have killed the ES migration we are
> currently doing. It is unfortunate, though, that we cannot test our stuff
> on our development machines (all showing the issue here).
>
> Do you have any thoughts on what could be different between our setups,
> such that we are having the issue and you are not?

I wish I had ideas! :)

Since the issue seems to reproduce consistently for you, it would be super
helpful if you could git bisect in order to find the commit that broke
aggregations in your setup (the Beta2 commit is 296cfbe3 and the RC1
commit is 2c8ee3fb).
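
To illustrate what such a bisect does, here is a small Python sketch of the
binary search behind it. Only the two endpoint hashes come from this thread;
the intermediate commit names (c1 to c5), the position of the breaking
commit and the is_bad check are hypothetical placeholders. With a real clone
you would instead run `git bisect start <bad> <good>` and mark each build as
good or bad.

```python
def first_bad_commit(commits, is_bad):
    """Find the first bad commit in a history ordered oldest to newest.

    Assumes commits[0] is known good, commits[-1] is known bad, and every
    commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # breakage happened at mid or earlier
        else:
            lo = mid  # breakage happened after mid
    return commits[hi]


# Endpoints from the thread; c1..c5 are invented stand-ins for real commits.
history = ["296cfbe3", "c1", "c2", "c3", "c4", "c5", "2c8ee3fb"]
tested = []


def is_bad(commit):
    # Hypothetical check: pretend the aggregation test fails from c4 onwards.
    tested.append(commit)
    return history.index(commit) >= 4


first_bad = first_bad_commit(history, is_bad)
print(first_bad, "found after testing", tested)
```

Each step halves the remaining range, so even a few hundred commits between
Beta2 and RC1 would need fewer than ten builds.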

> To make sure: did you use my scripts to load the data? Jörg seemed to load
> the data in a different way (different shard count and different mapping),
> which did not show the issue here.

Yes, I used your scripts, exactly as described in the README.

--
Adrien Grand

OK, I was preparing for a long bisecting session, but I started with the
commit you highlighted (4271d573d60f39564c458e2d3fb7c14afb82d4d8) and
the commit before that one (6481a2fde858520988f2ce28c02a15be3fe108e4). And
as it turns out, it is the breaking commit.

If I build your commit from December 3, it fails my test suite.
If I build Nik's commit from January 6, it still passes my test.

I also tried reverting your commit on the v1.0.0.RC1 tag, but it gave me
all kinds of conflicts, so I could not test RC1 without your commit.

If you would like, I can still do a full bisect, but I suspect I would end
up at your commit, since I tested that one and the one before it.

Would it be possible for you to send a .patch without the unsafe stuff, so
I can apply that to a commit and make a build?

Thanks in advance,

Nils, I ran the test on my Mac, and I can reproduce the issue. And also on
Linux.

Unfortunately the Mac locked up and I had to cold reboot, and my copy/paste
logs are gone with all the numbers, but anyway.

As a matter of fact, your aggregates demo is daunting.

On the Mac, it shows different counts even between the first and the
subsequent executions. The counts of the first are lower, and also, even
different terms show up. On Linux, I do not observe different counts
between runs.

But what bothers me more is that I observed different results depending on
the shard count, and that is both on Mac and Linux. The higher a bucket
ranks by hit count, the better the counts match; only the lower buckets
differ, so the deviating counts are somewhat hard to notice.

I use Java 8 FCS, but since you observe this issue also on Java 7, I think
it is not an issue of Java 8. And it's both on Mac and Linux, but with
different symptoms.

ES 1.0.0.RC2
Mac OS X 10.8.5
Darwin Jorg-Prantes-MacBook-Pro.local 12.5.0 Darwin Kernel Version 12.5.0:
Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b128)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode)
G1GC enabled

ES 1.0.0.RC2
RHEL 6.3
Linux zephyros 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012
x86_64 x86_64 x86_64 GNU/Linux
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b128)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode)
G1GC enabled

Here are two Linux examples. Note, the last three terms and counts are
different.

shards=10

{
"took" : 143,
"timed_out" : false,
"_shards" : {
"total" : 10,
"successful" : 10,
"failed" : 0
},
"hits" : {
"total" : 1060387,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"a" : {
"buckets" : [ {
"key" : "totaltrafficbos",
"doc_count" : 3599
}, {
"key" : "mai93thm",
"doc_count" : 2517
}, {
"key" : "mai90thm",
"doc_count" : 2207
}, {
"key" : "mai95thm",
"doc_count" : 2207
}, {
"key" : "totaltrafficnyc",
"doc_count" : 1660
}, {
"key" : "confessions",
"doc_count" : 1534
}, {
"key" : "incidentreports",
"doc_count" : 1468
}, {
"key" : "nji80thm",
"doc_count" : 1071
}, {
"key" : "pai76thm",
"doc_count" : 1039
}, {
"key" : "txi35thm",
"doc_count" : 357
} ]
}
}
}

shards=5

{
"took" : 172,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 1060387,
"max_score" : 0.0,
"hits" : [ ]
},
"aggregations" : {
"a" : {
"buckets" : [ {
"key" : "totaltrafficbos",
"doc_count" : 3599
}, {
"key" : "mai93thm",
"doc_count" : 2517
}, {
"key" : "mai90thm",
"doc_count" : 2207
}, {
"key" : "mai95thm",
"doc_count" : 2207
}, {
"key" : "totaltrafficnyc",
"doc_count" : 1660
}, {
"key" : "confessions",
"doc_count" : 1534
}, {
"key" : "incidentreports",
"doc_count" : 1468
}, {
"key" : "nji80thm",
"doc_count" : 1180
}, {
"key" : "pai76thm",
"doc_count" : 936
}, {
"key" : "nji78thm",
"doc_count" : 422
} ]
}
}
}
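
To make the deviation easier to spot, here is a short Python sketch that
diffs the two bucket lists above. The keys and counts are copied from the
two responses; the helper names (buckets, diff_buckets) are made up for this
example.

```python
def buckets(pairs):
    """Build a bucket list in the shape of the responses above."""
    return [{"key": key, "doc_count": count} for key, count in pairs]


def diff_buckets(a, b):
    """Return {key: (count_in_a, count_in_b)} where the counts differ.

    A count of None means the key is absent from that bucket list.
    """
    ca = {bucket["key"]: bucket["doc_count"] for bucket in a}
    cb = {bucket["key"]: bucket["doc_count"] for bucket in b}
    return {
        key: (ca.get(key), cb.get(key))
        for key in sorted(set(ca) | set(cb))
        if ca.get(key) != cb.get(key)
    }


# The seven top buckets are identical in both responses.
common = [
    ("totaltrafficbos", 3599), ("mai93thm", 2517), ("mai90thm", 2207),
    ("mai95thm", 2207), ("totaltrafficnyc", 1660), ("confessions", 1534),
    ("incidentreports", 1468),
]
shards_10 = buckets(common + [("nji80thm", 1071), ("pai76thm", 1039), ("txi35thm", 357)])
shards_5 = buckets(common + [("nji80thm", 1180), ("pai76thm", 936), ("nji78thm", 422)])

diff = diff_buckets(shards_10, shards_5)
for key, (ten, five) in diff.items():
    print(f"{key}: shards=10 -> {ten}, shards=5 -> {five}")
```

Only the last three positions show up in the diff, which matches the
observation that the top buckets agree and the lower ones deviate.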

Jörg

Hi Jörg,

Glad you could reproduce with my updated gist.

cb.

On Wednesday, February 5, 2014 8:18:39 PM UTC+1, Jörg Prante wrote:

> Nils, I ran the test on my Mac, and I can reproduce the issue. And also on
> Linux.
>
> Unfortunately the Mac locked up and I had to cold reboot, and my
> copy/paste logs are gone with all the numbers, but anyway.
>
> As a matter of fact, your aggregates demo is daunting.
>
> On the Mac, it shows different counts even between the first and the
> subsequent executions. The counts of the first are lower, and also, even
> different terms show up. On Linux, I do not observe different counts
> between runs.

The issue you describe for the Mac is the issue I discussed here.

> But, what's more bothering is, I observed different results in regard to
> the shard count, and that is both on Mac and Linux. The more the hit count
> is on top of the buckets, the more the counts match, only the lower buckets
> differ, so the deviating counts are somewhat hard to notice.

That the counts differ when you change the number of shards is a long-known
problem in Elasticsearch, and it was also a problem in faceting. A long
thread about the nature of this problem can be found here:
https://github.com/elasticsearch/elasticsearch/issues/1305 (terms facet
gives wrong count with n_shards > 1).

It is an issue that you can easily circumvent with one of two options:

  1. Use the term you aggregate on as a routing key. This forces the same
    tokens into the same shard, and thus always returns the exact count.
    This only works, though, if you do this kind of analytics over a
    single field.
  2. Increase the shard_size for the terms aggregation. This way the
    individual shards build bigger lists, which then have a better chance
    of containing the actual top terms.
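
The effect of both options can be sketched with a toy model in Python. This
is a simplified simulation, not Elasticsearch code: the crc32-based shard_of
function, the synthetic documents and the terms_agg helper are all
assumptions made for illustration. Each simulated shard reports only its own
top shard_size terms, and the merged list is cut to size.

```python
import zlib
from collections import Counter


def top_terms(counts, n):
    """Top n (term, count) pairs, highest count first, ties broken by term."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def shard_of(key, n_shards):
    # Stable hash so the example is repeatable; not Elasticsearch's routing hash.
    return zlib.crc32(key.encode("utf-8")) % n_shards


def terms_agg(docs, n_shards, size, shard_size, route_by_term=False):
    """Simulate a sharded terms aggregation over (doc_id, term) pairs."""
    shards = [Counter() for _ in range(n_shards)]
    for doc_id, term in docs:
        key = term if route_by_term else doc_id  # option 1 routes on the term
        shards[shard_of(key, n_shards)][term] += 1
    merged = Counter()
    for shard in shards:
        # Each shard reports only its local top `shard_size` terms.
        for term, count in top_terms(shard, shard_size):
            merged[term] += count
    return top_terms(merged, size)


# Synthetic data: term "t<k>" occurs (60 - k) times, so counts lie close together.
docs = [(f"doc-{k}-{i}", f"t{k}") for k in range(50) for i in range(60 - k)]

exact = top_terms(Counter(term for _, term in docs), 10)
default = terms_agg(docs, n_shards=10, size=10, shard_size=10)
bigger = terms_agg(docs, n_shards=10, size=10, shard_size=50)
routed = terms_agg(docs, n_shards=10, size=10, shard_size=10, route_by_term=True)

print("shard_size=10 matches exact:", default == exact)
print("shard_size=50 matches exact:", bigger == exact)
print("routed by term matches exact:", routed == exact)
```

In this model a count can only be too low, never too high, because the
merged count is a sum over a subset of the shards. Routing on the term and a
shard_size that covers every distinct term both give the exact top 10 here.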

> I use Java 8 FCS, but since you observe this issue also on Java 7, I think
> it is not an issue of Java 8. And it's both on Mac and Linux, but with
> different symptoms.

This leaves Mac OS X as the only factor common to all the machines showing
the issue, across all Java versions; I tested both 1.7 and 1.6. It is
unfortunate that Adrien wasn't able to reproduce it on OSX.

Oh, OK, I see, there is a shard count and size issue... I thought the
aggregations framework was able to collect bucket counts from shards in a
round-robin fashion, so that the counts of the global bucket are always
accurate and do not depend on the number of shards.
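
A tiny hand-made case shows why the merged counts depend on what each shard
reports. This is a Python sketch of a simplified merge, not the actual
Elasticsearch implementation, and the per-shard counts are invented for
illustration.

```python
from collections import Counter


def top_terms(counts, n):
    """Top n (term, count) pairs, highest count first, ties broken by term."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def merged_top_terms(shards, size, shard_size):
    """Sum the per-shard top lists, then keep the overall top `size`."""
    merged = Counter()
    for shard in shards:
        # A shard never reports terms outside its local top `shard_size`.
        for term, count in top_terms(shard, shard_size):
            merged[term] += count
    return top_terms(merged, size)


# Invented per-shard term counts.
shard_a = {"x": 10, "y": 6, "z": 5}
shard_b = {"x": 9, "z": 7, "y": 1}

exact = top_terms(Counter(shard_a) + Counter(shard_b), 3)
approx = merged_top_terms([shard_a, shard_b], size=3, shard_size=2)

print(exact)   # [('x', 19), ('z', 12), ('y', 7)]
print(approx)  # [('x', 19), ('z', 7), ('y', 6)]
```

Shard A drops z (5) and shard B drops y (1) from their local top two, so
both terms come out too low after the merge, while x is exact because every
shard reported it.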

Jörg
