I indexed 4,230,593 docs into a 3-node ES cluster with 6 shards and 2
index types.
I tried a facet query, and the count is lower than I expected. Here
is my test:
query1 (matches a single category)
curl -X POST "http://10.20.157.111:9200/sourcing/product/_search?size=1&pretty=1" -d '
{
  "query": {"query_string": {"query": "pub_cat_id:145"}},
  "facets": {
    "pub_cat": {"terms": {"field": "pub_cat_id", "size": 1}}
  }
}
'
The count for pub_cat_id:145 is 501,779, which matches the
result I got from Oracle.
query2 (match_all query)
curl -X POST "http://10.20.157.111:9200/sourcing/product/_search?size=1&pretty=1" -d '
{
  "query": {"match_all": {}},
  "facets": {
    "pub_cat": {"terms": {"field": "pub_cat_id", "size": 1}}
  }
}
'
Here the count for pub_cat_id:145 drops to 500,337.
Is this a bug, or am I misusing something? Any comments or suggestions are
much appreciated. Thanks.
The Elasticsearch version I use is 0.19.9, and pub_cat_id is a
single-valued integer. I index docs like this,
Here is the key info:
"This happens because of the way the distributed facet calculation works.
It gets the top 5 from each shard, and then aggregates it. Because you have
an even distribution of terms, it will not return exact matches. if you
increase the size, you will get better results. One possible value can be
the size times the number of shards, for example: 25."
Try increasing the facet size. It will help, but it still isn't bulletproof.
The only bulletproof (albeit crappy) solutions are:
- Have all your data in a single shard.
- Set your facet size to the number of distinct terms. This is overkill,
  though, and results are usually good enough following Shay's guidance
  (size * number of shards).
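
To make the explanation above concrete, here is a small Python sketch of the "top N from each shard, then aggregate" behaviour. It is a simplified model, not Elasticsearch's actual code: the per-shard counts are made-up numbers, and `shard_top_n` and `merged_facet` are hypothetical helper names. It only shows why a term can be undercounted when it is not in the top N on every shard, and why a larger per-shard size recovers the exact count.

```python
from collections import Counter

def shard_top_n(shard_counts, n):
    """Return the n most frequent terms on one shard with their local counts."""
    return Counter(shard_counts).most_common(n)

def merged_facet(shards, shard_size, size):
    """Sum only the per-shard top lists, then keep the overall top `size` terms."""
    total = Counter()
    for shard_counts in shards:
        for term, count in shard_top_n(shard_counts, shard_size):
            total[term] += count
    return total.most_common(size)

# Hypothetical per-shard document counts per pub_cat_id (illustrative only).
shards = [
    {145: 100, 200: 90, 300: 10},
    {145: 80, 200: 95, 300: 20},   # 145 is NOT the top term on this shard
    {145: 100, 200: 85, 300: 15},
]

# Exact counts, as a single shard (or a database) would report them.
exact = Counter()
for shard_counts in shards:
    exact.update(shard_counts)

print(exact[145])                  # 280
print(merged_facet(shards, 1, 1))  # [(145, 200)]  shard 2's 80 docs are lost
print(merged_facet(shards, 3, 1))  # [(145, 280)]  size * number of shards = 3
```

In this toy data the second shard ranks 200 above 145, so with a per-shard size of 1 its 80 documents for 145 never reach the merge step. Asking each shard for 3 terms (size times the number of shards) happens to cover every distinct term here, which is why the count becomes exact.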
Best Regards,
Paul
On Tuesday, October 9, 2012 2:52:28 AM UTC-6, Brian Hu wrote:
The key comment:
This happens because of the way the distributed facet calculation works. It
gets the top 5 from each shard, and then aggregates it. Because you have an
even distribution of terms, it will not return exact matches. if you
increase the size, you will get better results. One possible value can be
the size times the number of shards, for example: 25.
A couple of reasonable, albeit not ideal, workarounds:
- Request a much higher facet size than you actually need.
- If you need to guarantee exact counts, use a two-pass query. Run the
  first query to get your initial list of terms, then feed those terms back
  in as part of a second query to get the exact counts.
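
A rough Python sketch of the two-pass idea, building only the request bodies (sending them with curl or an HTTP client is left out). The function names and the facet names `candidates` and `exact` are hypothetical; the body shape follows the `query_string` and terms-facet requests from the original post. It assumes the field is single-valued, as pub_cat_id is here, so that in the second pass each shard sees at most as many distinct terms as the facet size and nothing gets cut off.

```python
import json

def first_pass_body(field, wanted, num_shards):
    """Pass 1: over-ask for terms (wanted * num_shards) to collect candidates."""
    return {
        "query": {"match_all": {}},
        "facets": {
            "candidates": {"terms": {"field": field, "size": wanted * num_shards}}
        },
    }

def second_pass_body(field, candidate_terms):
    """Pass 2: restrict the query to the candidates; facet size == number of terms."""
    terms = list(candidate_terms)
    clause = " OR ".join(str(t) for t in terms)
    return {
        "query": {"query_string": {"query": f"{field}:({clause})"}},
        "facets": {
            "exact": {"terms": {"field": field, "size": len(terms)}}
        },
    }

# Example: 6 shards, one term wanted; 145 and 200 stand in for pass-1 results.
print(json.dumps(first_pass_body("pub_cat_id", 1, 6)))
print(json.dumps(second_pass_body("pub_cat_id", [145, 200])))
```

The second body can be posted to the same `_search` endpoint as the queries in the original post. The counts it returns are for the candidate terms only, so a term that never made it into the pass-1 list will still be missing.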
Best Regards,
Paul