Wrong facet counts?


(guanyum) #1

Hi all,

I indexed 4,230,593 docs into a 3-node ES cluster with 6 shards and 2
index types.

I tried a facet query, and the count is lower than I expected. Here
is my test:

query1 (match certain results)

curl -X POST "http://10.20.157.111:9200/sourcing/product/_search?size=1&pretty=1" -d '
{
  "query": {"query_string": {"query": "pub_cat_id:145"}},
  "facets": {
    "pub_cat": {"terms": {"field": "pub_cat_id", "size": 1}}
  }
}
'
The count for pub_cat_id:145 is 501,779, which matches the result I got
from Oracle.

query2  (match all query)

curl -X POST "http://10.20.157.111:9200/sourcing/product/_search?size=1&pretty=1" -d '
{
  "query": {"match_all": {}},
  "facets": {
    "pub_cat": {"terms": {"field": "pub_cat_id", "size": 1}}
  }
}
'
With this query the count for pub_cat_id:145 drops to 500,337.

Is this a bug, or am I misusing something? Any comments or suggestions are
much appreciated. Thanks.

The Elasticsearch version I use is 0.19.9, and pub_cat_id is a
single-valued integer. I index docs like this:

XContentFactory.jsonBuilder().startObject().startObject(typeName)
.field("pub_cat_id", aProductDo.getCategoryId())



(ppearcy) #2

Hi,
Take a look at this discussion:
http://elasticsearch-users.115913.n3.nabble.com/Inconsistent-facet-count-td2977512.html

Here is the key info:
"This happens because of the way the distributed facet calculation works.
It gets the top 5 from each shard, and then aggregates it. Because you have
an even distribution of terms, it will not return exact matches. if you
increase the size, you will get better results. One possible value can be
the size times the number of shards, for example: 25."
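To make the quoted explanation concrete, here is a minimal sketch of the per-shard truncation, not Elasticsearch's actual code. The class name, method names, and the three shards with their counts are made up for illustration; each shard is modelled as a plain map from term to local count.

```java
import java.util.*;
import java.util.stream.*;

public class ShardFacetSketch {
    // Merge facet counts the way the quote describes: every shard reports
    // only its local top `size` terms, and the coordinator sums those.
    static Map<Integer, Long> topTerms(List<Map<Integer, Long>> shards, int size) {
        Map<Integer, Long> merged = new HashMap<>();
        for (Map<Integer, Long> shard : shards) {
            shard.entrySet().stream()
                 .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                 .limit(size) // counts below the local cut-off are never sent
                 .forEach(e -> merged.merge(e.getKey(), e.getValue(), Long::sum));
        }
        // The coordinator keeps the top `size` of the summed counts.
        return merged.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                .limit(size)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                                          (a, b) -> a, LinkedHashMap::new));
    }

    // Hypothetical data: term 145 leads on two shards but not on the third.
    static List<Map<Integer, Long>> sampleShards() {
        return List.of(
            Map.of(145, 100L, 200, 40L),
            Map.of(145, 90L, 200, 50L),
            Map.of(145, 30L, 200, 60L));
    }

    public static void main(String[] args) {
        List<Map<Integer, Long>> shards = sampleShards();
        System.out.println(topTerms(shards, 1)); // {145=190}, true total is 220
        System.out.println(topTerms(shards, 2)); // {145=220, 200=150}
    }
}
```

With size 1, the third shard reports only term 200, so its 30 documents for term 145 never reach the coordinator. That is the same kind of shortfall as 500,337 versus 501,779 in the original question.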

Try increasing the facet size. It will help, but it still isn't bulletproof.
The only bulletproof (albeit crappy) solutions are:

  • Have all your data in a single shard
  • Have your size set to the number of distinct terms. This is overkill,
    though, and results are usually good enough following Shay's guidance
    (size * number of shards)

Best Regards,
Paul

(In reply to Brian Hu's post #1 of Tuesday, October 9, 2012 2:52:28 AM UTC-6.)


(ppearcy) #3

Heya,
Check out this thread:
http://elasticsearch-users.115913.n3.nabble.com/Inconsistent-facet-count-td2977512.html

The key comment:
This happens because of the way the distributed facet calculation works. It
gets the top 5 from each shard, and then aggregates it. Because you have an
even distribution of terms, it will not return exact matches. if you
increase the size, you will get better results. One possible value can be
the size times the number of shards, for example: 25.

A couple of reasonable, albeit not ideal, workarounds:

  • Request a much higher facet size than you actually need
  • If you need to guarantee exact counts, use a two-pass query. Run the
    first query to get your initial list of terms, then feed those back in
    as part of a second query to get the exact counts.
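The two-pass idea can be sketched as a small simulation. This is an illustration under assumptions, not Elasticsearch client code: the class, the method names, and the shard data are hypothetical, and each shard is again a plain map from term to local count. In a real cluster the second pass would be a search restricted to the candidate terms.

```java
import java.util.*;
import java.util.stream.*;

public class TwoPassFacetSketch {
    // Pass 1: approximate candidates, merged from each shard's local top `size`.
    static List<Integer> candidateTerms(List<Map<Integer, Long>> shards, int size) {
        Map<Integer, Long> merged = new HashMap<>();
        for (Map<Integer, Long> shard : shards) {
            shard.entrySet().stream()
                 .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                 .limit(size)
                 .forEach(e -> merged.merge(e.getKey(), e.getValue(), Long::sum));
        }
        return merged.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue().reversed())
                .limit(size)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Pass 2: ask every shard for the exact local count of each candidate,
    // so no shard's contribution is cut off.
    static Map<Integer, Long> exactCounts(List<Map<Integer, Long>> shards,
                                          Collection<Integer> terms) {
        Map<Integer, Long> exact = new LinkedHashMap<>();
        for (Integer term : terms) {
            long total = 0;
            for (Map<Integer, Long> shard : shards) {
                total += shard.getOrDefault(term, 0L);
            }
            exact.put(term, total);
        }
        return exact;
    }

    // Hypothetical data: term 145 leads on two shards but not on the third.
    static List<Map<Integer, Long>> sampleShards() {
        return List.of(
            Map.of(145, 100L, 200, 40L),
            Map.of(145, 90L, 200, 50L),
            Map.of(145, 30L, 200, 60L));
    }

    public static void main(String[] args) {
        List<Map<Integer, Long>> shards = sampleShards();
        List<Integer> candidates = candidateTerms(shards, 1);
        System.out.println(candidates);                      // [145]
        System.out.println(exactCounts(shards, candidates)); // {145=220}
    }
}
```

Note that the second pass only makes the counts exact for the terms the first pass returned. A term that narrowly missed every shard's local cut-off would still be absent from the list.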

Best Regards,
Paul


