Field sorting and facets

Crwe · May 8, 2012, 9:40am

Hello all,

I am new to ES and have two questions:

I have an "author" field, which I want to be searchable using the
default analyzer (~tokens). But I also want to be able to return facet
counts for this field unanalyzed, as a whole string, no tokens.

I googled up this solution
http://blog.wiercinski.net/2011/uncategorized/elasticsearch-sorting-on-string-types-with-more-than-one-value-per-doc-or-more-than-one-token-per-field/
mapping:

"properties": {
"author": {
"type": "multi_field",
"fields": {
"author": {"type": "string"},
"untouched": {"type": "string", "index": "not_analyzed"}
}
}
}

And then I query with:

{
"query": {
"query_string": {"query": "some_token_from_author"}
},
"sort": [{"author.untouched" : "asc"}]
}

It seems to work, but is this the right way to do it? Is there a
better way to make a field token-searchable, but also return facets
for the whole string at the same time?

Question 2 (possibly related):

The query:

{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}

returns e.g. count=2 for author value of "W. Ellis Penning". But
when I run the query:

{
"query": {
"query_string": {
"query": "(author:\"W. Ellis Penning\")"
}
}
}

I actually get back 4 hits?! Why the difference? Why weren't the other
two documents counted in the facet?

The author field contains an array of values like

"_source": {
"author": [
"László G.-Tóth",
"Sandra Poikane",
"W. Ellis Penning",
"Gary Free",
"Helle Mäemets",
"Agnieszka Kolada",
"Jenica Hanganu"
]
}

vinhphu1711 · May 8, 2012, 9:56am

Hi,

For question 1: I don't see any problem with that solution. In my opinion, 'multi_field' fits your requirement perfectly.
For question 2: yes, it's related to question 1 ;).

In the terms facet, you asked for the 'author.untouched' version of the 'author' field. No analyzer is applied to it, so you get facet values on the whole field.
In the query_string, you queried the 'author' field, which is the 'author.author' version in your multi_field declaration. The standard analyzer is applied in this case.
So the query input "W. Ellis Penning" will be broken into the tokens "w", "ellis", "penning" (the standard analyzer also lowercases), and any author whose name matches ANY of these tokens will be considered a match.
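
To make the difference between the two sub-fields concrete, here is a minimal Python sketch. It is not Elasticsearch code: `analyze` is a hypothetical stand-in that only splits on non-word characters and lowercases, which roughly approximates what the standard analyzer does to a name, and `index_author` is an illustrative helper, not part of any library.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer (an assumption):
    split on non-word characters and lowercase each token."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


def index_author(values):
    """Show what each sub-field of the multi_field would hold for one document."""
    return {
        # analyzed version: one token list per author value
        "author": [analyze(v) for v in values],
        # not_analyzed version: each value kept as a single whole term
        "author.untouched": list(values),
    }


doc = index_author(["Sandra Poikane", "W. Ellis Penning"])
print(doc["author"])            # [['sandra', 'poikane'], ['w', 'ellis', 'penning']]
print(doc["author.untouched"])  # ['Sandra Poikane', 'W. Ellis Penning']
```

A terms facet on 'author.untouched' counts the whole strings, while a query on 'author' works on the individual tokens.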

Hope this helps.

Regards,
LTVP

Crwe · May 8, 2012, 10:10am

Hello LTVP,

thanks for the quick reply!

> In the term facet, you asked for 'author.untouched' version of 'author' field. So, no analyzer will be applied, and you get facet values on whole field.
>
> In the query_string, you queried on 'author' field, which is 'author.author' version in your multi_field declaration. The standard analyzer will be applied in this case.
> So, the query input "W. Ellis Penning" will be broken into tokens: "W", "Ellis", "Penning" . So, any author whose name matches ANY of these token will be consider matched.

I don't think that's it. Firstly, the query is in double quotes "",
which I understand should search for the entire phrase.

Secondly, all four returned documents indeed contain "W. Ellis
Penning" verbatim, inside the author list. Why two of them were not
counted in the facet remains a mystery to me.

vinhphu1711 · May 8, 2012, 10:15am

Hi Crwe,

It would be great if you could gist a recreation of your problem, as described at http://www.elasticsearch.org/help/

I'm interested in investigating it further.

LTVP

Crwe · May 8, 2012, 12:35pm

I've been trying to create a minimum failing example for the past two
hours, but failed.

When I enter the same four documents (and nothing else) into a new
testing index, the facet count matches the search hit count, like I
would expect.

I also tried deleting the index and re-indexing everything, using the
same script as before. Now the problem with "W. Ellis Penning" is gone
(the facet returns count=4, as it should), but "Alexander Pasko" now
has a facet count of 4 while a direct query returns 6 hits.

So the issue seems to pop up randomly, with different records.

LTVP, I'll try sending you the dataset and the script privately
(unfortunately it's not public). Could you please try replicating
this? It's driving me crazy.

Apart from the dataset, let's keep the discussion here, in public.

Many thanks!

Crwe · May 8, 2012, 1:36pm

Ok, I managed to create a public version in the end.

The indexing script: a Python script on Pastebin.com (it imports pyes and cPickle)
Data: http://minus.com/meJfL8TFr/ (python pickle format, 80kb)

The script uses Python ES wrapper ("$ easy_install pyes") to create
the mapping and index the data.

After indexing, the mismatch is exemplified by running:

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}'

which returns counts for various authors. I see count 4 for "Alexander
Pasko", for example (but this differs between indexing runs).

Then when I query "Alexander Pasko" directly:

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
"query": {
"query_string": {
"query": "(author:\"Alexander Pasko\")"
}
}
}'

I get back 6 hits.

vinhphu1711 · May 9, 2012, 8:16am

Hi Crwe,

I tried your scripts to replicate the problem. Yes, I also find it
weird. Must be something wrong in the facet computation.

I found a similar bug here:

https://github.com/elasticsearch/elasticsearch/issues/1305
"terms facet gives wrong count with n_shards > 1" (opened 6 Sep 2011 by jmchambers)

The reporter found that with one shard the counts are always correct, but with more than one shard the terms at the bottom of the facet list show counts that are too low whenever "size" is smaller than the number of distinct terms.

I tried the following queries:

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}'

==> this gives
...
{
"term" : "Alexander Pasko",
"count" : 4
}
...

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10, "regex" :
"^Alexander.*", "regex_flags" : "DOTALL"}
}
},
"query": {
"match_all": {}
}
}'
==> this gives
...
{
"term" : "Alexander Pasko",
"count" : 6
}
...

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
"query": {
"query_string": {
"query": "(author:\"Alexander Pasko\")"
}
},
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
}
}'

===> this also gives:
...
{
"term" : "Alexander Pasko",
"count" : 6
}
...

I don't understand. Can someone else explain?

LTVP

vinhphu1711 · May 9, 2012, 10:22am

Kimchy's comment in the ticket https://github.com/elasticsearch/elasticsearch/issues/1305 makes sense. I quote it here:
"Right, the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results. The phase 3 thingy is not really a solution, will read the paper though :)"

I think the root cause of the problem is the naive top-N facet merging.
In this query:

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}'

if you increase the facet size (currently 10) to a very large value, the counts are correct for all entries.

I don't know whether there is a plan to fix this. I tried the latest stable ES version, 0.19.3, and the problem remains.
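
The merging behaviour described in the quote can be imitated in a few lines of Python. This is only a sketch of the idea (each shard reports its own top N, and the lists are summed), not the actual Elasticsearch implementation; the function names and the per-shard counts are made up for illustration.

```python
from collections import Counter


def shard_top_n(term_counts, n):
    """The top n (term, count) pairs that a single shard would report."""
    return Counter(term_counts).most_common(n)


def merged_facet(shards, n):
    """Sum the per-shard top-n lists, then keep the overall top n."""
    merged = Counter()
    for counts in shards:
        for term, count in shard_top_n(counts, n):
            merged[term] += count
    return dict(merged.most_common(n))


# Hypothetical per-shard counts; the true total for "Alexander Pasko" is 6.
shards = [
    {"Alexander Pasko": 4, "Gary Free": 3, "Sandra Poikane": 1},
    {"Alexander Pasko": 2, "Gary Free": 5, "Sandra Poikane": 3},
]

print(merged_facet(shards, 2))  # {'Gary Free': 8, 'Alexander Pasko': 4}
print(merged_facet(shards, 3))  # {'Gary Free': 8, 'Alexander Pasko': 6, 'Sandra Poikane': 4}
```

With n=2 the second shard never reports its 2 documents for "Alexander Pasko", because that term is not in its local top 2, so the merged count is 4 instead of 6. Once n covers every term on every shard, the counts add up correctly.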

LTVP

Crwe · May 9, 2012, 11:53am

Thank you again LTVP,

very helpful link and comments.

If I understand correctly, I have three options now (displaying wrong
counts to users is a show stopper for me, no option):

1. ask for a huge ".size" in the facet query options, trim to top 10
   manually, and hope the issue doesn't come up anymore
2. use only one shard
3. instead of relying on counts coming from facets, run another direct
   query for each of the top 10 facet values, and use those hit counts as
   the facet count.

Nr. 3 means ten extra ES requests (performance hit? how big?)

Nr. 2 I'm not sure what would mean re. performance and maintenance.

Nr. 1 sounds the easiest, but rather hackish.

Not good news either way
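
For reference, option 1 could look roughly like the Python sketch below. It only builds the request body and trims a response that has the same "term"/"count" shape as the facet output shown earlier in this thread; sending the request (with pyes, curl or anything else) is left out. The helper names and the oversize value of 1000 are assumptions for illustration, and a large size only helps if it covers the distinct terms on every shard.

```python
def facet_request(field, oversize=1000):
    """Search body that asks for far more facet entries than will be displayed.
    The value 1000 is an arbitrary illustrative choice."""
    return {
        "query": {"match_all": {}},
        "facets": {"author": {"terms": {"field": field, "size": oversize}}},
    }


def top_terms(response, facet_name, n=10):
    """Trim an oversized terms facet to its n most frequent entries."""
    entries = response["facets"][facet_name]["terms"]
    ranked = sorted(entries, key=lambda e: e["count"], reverse=True)
    return ranked[:n]


# Hand-written response fragment with made-up counts, for demonstration only.
fake_response = {
    "facets": {
        "author": {
            "terms": [
                {"term": "Gary Free", "count": 8},
                {"term": "Alexander Pasko", "count": 6},
                {"term": "Sandra Poikane", "count": 4},
            ]
        }
    }
}

body = facet_request("author.untouched")
print(body["facets"]["author"]["terms"]["size"])  # 1000
print(top_terms(fake_response, "author", n=2))
```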

On May 9, 12:22 pm, Phu Le le.truong.vinh....@gmail.com wrote:

Kimchy's comment in the tickethttps://github.com/elasticsearch/elasticsearch/issues/1305makes sense. I quote it here:
"Right, the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results. The phase 3 thingy is not really a solution, will read the paper though :)"

I think the root cause of the problem is the naive top N facet merging.
In this query:

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}'

if you increase the facet size (currently 10) to very large, the result is correct for all entries.

I don't know if they have a plan to fix this. I've tried latest stable ES version 0.19.3, the problem remains.
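
To see why per-shard top-N merging can undercount, here is a small self-contained simulation. The shard contents are invented for illustration (they are not the dataset from this thread), and merged_facet is a hypothetical helper, not Elasticsearch code.

```python
from collections import Counter


def shard_top_n(shard_counts, n):
    """Return the n most frequent terms of a single shard as a dict."""
    return dict(Counter(shard_counts).most_common(n))


def merged_facet(shards, n, shard_size=None):
    """Merge per-shard top lists and return the global top n as a dict.

    With shard_size left at None, each shard contributes only its own
    top n, which is the naive merge described in the quoted comment.
    """
    shard_size = n if shard_size is None else shard_size
    merged = Counter()
    for shard in shards:
        merged.update(shard_top_n(shard, shard_size))
    return dict(merged.most_common(n))


# Invented per-shard term counts: "Alexander Pasko" is in the top 2
# on the first shard but only third on the second shard.
shards = [
    {"Author A": 5, "Alexander Pasko": 4, "Author X": 1},
    {"Author A": 5, "Author Y": 3, "Alexander Pasko": 2},
]

naive = merged_facet(shards, n=2)                      # each shard sends 2 terms
oversampled = merged_facet(shards, n=2, shard_size=100)  # each shard sends everything

print(naive)        # Pasko's 2 documents on the second shard are lost
print(oversampled)  # all documents are counted
```

In the naive merge the second shard never reports its two "Alexander Pasko" documents, so the merged count is 4 instead of 6. Raising the per-shard size, as suggested above, closes the gap in this toy case.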

LTVP

On May 9, 2012, at 4:16 PM, Phu Le wrote:

Hi Crwe,

I tried your scripts to replicate the problem. Yes, I also find it
weird. Must be something wrong in the facet computation.

I found a similar bug here:
https://github.com/elasticsearch/elasticsearch/issues/1305

I tried the following queries:

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}'

==> this gives
...
{
"term" : "Alexander Pasko",
"count" : 4
}
...

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10, "regex" :
"^Alexander.*", "regex_flags" : "DOTALL"}
}
},
"query": {
"match_all": {}
}
}'
==> this gives
...
{
"term" : "Alexander Pasko",
"count" : 6
}
...

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
"query": {
"query_string": {
"query": "(author:\"Alexander Pasko\")"
}
},
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
}
}'

===> this also gives:
...
{
"term" : "Alexander Pasko",
"count" : 6
}
...

I don't understand. Can someone else explain?

LTVP

On May 8, 9:36 pm, Crwe tester.teste...@gmail.com wrote:

Ok, I managed to create a public version in the end.

The indexing script: a Python script (imports pyes and cPickle), posted on Pastebin.com
Data: http://minus.com/meJfL8TFr/ (Python pickle format, 80 kB)

The script uses Python ES wrapper ("$ easy_install pyes") to create
the mapping and index the data.

After indexing, the mismatch is exemplified by running:

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}

}'

which returns counts for various authors. I see count 4 for "Alexander
Pasko", for example (but this differs between indexing runs).

Then when I query "Alexander Pasko" directly:

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
"query": {
"query_string": {
"query": "(author:\"Alexander Pasko\")"
}
}

}'

I get back 6 hits.

On May 8, 2:35 pm, Crwe tester.teste...@gmail.com wrote:

I've been trying to create a minimum failing example for the past two
hours, but failed.

When I enter the same four documents (and nothing else) into a new
testing index, the facet count matches the search hit count, like I
would expect.

I also tried deleting the index and re-indexing everything, using the
same script as before. Now the problem with "W. Ellis Penning" is gone
(facet returns count=4 as it should), but Mr. "Alexander Pasko"
returns facet count 4 while a direct query count=6.

So the issue seems to pop up randomly, with different records.

LTVP, I'll try sending you the dataset and the script privately
(unfortunately it's not public). Could you please try replicating
this? It's driving me crazy.

Apart from the dataset, let's keep the discussion here, in public.

Many thanks!

On May 8, 12:15 pm, Phu Le le.truong.vinh....@gmail.com wrote:

Hi Crwe,

Would be great if you can gist a recreation of your problem, as described at http://www.elasticsearch.org/help/

I'm interested to investigate it further

LTVP

On May 8, 2012, at 6:10 PM, Crwe wrote:

Hello LTVP,

thanks for the quick reply!

In the term facet, you asked for 'author.untouched' version of 'author' field. So, no analyzer will be applied, and you get facet values on whole field.

In the query_string, you queried on 'author' field, which is 'author.author' version in your multi_field declaration. The standard analyzer will be applied in this case.
So, the query input "W. Ellis Penning" will be broken into tokens: "W", "Ellis", "Penning". So, any author whose name matches ANY of these tokens will be considered a match.

I don't think that's it. Firstly, the query is in double quotes "",
which I understand should search for the entire phrase.

Secondly, all four returned documents indeed contain "W. Ellis
Penning" verbatim, inside the author list. Why two of them were not
counted in the facet remains a mystery to me.
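
The difference between the three kinds of matching discussed here can be illustrated with a toy model. This is a rough approximation only: analyze is a made-up stand-in that lowercases and splits on non-word characters, not the real standard analyzer, and "Ellis Smith" is an invented author added for contrast.

```python
import re


def analyze(text):
    """Crude stand-in for an analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def any_token_match(query, value):
    """True if the value shares at least one token with the query."""
    return bool(set(analyze(query)) & set(analyze(value)))


def phrase_match(query, value):
    """True if the query tokens occur consecutively and in order in the value."""
    q, v = analyze(query), analyze(value)
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))


def exact_match(query, value):
    """What an unanalyzed field compares: the whole string, unchanged."""
    return query == value


authors = ["W. Ellis Penning", "Ellis Smith", "Gary Free"]
query = "W. Ellis Penning"

for name in authors:
    print(name,
          any_token_match(query, name),
          phrase_match(query, name),
          exact_match(query, name))
```

In this model a quoted phrase behaves like phrase_match, so it should not pick up "Ellis Smith", which is consistent with the point that all four hits contain the full name.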

Hope this helps.

Regards,
LTVP

On May 8, 2012, at 5:40 PM, Crwe wrote:

Hello all,

I am new to ES and have two questions:

I have an "author" field, which I want to be searchable using the
default analyzer (~tokens). But I also want to be able to return facet
counts for this field unanalyzed, as a whole string, no tokens.

I googled up this solution
http://blog.wiercinski.net/2011/uncategorized/elasticsearch-sorting-o...
mapping:

'properties': {
'author': {
"type": "multi_field",
"fields": {"author": {"type" : "string"}, "untouched" :
{"index" : "not_analyzed","type": "string"}}
}
}

And then I query with:

{
"query": {
"query_string": {"query": "some_token_from_author"}
},
"sort": [{"author.untouched" : "asc"}]
}

It seems to work, but is this the right way to do it? Is there a
better way to make a field token-searchable, but also return facets
for the whole string at the same time?

Question 2 (possibly related):

The query:

{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}

returns e.g. count=2 for author value of "W. Ellis Penning". But
when I run the query:

{
"query": {
"query_string": {
"query": "(author:\"W. Ellis Penning\")"
}
}
}

I actually get back 4 hits?! Why the difference? Why weren't the other
two documents counted in the facet?

The author field contains an array of values like

"_source": {
"author": [
"László G.-Tóth",
"Sandra Poikane",
"W. Ellis Penning",
"Gary Free",
"Helle Mäemets",
"Agnieszka Kolada",
"Jenica Hanganu"
]
}
