Field sorting and facets


(Crwe) #1

Hello all,

I am new to ES and have two questions:

  1. I have an "author" field, which I want to be searchable using the
    default analyzer (~tokens). But I also want to be able to return facet
    counts for this field unanalyzed, as a whole string, no tokens.

I googled up this solution
http://blog.wiercinski.net/2011/uncategorized/elasticsearch-sorting-on-string-types-with-more-than-one-value-per-doc-or-more-than-one-token-per-field/
mapping:

"properties": {
  "author": {
    "type": "multi_field",
    "fields": {
      "author": {"type": "string"},
      "untouched": {"type": "string", "index": "not_analyzed"}
    }
  }
}

And then I query with:

{
  "query": {
    "query_string": {"query": "some_token_from_author"}
  },
  "sort": [{"author.untouched": "asc"}]
}

It seems to work, but is this the right way to do it? Is there a
better way to make a field token-searchable, but also return facets
for the whole string at the same time?


Question 2 (possibly related):

The query:

{
  "facets": {
    "author": {
      "terms": {"field": "author.untouched", "size": 10}
    }
  },
  "query": {
    "match_all": {}
  }
}

returns e.g. count=2 for author value of "W. Ellis Penning". But
when I run the query:

{
  "query": {
    "query_string": {
      "query": "(author:\"W. Ellis Penning\")"
    }
  }
}

I actually get back 4 hits?! Why the difference? Why weren't the other
two documents counted in the facet?

The author field contains an array of values like

"_source": {
"author": [
"László G.-Tóth",
"Sandra Poikane",
"W. Ellis Penning",
"Gary Free",
"Helle Mäemets",
"Agnieszka Kolada",
"Jenica Hanganu"
]
}


(vinhphu1711) #2

Hi,

For question 1: I don't see any problem with that solution. In my opinion, 'multi_field' fits your requirement perfectly.
For question 2: yes, it's related to question 1 ;).

  • In the terms facet, you asked for the 'author.untouched' version of the 'author' field. No analyzer is applied to it, so you get facet values for the whole field value.
  • In the query_string, you queried the 'author' field, which is the 'author.author' version in your multi_field declaration. The standard analyzer is applied in this case.
    So the query input "W. Ellis Penning" will be broken into the tokens "W", "Ellis", "Penning", and any author whose name matches ANY of these tokens will be considered a match.
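
The difference between the two sub-fields can be sketched in a few lines of Python. This is only an illustration and makes some assumptions: `simple_analyze` is a hypothetical stand-in for the standard analyzer (it just lowercases and splits on non-word characters, which is a simplification), and `keyword` models what `not_analyzed` does.

```python
import re

def simple_analyze(value):
    """Hypothetical stand-in for an analyzer: lowercase, split on non-word characters."""
    return [tok for tok in re.split(r"\W+", value.lower()) if tok]

def keyword(value):
    """Models a not_analyzed field: the whole string is kept as a single term."""
    return [value]

authors = ["W. Ellis Penning", "Gary Free"]

# Terms that would be searchable in the analyzed sub-field.
analyzed_terms = [t for a in authors for t in simple_analyze(a)]
# Terms that a facet on the untouched sub-field would count.
untouched_terms = [t for a in authors for t in keyword(a)]

print(analyzed_terms)   # ['w', 'ellis', 'penning', 'gary', 'free']
print(untouched_terms)  # ['W. Ellis Penning', 'Gary Free']
```

A facet on the analyzed sub-field would count the individual tokens, which is why the facet in the question points at 'author.untouched'.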

Hope this helps.

Regards,
LTVP



(Crwe) #3

Hello LTVP,

thanks for the quick reply!

  • In the terms facet, you asked for the 'author.untouched' version of the 'author' field. No analyzer is applied to it, so you get facet values for the whole field value.
  • In the query_string, you queried the 'author' field, which is the 'author.author' version in your multi_field declaration. The standard analyzer is applied in this case.
    So the query input "W. Ellis Penning" will be broken into the tokens "W", "Ellis", "Penning", and any author whose name matches ANY of these tokens will be considered a match.

I don't think that's it. Firstly, the query is in double quotes "",
which I understand should search for the entire phrase.

Secondly, all four returned documents indeed contain "W. Ellis
Penning" verbatim, inside the author list. Why two of them were not
counted in the facet remains a mystery to me.
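
The distinction between "any token matches" and "the whole phrase matches" can be sketched as follows. This is a simplified model, not Elasticsearch's actual matching code: the tokenizer is an assumption (lowercase, split on non-word characters), and the author name "John Ellis" is a made-up example that does not come from the dataset.

```python
import re

def tokens(text):
    # Simplified tokenizer (assumption): lowercase, split on non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def matches_any_token(field_value, query):
    """True if at least one query token occurs in the field."""
    field = set(tokens(field_value))
    return any(t in field for t in tokens(query))

def matches_phrase(field_value, query):
    """True only if the query tokens occur consecutively and in order."""
    field, wanted = tokens(field_value), tokens(query)
    n = len(wanted)
    return any(field[i:i + n] == wanted for i in range(len(field) - n + 1))

print(matches_any_token("John Ellis", "W. Ellis Penning"))       # True
print(matches_phrase("John Ellis", "W. Ellis Penning"))          # False
print(matches_phrase("W. Ellis Penning", "W. Ellis Penning"))    # True
```

Under this model a quoted query would not match on a single shared token, which is consistent with all four hits containing the full name.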



(vinhphu1711) #4

Hi Crwe,

It would be great if you could gist a recreation of your problem, as described at http://www.elasticsearch.org/help/

I'm interested in investigating it further ;)

LTVP



(Crwe) #5

I've been trying to create a minimum failing example for the past two
hours, but failed.

When I enter the same four documents (and nothing else) into a new
testing index, the facet count matches the search hit count, like I
would expect.

I also tried deleting the index and re-indexing everything, using the
same script as before. Now the problem with "W. Ellis Penning" is gone
(the facet returns count=4, as it should), but Mr. "Alexander Pasko"
now has a facet count of 4 while a direct query returns 6 hits.

So the issue seems to pop up randomly, with different records.

LTVP, I'll try sending you the dataset and the script privately
(unfortunately it's not public). Could you please try replicating
this? It's driving me crazy.

Apart from the dataset, let's keep the discussion here, in public.

Many thanks!



(Crwe) #6

Ok, I managed to create a public version in the end.

The indexing script: http://pastebin.com/Rh47HRJx
Data: http://minus.com/meJfL8TFr/ (python pickle format, 80kb)

The script uses a Python ES wrapper, pyes ("$ easy_install pyes"), to create
the mapping and index the data.

After indexing, the mismatch is exemplified by running:

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
  "facets": {
    "author": {
      "terms": {"field": "author.untouched", "size": 10}
    }
  },
  "query": {
    "match_all": {}
  }
}'

which returns counts for various authors. I see count 4 for "Alexander
Pasko", for example (but this differs between indexing runs).

Then when I query "Alexander Pasko" directly:

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
  "query": {
    "query_string": {
      "query": "(author:\"Alexander Pasko\")"
    }
  }
}'

I get back 6 hits.



(vinhphu1711) #7

Hi Crwe,

I tried your scripts to replicate the problem. Yes, I also find it
weird. Must be something wrong in the facet computation.

I found a similar bug here:
https://github.com/elasticsearch/elasticsearch/issues/1305

I tried the following queries:

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
  "facets": {
    "author": {
      "terms": {"field": "author.untouched", "size": 10}
    }
  },
  "query": {
    "match_all": {}
  }
}'

==> this gives
...
{
"term" : "Alexander Pasko",
"count" : 4
}
...


curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
  "facets": {
    "author": {
      "terms": {
        "field": "author.untouched",
        "size": 10,
        "regex": "^Alexander.*",
        "regex_flags": "DOTALL"
      }
    }
  },
  "query": {
    "match_all": {}
  }
}'
==> this gives
...
{
"term" : "Alexander Pasko",
"count" : 6
}
...


curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
  "query": {
    "query_string": {
      "query": "(author:\"Alexander Pasko\")"
    }
  },
  "facets": {
    "author": {
      "terms": {"field": "author.untouched", "size": 10}
    }
  }
}'

===> this also gives:
...
{
"term" : "Alexander Pasko",
"count" : 6
}
...

I don't understand this. Can someone else explain?

LTVP



(vinhphu1711) #8

Kimchy's comment in the ticket https://github.com/elasticsearch/elasticsearch/issues/1305 makes sense. I quote it here:
"Right, the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results. The phase 3 thingy is not really a solution, will read the paper though :)"

I think the root cause of the problem is the naive top-N facet merging.
In this query:

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
  "facets": {
    "author": {
      "terms": {"field": "author.untouched", "size": 10}
    }
  },
  "query": {
    "match_all": {}
  }
}'

if you increase the facet size (currently 10) to a very large value, the result is correct for all entries.
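
To see why per-shard top-N merging can undercount, here is a toy model in Python of the behaviour described in the quote above. It is a sketch under stated assumptions, not Elasticsearch's implementation: the two shards and their counts are hypothetical (chosen so that the result mirrors the 4 versus 6 seen in this thread), and `top_n_merge` is an illustrative helper.

```python
from collections import Counter

def top_n_merge(shards, size):
    """Take the top `size` terms from each shard, then sum only what came back."""
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(size):
            merged[term] += count
    return merged.most_common(size)

# Hypothetical per-shard term counts (not taken from the thread's dataset).
shards = [
    Counter({"Author X": 5, "Alexander Pasko": 4, "Author Y": 1}),
    Counter({"Author X": 5, "Author Y": 3, "Alexander Pasko": 2}),
]

# With size=2 the second shard never reports "Alexander Pasko",
# so his 2 documents there are missing from the merged count.
print(top_n_merge(shards, size=2))  # [('Author X', 10), ('Alexander Pasko', 4)]

# With a size large enough to cover every term, the counts are exact.
print(top_n_merge(shards, size=3))
# [('Author X', 10), ('Alexander Pasko', 6), ('Author Y', 4)]
```

In this model the error depends on how documents happen to be distributed across shards, which would also explain why the affected author changes between indexing runs.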

I don't know if there is a plan to fix this. I've tried the latest stable ES version, 0.19.3, and the problem remains.

LTVP



(Crwe) #9

Thank you again LTVP,

very helpful link and comments.

If I understand correctly, I have three options now (displaying wrong
counts to users is a show stopper for me, so leaving it as is is not an option):

  1. ask for a huge "size" in the facet query options, trim to the top 10
    manually, and hope the issue doesn't come up anymore
  2. use only one shard
  3. instead of relying on the counts coming from facets, run another direct
    query for each of the top 10 facet values, and use those hit counts as
    the facet counts.

Nr. 3 means ten extra ES requests (a performance hit? how big?)

Nr. 2: I'm not sure what it would mean for performance and maintenance.

Nr. 1 sounds the easiest, but rather hackish.
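
For reference, options 1 and 3 could look roughly like this on the client side. This is a minimal sketch that only builds request bodies and trims a response; it sends nothing to a server. The helper names and the `oversample` factor are hypothetical (they are not Elasticsearch options), and using a `term` query on 'author.untouched' for the per-value count is an assumption on my part rather than something tested in this thread. Oversampling makes a mismatch less likely but does not by itself guarantee exact counts.

```python
import json

def facet_request(field, shown, oversample=100):
    """Option 1: ask for many more terms than will be displayed.
    `oversample` is a hypothetical client-side knob, not an Elasticsearch option."""
    return {
        "query": {"match_all": {}},
        "facets": {"author": {"terms": {"field": field, "size": shown * oversample}}},
    }

def trim_terms(facet_terms, shown):
    """Keep only the `shown` most frequent entries of a facet's "terms" list."""
    ranked = sorted(facet_terms, key=lambda entry: entry["count"], reverse=True)
    return ranked[:shown]

def count_query(field, value):
    """Option 3: an exact-match query on the not_analyzed sub-field; its total
    hit count would replace the facet count for `value`."""
    return {"query": {"term": {field: value}}}

body = facet_request("author.untouched", shown=10)
print(json.dumps(body, indent=2))
print(json.dumps(count_query("author.untouched", "Alexander Pasko")))
```

Option 3 would then issue one `count_query` per displayed facet value, so ten extra requests for a top 10 list.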

Not good news either way :(

On May 9, 12:22 pm, Phu Le le.truong.vinh....@gmail.com wrote:

Kimchy's comment in the tickethttps://github.com/elasticsearch/elasticsearch/issues/1305makes sense. I quote it here:
"Right, the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results. The phase 3 thingy is not really a solution, will read the paper though :)"

I think the root cause of the problem is the naiive top N facet merging.
In this query:

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}'

if you increase the facet size (currently 10) to very large, the result is correct for all entries.

I don't know if they have a plan to fix this. I've tried latest stable ES version 0.19.3, the problem remains.

LTVP

On May 9, 2012, at 4:16 PM, Phu Le wrote:

Hi Crwe,

I tried your scripts to replicate the problem. Yes, I also find it
weird. Must be something wrong in the facet computation.

I found a similar bug here:
https://github.com/elasticsearch/elasticsearch/issues/1305

I tried the following queries:

curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}'

==> this gives
...
{
"term" : "Alexander Pasko",
"count" : 4
}
...


curl -XGET 'localhost:9200/test_index/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10, "regex" :
"^Alexander.*", "regex_flags" : "DOTALL"}
}
},
"query": {
"match_all": {}
}
}'
==> this gives
...
{
"term" : "Alexander Pasko",
"count" : 6
}
...


curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -
d '{
"query": {
"query_string": {
"query": "(author:"Alexander Pasko")"
}
},
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
}
}'

===> this also gives:
...
{
"term" : "Alexander Pasko",
"count" : 6
}
...

I don't understand. Someone else can explain?

LTVP

On May 8, 9:36 pm, Crwe tester.teste...@gmail.com wrote:

Ok, I managed to create a public version in the end.

The indexing script:http://pastebin.com/Rh47HRJx
Data:http://minus.com/meJfL8TFr/(pythonpickle format, 80kb)

The script uses Python ES wrapper ("$ easy_install pyes") to create
the mapping and index the data.

After indexing, the mismatch is exemplified by running:

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}

}'

which returns counts for various authors. I see count 4 for "Alexander
Pasko", for example (but this differs between indexing runs).

Then when I query "Alexander Pasko" directly:

curl -XGET 'localhost:9200/test_index/test_type/_search?pretty=true' -d '{
"query": {
"query_string": {
"query": "(author:\"Alexander Pasko\")"
}
}

}'

I get back 6 hits.

On May 8, 2:35 pm, Crwe tester.teste...@gmail.com wrote:

I've been trying to create a minimum failing example for the past two
hours, but failed.

When I enter the same four documents (and nothing else) into a new
testing index, the facet count matches the search hit count, like I
would expect.

I also tried deleting the index and re-indexing everything, using the
same script as before. Now the problem with "W. Ellis Penning" is gone
(facet returns count=4 as it should), but Mr. "Alexander Pasko"
returns a facet count of 4, while a direct query returns 6 hits.

So the issue seems to pop up randomly, with different records.

LTVP, I'll try sending you the dataset and the script privately
(unfortunately it's not public). Could you please try replicating
this? It's driving me crazy.

Apart from the dataset, let's keep the discussion here, in public.

Many thanks!

On May 8, 12:15 pm, Phu Le le.truong.vinh....@gmail.com wrote:

Hi Crwe,

It would be great if you could gist a recreation of your problem, as described at http://www.elasticsearch.org/help/

I'm interested in investigating it further ;)

LTVP

On May 8, 2012, at 6:10 PM, Crwe wrote:

Hello LTVP,

thanks for the quick reply!

  • In the term facet, you asked for 'author.untouched' version of 'author' field. So, no analyzer will be applied, and you get facet values on whole field.
  • In the query_string, you queried on 'author' field, which is 'author.author' version in your multi_field declaration. The standard analyzer will be applied in this case.
    So, the query input "W. Ellis Penning" will be broken into the tokens "W", "Ellis", "Penning", and any author whose name matches ANY of these tokens will be considered a match.

I don't think that's it. Firstly, the query is in double quotes "",
which I understand should search for the entire phrase.

Secondly, all four returned documents indeed contain "W. Ellis
Penning" verbatim, inside the author list. Why two of them were not
counted in the facet remains a mystery to me.
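
The difference between the two readings above (match ANY token versus match the whole quoted phrase) can be sketched in a few lines of Python. This is only a rough model: `analyze` is a made-up stand-in that lowercases and splits on non-word characters, which I am assuming is close enough to the default analyzer for names like these, and both matcher functions are hypothetical helpers, not Elasticsearch APIs.

```python
import re


def analyze(text):
    """Rough stand-in for an analyzer: lowercase, split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def matches_any_token(query, value):
    """True if the value shares at least one token with the query."""
    return bool(set(analyze(query)) & set(analyze(value)))


def matches_phrase(query, value):
    """True if the query tokens appear in the value consecutively and in order."""
    q, v = analyze(query), analyze(value)
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))


query = "W. Ellis Penning"
for author in ["W. Ellis Penning", "Ellis Smith", "Gary Free"]:
    print(author, matches_any_token(query, author), matches_phrase(query, author))
```

Under this model a made-up author such as "Ellis Smith" matches on a single token but not as a phrase, while "W. Ellis Penning" matches both ways. Since all four hits contain the full name verbatim, tokenization alone would not account for the missing facet counts.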

Hope this helps.

Regards,
LTVP

On May 8, 2012, at 5:40 PM, Crwe wrote:

Hello all,

I am new to ES and have two questions:

  1. I have an "author" field, which I want to be searchable using the
    default analyzer (~tokens). But I also want to be able to return facet
    counts for this field unanalyzed, as a whole string, no tokens.

I googled up this solution
http://blog.wiercinski.net/2011/uncategorized/elasticsearch-sorting-o...
mapping:

'properties': {
'author': {
"type": "multi_field",
"fields": {"author": {"type" : "string"}, "untouched" :
{"index" : "not_analyzed","type": "string"}}
}
}

And then I query with:

{
"query": {
"query_string": {"query": "some_token_from_author"}
},
"sort": [{"author.untouched" : "asc"}]
}

It seems to work, but is this the right way to do it? Is there a
better way to make a field token-searchable, but also return facets
for the whole string at the same time?


Question 2 (possibly related):

The query:

{
"facets": {
"author": {
"terms": { "field": "author.untouched", "size": 10}
}
},
"query": {
"match_all": {}
}
}

returns e.g. count=2 for author value of "W. Ellis Penning". But
when I run the query:

{
"query": {
"query_string": {
"query": "(author:\"W. Ellis Penning\")"
}
}
}

I actually get back 4 hits?! Why the difference? Why weren't the other
two documents counted in the facet?

The author field contains an array of values like

"_source": {
"author": [
"László G.-Tóth",
"Sandra Poikane",
"W. Ellis Penning",
"Gary Free",
"Helle Mäemets",
"Agnieszka Kolada",
"Jenica Hanganu"
]
}


(system) #10