Question regarding field collapsing and sharding


(Jean Sebastien Vachon) #1

Hi all,

I am a total newbie with ES but have a lot of experience with Solr and Lucene. I am currently playing with ES to see whether it has the same issues and limitations as Solr, mainly regarding field collapsing and sharding.

In Solr, if I want to group on a certain field and get accurate counts, I need to route all documents with the same value for that field to the same shard (i.e., all documents with value X for field F must be on the same shard).
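To make that routing constraint concrete, here is a minimal Python sketch. It assumes a hypothetical hash-based router: `shard_for`, `NUM_SHARDS`, and the sample documents are made up for illustration, and this is not the actual routing code of Solr or ES. The only point is that using the value of field F as the routing key sends every document with value X to the same shard.

```python
import zlib

NUM_SHARDS = 4  # assumption: illustrative shard count


def shard_for(routing_value: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from a stable hash of the routing value (hypothetical router)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


# Hypothetical documents; "F" is the field I want to group on.
docs = [
    {"id": 1, "F": "X"},
    {"id": 2, "F": "Y"},
    {"id": 3, "F": "X"},
]

# Group document ids by the shard their F value routes to.
shards = {}
for doc in docs:
    shards.setdefault(shard_for(doc["F"]), []).append(doc["id"])

print(shards)
```

Documents 1 and 3 share the value X, so they always end up in the same shard list, whichever shard that happens to be.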

Is that true for ES as well? I tried following the example from the docs that combines the top_hits aggregator with a sub-aggregation:

https://www.elastic.co/guide/en/elasticsearch/reference/1.5/search-aggregations-metrics-top-hits-aggregation.html#_field_collapse_example

So far my request looks like this (basically, get a list of employers sorted by the number of places/cities):

GET _search
{
  "size": 0,
  "query": {
    "match": {
      "description": "java"
    }
  },
  "aggs": {
    "top_employers": {
      "terms": {
        "field": "employer_id",
        "order": {
          "empl-place": "desc"
        },
        "size": 1
      },
      "aggs": {
        "top_employer_hits": {
          "top_hits": {
            "_source": [
              "place_id",
              "content_id"
            ],
            "size": 10
          }
        },
        "empl-place": {
          "value_count": {
            "field": "place_id"
          }
        }
      }
    }
  }
}

And the result I am getting is (I removed some stuff for clarity):

{
  ...
  "aggregations": {
    "top_employers": {
      ...
      "buckets": [
        {
          ...
          "top_employer_hits": {
            "hits": {
              "total": 4,
              "max_score": 0.9461352,
              "hits": [
                {
                  ...
                  "_source": {
                    "content_id": "768474767",
                    "place_id": 485285
                  }
                },
                {
                  ...
                  "_source": {
                    "content_id": "768474767",
                    "place_id": 485285
                  }
                },
                {
                  ...
                  "_source": {
                    "content_id": "763490271",
                    "place_id": 0
                  }
                },
                {
                  ...
                  "_source": {
                    "content_id": "768473591",
                    "place_id": 485285
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}

As you can see, I am receiving multiple hits with the same value for the place_id field. Could that be caused by the data not being distributed correctly across the shards?
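To show the duplication concretely, here is a small Python sketch that tallies the place_id values in the four hits above. The `hits` list is copied by hand from the trimmed response, the variable names are my own, and nothing here talks to Elasticsearch.

```python
from collections import Counter

# The four "_source" objects from the top_hits output above, copied by hand.
hits = [
    {"content_id": "768474767", "place_id": 485285},
    {"content_id": "768474767", "place_id": 485285},
    {"content_id": "763490271", "place_id": 0},
    {"content_id": "768473591", "place_id": 485285},
]

# Count how many hits carry each place_id value.
place_counts = Counter(hit["place_id"] for hit in hits)
distinct_places = len(place_counts)

print(place_counts)
print(distinct_places)
```

So the bucket has four hits but only two distinct place_id values, which is the duplication I am asking about.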

Thanks

