I have the following problem: my documents have a field 'xxx' whose values
may be duplicated across the entire index.
I want to do a very simple thing: query the index with a bool query on all
my other fields, but have the query return only results that are distinct
on xxx. My index models people, and people who live in the same house count
as duplicates. I would like only distinct houses in my results, while the
search itself still runs across all houses.
I know the duplicates in advance, as this is a one-time indexing job. Is
there a trick that enables this in Elasticsearch? From what I have read,
distinct is not available in Elasticsearch or Lucene out of the box.
I am asking for advanced ideas on how to make this happen, including
clever indexing, since I have full control over the index and know the
duplicates in advance.
I have two scenarios:
1. I want to count the results of a given query. This needs to be very fast.
2. I want to retrieve the actual documents. Performance does not matter here.
As far as I know, there is no grouping functionality in ES (as there is in
Solr, for instance). However, it does have a nifty parent/child feature you
may want to take a look at. You could index the people as children of house
objects, and then make use of the "has_child" query. This returns only
parent items whose children match the given query. I hope that helps.
Best,
Jorge
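
For illustration, the has_child suggestion could be expressed as a request
body along these lines. This is only a minimal sketch: the type names
`house` and `person` and the fields `first_name` and `city` are hypothetical,
and how the body is sent (including whether a count request expects the
outer `query` wrapper) depends on the Elasticsearch version and client in
use. Because has_child matches each parent at most once, the total hit count
of such a search is the number of distinct houses, which is what the
counting scenario asks for.

```python
import json


def distinct_houses_query(person_query, child_type="person"):
    """Wrap a query on person documents so that it matches house parents.

    `child_type` is an assumed type name, not one taken from the thread.
    """
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "query": person_query,
            }
        }
    }


# Hypothetical bool query over fields of the person documents.
person_query = {
    "bool": {
        "must": [
            {"term": {"first_name": "david"}},
            {"term": {"city": "boston"}},
        ]
    }
}

body = distinct_houses_query(person_query)
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of a search request against the
parent type; only house documents come back, one per matching house.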
(Original post by David MZ: Tuesday, June 18, 2013 4:17:35 PM UTC-4.)
No, has_child does not return the matching children; it only returns
parents. There is a "has_parent" query which is the mirror of this, or you
could simply gather up the parent IDs and retrieve the children directly.
Also check out "top_children", which appears to be similar to has_child but
may perform better, as it does not seem to need to traverse the entire
index.
Best of luck,
Jorge
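
One way to read the "gather up the IDs" idea is as a second request: after a
has_child search has returned the matching houses, ask for the people that
both match the original person query and belong to one of those houses. The
sketch below only builds such a request body. The type name `house`, the
helper name `children_of_houses`, and the sample IDs and field are
assumptions for illustration, and the exact has_parent syntax should be
checked against the Elasticsearch version in use.

```python
def children_of_houses(house_ids, person_query, parent_type="house"):
    """Build a body matching people that satisfy `person_query` and whose
    parent is one of `house_ids`. Names here are hypothetical."""
    return {
        "query": {
            "bool": {
                "must": [
                    # The original query on the person fields.
                    person_query,
                    # Restrict to children of the houses found earlier.
                    {
                        "has_parent": {
                            "parent_type": parent_type,
                            "query": {"ids": {"values": list(house_ids)}},
                        }
                    },
                ]
            }
        }
    }


followup = children_of_houses(
    ["house-1", "house-7"],
    {"term": {"city": "boston"}},
)
print(followup["query"]["bool"]["must"][1]["has_parent"]["parent_type"])
```

Note that this can still return several people per house, so it does not by
itself produce a distinct result.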
My issue is that I need to simulate distinct, so I need to know which of
the children triggered the parent match, so that I can include that child
in the results; two children may have the same parent.
The has_parent query won't give me a distinct answer.
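
Since performance does not matter for the retrieval scenario, one option
that sidesteps the problem is to run the bool query directly against the
person documents and collapse the hits on the client, keeping the first
(highest-ranked) person seen for each house. The sketch below is not a
method proposed in the thread. It assumes each hit looks like an entry of
the `hits.hits` array in a search response, that the house value is stored
in `_source` under the field 'xxx', and that all matching hits have been
fetched (for example by paging); the sample data is made up.

```python
def first_hit_per_house(hits, house_field="xxx"):
    """Return the hits in their original order, keeping only the first hit
    seen for each distinct value of `house_field`."""
    seen = set()
    distinct = []
    for hit in hits:
        house = hit["_source"][house_field]
        if house not in seen:
            seen.add(house)
            distinct.append(hit)
    return distinct


# Made-up hits, already sorted by descending score as a search returns them.
hits = [
    {"_id": "p1", "_score": 2.0, "_source": {"name": "Ann", "xxx": "house-1"}},
    {"_id": "p2", "_score": 1.5, "_source": {"name": "Bob", "xxx": "house-1"}},
    {"_id": "p3", "_score": 1.0, "_source": {"name": "Cid", "xxx": "house-2"}},
]

print([h["_id"] for h in first_hit_per_house(hits)])  # ['p1', 'p3']
```

Each surviving hit is the actual person document that matched, so it is
known which child stands in for each house.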