N:m lookup filter

Don_Clore · July 19, 2014, 2:24am

I am pretty sure this is not supported, but it'd be great to get explicit
confirmation/denial.

So....document types A and B, where there's an N:M relationship between A
and B, and document type B has a list of the document A instances that
relate to it.

More concretely, A is a sports Player data type, and B is a set of news
stories. The Story type has a list of the ids of the Players that the story
is about or related to.

So... I know the terms lookup filter allows one to use a single document as
the source of the terms for the lookup. What we'd like to be able to do
is expose a faceted/aggregations-based UI that lets the user perform a
variety of filtering operations on Players over a fairly extensive set of
criteria, and then have the resulting set of Player document ids serve as
the lookup into the Story documents, i.e., get all the stories that relate
to the Player result set.
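
For reference, the single-document terms lookup mentioned above might be built roughly like this, using the 1.x filter DSL. It is only a sketch: the index, type, id, and field names (`player_result_sets`, `player_result_set`, `abc123`, `player_ids`) are made up for illustration and do not come from a real mapping.

```python
import json


def story_lookup_body(lookup_index, lookup_type, lookup_id, path,
                      story_field="player_ids"):
    """Build a search body that filters stories by terms read from ONE document.

    The terms are taken from the `path` field of the document identified by
    lookup_index / lookup_type / lookup_id.
    """
    return {
        "query": {
            "filtered": {
                "filter": {
                    "terms": {
                        story_field: {
                            "index": lookup_index,
                            "type": lookup_type,
                            "id": lookup_id,
                            "path": path,
                        }
                    }
                }
            }
        }
    }


# Hypothetical names, for illustration only.
body = story_lookup_body("player_result_sets", "player_result_set",
                         "abc123", "player_ids")
print(json.dumps(body, indent=2))
```

The limitation is visible in the body: the lookup points at exactly one document id, so there is no place to put a query over Players.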

Obviously, we'd ideally like to do this in a single query, or failing that,
have some reasonably efficient way to issue the two query/filters (passing
a large result set of ids over the wire seems like a bad idea; I'm new to
ES, but...this kind of thing was never great with Solr).
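
The two-request fallback could be sketched as below. These are plain helper functions that only build request bodies and read a response; the `player_ids` field on stories, the `size` default, and the sample response are assumptions for illustration, and no client library is involved.

```python
def player_ids_request(player_filter, size=1000):
    """First request: run the Player filter and return ids only (no _source)."""
    return {
        "query": {"filtered": {"filter": player_filter}},
        "_source": False,
        "size": size,
    }


def ids_from_response(response):
    """Pull the document ids out of a search response."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def stories_request(player_ids, story_field="player_ids"):
    """Second request: a plain terms filter carrying the ids over the wire."""
    return {
        "query": {
            "filtered": {
                "filter": {"terms": {story_field: list(player_ids)}}
            }
        }
    }


# Hand-written stand-in for what the first search might return.
fake_response = {"hits": {"hits": [{"_id": "p1"}, {"_id": "p7"}]}}
first = player_ids_request({"term": {"position": "pitcher"}})
second = stories_request(ids_from_response(fake_response))
```

The second body grows with the number of matching Players, which is exactly the over-the-wire cost I'm worried about.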

One idea I had (perhaps half-baked) was to create a PlayerResultSet type,
with an id deterministically fashioned from the query/filter predicates,
such that the same user filtering action would result in the same
PlayerResultSet id each time. We'd issue a terms lookup filter request
using the PlayerResultSet id. If it fails because the PlayerResultSet
document doesn't exist, we'd have to issue the filter for the Players,
construct a PlayerResultSet doc and index it, and then query for the Stories
that have those Player ids. I'm not sure whether it would be worse to send
all the ids in a query, or to index the PlayerResultSet doc with
refresh=true (or issue the query and queue up the PlayerResultSet doc for
later indexing, or whatever).
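
A minimal sketch of the deterministic-id part, assuming the filter predicates can be expressed as a JSON-serializable dict. The function names, the SHA-1 choice, and the in-memory `store` dict standing in for the PlayerResultSet index are all hypothetical.

```python
import hashlib
import json


def result_set_id(predicates):
    """Derive a stable id from the filter predicates.

    Keys are sorted so that the same predicates always serialize the same way.
    """
    canonical = json.dumps(predicates, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def get_or_build_result_set(predicates, store, run_player_filter):
    """Reuse the stored result set if present, otherwise build and store it.

    `store` is a dict standing in for the PlayerResultSet index, and
    `run_player_filter` stands in for the actual Player search.
    """
    rs_id = result_set_id(predicates)
    if rs_id not in store:
        store[rs_id] = {"player_ids": run_player_filter(predicates)}
    return rs_id, store[rs_id]


store = {}
calls = []


def fake_player_filter(predicates):
    # Records each call so we can see how often the "search" really runs.
    calls.append(predicates)
    return ["p1", "p7"]


first_id, _ = get_or_build_result_set(
    {"team": "Mariners", "position": "P"}, store, fake_player_filter)
second_id, _ = get_or_build_result_set(
    {"position": "P", "team": "Mariners"}, store, fake_player_filter)
```

Both calls yield the same id and the Player filter runs only once. Note that sorting keys does not normalize the order of values inside lists, so a list-valued predicate would need to be sorted before hashing.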

The Player data should be fairly static; we could delete the PlayerResultSet
documents and recreate them each time we refresh Player data.

OK, that sounds pretty awful; I'm hoping someone has a less Rube Goldberg
approach. Obviously, I'm sort of building my own filter query caching
mechanism; hopefully something like this can be achieved more easily with
the built-in filter caching.

thanks for any insights,
Don

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/91919a48-0892-4878-890b-e14c67fd40b5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · July 19, 2014, 7:13am

Yes, I think this is somehow related to Matt's Join Filter

github.com/elastic/elasticsearch

Terms Lookup by Query/Filter (aka. Join Filter)

elastic:master ← mattweber:terms_lookup_by_query

opened 05:20PM - 01 Jul 13 UTC

mattweber

+4543 -187

This PR adds support for generating a terms filter based on the field values of documents matching a specified lookup query/filter. The value of the configurable "path" field is collected from the field data cache for each document matching the lookup query/filter and is then used to filter the main query. This can also be called a join filter.

This PR abstracts the TermsLookup functionality in order to support multiple lookup methods. The existing functionality is moved into FieldTermsLookup and the new query-based lookup is in QueryTermsLookup. All existing caching functionality works with the new query-based lookup for increased performance.

During testing I found that one of the performance bottlenecks was generating the Lucene TermsFilter on large sets of terms (probably since it sorts the terms). I have created a FieldDataTermsFilter that uses the field data cache to look up the value of the field being filtered and compare it to the set of gathered terms. This significantly increased performance at the cost of higher memory usage. Currently a TermsFilter is used when the number of filtering terms is less than 1024 and the FieldDataTermsFilter is used for everything else. This should eventually be configurable, or we need to perform some tests to find the optimal value.

Examples:

Replicate a has_child query by joining on the child's "pid" field to the parent's "id" field for each child that has the tag "something".

```
curl -XPOST 'http://localhost:9200/parentIndex/_search' -d '{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "id": {
            "index": "childIndex",
            "type": "childType",
            "path": "pid",
            "query": {
              "term": { "tag": "something" }
            }
          }
        }
      }
    }
  }
}'
```

Look up companies that offer products or services mentioning elasticsearch. Notice that products and services are kept in their own indices.

```
curl -XPOST 'http://localhost:9200/companies/_search' -d '{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "company_id": {
            "indices": ["products", "services"],
            "path": "company_id",
            "filter": {
              "term": { "description": "elasticsearch" }
            }
          }
        }
      }
    }
  }
}'
```

Jörg

Don_Clore · July 26, 2014, 3:52am

Does anyone know the status of that pull request? Is it likely to be
approved?

thanks,
Don

Terms Lookup by Query/Filter (aka. Join Filter) by mattweber · Pull Request #3278 · elastic/elasticsearch · GitHub

mattweber · July 26, 2014, 5:45am

It's currently blocked until we can figure out a way to prevent a bad query
from triggering an OOM error. The goal (as far as I've been told) is to
get this in, but no ETA. I need to update the PR to the latest master as
there have been significant changes as well.

Thanks,
Matt Weber
