Custom IdsFilter interface for filtered queries


(Shantanu Sen) #1

Hi,

We are currently using Lucene and are exploring Elasticsearch for scaling.
We need to filter queries by doc id, and the set of docs to filter on can be
quite large: out of a corpus of 10 million documents, a user can choose a
set of 5 million and run a query targeting that subset. Hence we need to
pass in a set of 5 million doc ids so that the query runs only on those
rather than on the full index.

I am planning to use a mapped _id field that will be set during index
mapping and then build a filtered query with IdsFilterBuilder. The issue is
that the API takes a list of strings and hence will not scale; ideally we
would like to pass in a bit set containing all the doc ids.
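
For a rough sense of the size difference, here is a minimal sketch. It assumes the doc ids are dense integers in the range 0 to 9,999,999, which the post does not state, and the class and method names are illustrative only.

```java
class IdSetSize {
    // Bytes needed to hold one bit per document in the corpus.
    static long bitSetBytes(int corpusSize) {
        return (corpusSize + 7L) / 8;
    }

    // Lower bound on the characters sent when every selected id is a decimal
    // string (ignores quotes, separators, and per-object overhead).
    static long decimalIdChars(int corpusSize, int step) {
        long chars = 0;
        for (int id = 0; id < corpusSize; id += step) {
            chars += Integer.toString(id).length();
        }
        return chars;
    }

    public static void main(String[] args) {
        int corpus = 10_000_000;
        // Assumed selection: every second document, i.e. 5 million ids.
        System.out.println("bit set bytes:   " + bitSetBytes(corpus));
        System.out.println("id string chars: " + decimalIdChars(corpus, 2));
    }
}
```

Under these assumptions the bit set needs 1,250,000 bytes no matter how many documents are selected, while the string list grows with every id added.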

We will be using the Java API. What is the best way to approach this issue?
I understand that we would need to write a custom API that will accept a
bit set. If we write a plugin, can we access the internal APIs of
Elasticsearch and hence avoid using the SearchRequestBuilder?

Is a plugin the right approach? Any pointers as to where to start?

Thanks,
Shantanu Sen

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/663af063-525d-42f8-a2dd-a208c65a7621%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Shantanu Sen) #2

As I am looking through the code, I am thinking of the following approach:

  1. Write a plugin that will accept an encoded string containing the doc ids
    instead of the array of ids
  2. Add a custom IdsFilterParser that will decode this string to a bit set
    and pass it downstream.
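
The encode and decode steps above could be sketched as follows. This uses only java.util and assumes the doc ids are small non-negative integers that can serve as bit positions; Base64 is just one possible string encoding, and the class and method names are hypothetical, not part of Elasticsearch.

```java
import java.util.Base64;
import java.util.BitSet;

class DocIdCodec {
    // Pack the bit set into bytes, then Base64-encode so it can travel as a string.
    static String encode(BitSet ids) {
        return Base64.getEncoder().encodeToString(ids.toByteArray());
    }

    // Reverse of encode: Base64 string back to the original bit set.
    static BitSet decode(String encoded) {
        return BitSet.valueOf(Base64.getDecoder().decode(encoded));
    }

    public static void main(String[] args) {
        BitSet selected = new BitSet();
        selected.set(3);
        selected.set(42);
        selected.set(9_999_999);

        String wire = encode(selected);
        BitSet roundTrip = decode(wire);
        System.out.println("round trip ok: " + roundTrip.equals(selected));
        System.out.println("encoded length: " + wire.length());
    }
}
```

A custom parser on the server side would call something like decode() and hand the resulting bit set to the filter.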

But it seems that the TermsFilter also needs to be customized (or a custom
TermsFilter added), as TermsFilter.getDocIdSet is the method that needs to
be overridden/modified to generate the DocIdSet from a set of doc ids
rather than from a list of TermsAndFields as it is now.
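
Conceptually, the customized filter would expose the permitted documents as a set that gets intersected with the query's matches. The following is a plain-Java analogy using java.util.BitSet, not Lucene's Filter or DocIdSet classes; the class name is hypothetical and the ids are assumed to be usable as bit positions.

```java
import java.util.BitSet;

class AllowedDocsFilter {
    // Keep only the query matches that are also in the allowed set.
    // The inputs are left unmodified.
    static BitSet apply(BitSet queryMatches, BitSet allowed) {
        BitSet result = (BitSet) queryMatches.clone();
        result.and(allowed);
        return result;
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet();
        matches.set(1);
        matches.set(2);
        matches.set(5);

        BitSet allowed = new BitSet();
        allowed.set(2);
        allowed.set(5);
        allowed.set(7);

        System.out.println(apply(matches, allowed)); // prints {2, 5}
    }
}
```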

Is this the right approach? Any pointers?

Thanks,
Shantanu Sen

On Wednesday, June 11, 2014 9:26:27 PM UTC-7, Shantanu Sen wrote:



(Yash Datta) #3

Hi Shantanu,

It has been ages, but I just wanted to know how you finally solved this problem?

Best Regards
Yash


(Shantanu Sen) #4

We used the parent-child relationship, with the content defined on the
parent and the filters/facets defined on the child, and filtered on the
child instead of sending out the list of ids.
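
As an illustration of what such a request might look like, here is a sketch that builds a request body in the 1.x-era query DSL (a filtered query with a has_child filter). The parent field "content", the child type "selection", and the child field "list_id" are hypothetical names; the post does not give the actual mapping. The helper does no JSON escaping, so it assumes plain values.

```java
class ParentChildQuery {
    // Full-text query on the parent, restricted to parents that have a
    // child document matching the given term.
    static String build(String text, String childType, String childField, String childValue) {
        return "{\n"
            + "  \"query\": {\n"
            + "    \"filtered\": {\n"
            + "      \"query\": { \"match\": { \"content\": \"" + text + "\" } },\n"
            + "      \"filter\": {\n"
            + "        \"has_child\": {\n"
            + "          \"type\": \"" + childType + "\",\n"
            + "          \"query\": { \"term\": { \"" + childField + "\": \"" + childValue + "\" } }\n"
            + "        }\n"
            + "      }\n"
            + "    }\n"
            + "  }\n"
            + "}";
    }

    public static void main(String[] args) {
        System.out.println(build("scaling lucene", "selection", "list_id", "42"));
    }
}
```

The request then carries a single child-side condition rather than millions of parent ids.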

Shantanu

