We are currently using Lucene and are exploring Elasticsearch for scaling.
We have a requirement to filter queries by doc id, and the set of docs
to be filtered can be quite large: e.g. out of a corpus of 10 million
documents, a user can choose a subset of 5 million and run a query targeting
that subset. Hence we need to pass in a set of 5 million doc ids so that
the query runs only on those documents rather than on the full index.
I am planning to map the _id field during index creation and then use
IdsFilterBuilder to build a filtered query. The issue is that the API takes
a list of strings and hence will not scale - ideally we would like to pass
in a bit set containing all the doc ids.
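For scale, a plain java.util.BitSet already represents the ids far more compactly than millions of id strings: one bit per candidate doc, so a 10-million-doc corpus fits in roughly 1.25 MB, and the payload can travel as a Base64 string. A minimal sketch of such an encoding (the class name DocIdSetCodec is made up for illustration; Java 8's java.util.Base64 is assumed):

```java
import java.util.Base64;
import java.util.BitSet;

public class DocIdSetCodec {
    // Encode a bit set of numeric doc ids as a compact Base64 string.
    public static String encode(BitSet ids) {
        return Base64.getEncoder().encodeToString(ids.toByteArray());
    }

    // Reconstruct the bit set from the Base64 payload.
    public static BitSet decode(String encoded) {
        return BitSet.valueOf(Base64.getDecoder().decode(encoded));
    }

    public static void main(String[] args) {
        BitSet ids = new BitSet(10_000_000);
        ids.set(0, 5_000_000);          // a user-selected subset of 5M doc ids
        String payload = encode(ids);
        // 5M set bits -> 625,000 raw bytes -> 833,336 Base64 chars,
        // versus tens of MB for 5M ids sent as individual strings
        System.out.println(payload.length());
        System.out.println(decode(payload).cardinality());
    }
}
```

This only helps, of course, if the _id values are (or can be mapped to) dense integers that serve as bit indices.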
We will be using the Java API. What is the best way to approach this issue?
I understand that we would need to write a custom API that accepts a
bit set. If we write a plugin, can we access the internal APIs of
Elasticsearch and thus bypass the SearchRequestBuilder?
Is a plugin the right approach? Any pointers on where to start?
As I am looking through the code, I am thinking of the following approach:
1. Write a plugin that accepts an encoded string containing the doc ids
instead of the array of ids.
2. Add a custom IdsFilterParser that decodes this string into a bit set
and passes it downstream.
But it seems that the TermsFilter also needs to be customized (or a custom
TermsFilter added), since TermsFilter.getDocIdSet is the method that needs to
be overridden/modified to generate the DocIdSet from a set of doc ids
rather than from a list of TermsAndFields as it does now.
Is this the right approach? Any pointers?
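The server-side half of step 2 can be kept simple: decode the payload back into a bit set and walk its set bits. A pure-JDK sketch of that piece (the class name IdsFilterSketch is made up; the Lucene-specific work a real filter would do per set bit - seeking the _id term and marking the matching internal doc in the DocIdSet - is noted in comments only):

```java
import java.util.ArrayList;
import java.util.Base64;
import java.util.BitSet;
import java.util.List;

public class IdsFilterSketch {
    // What a custom IdsFilterParser could do with the request payload:
    // decode the Base64 string back into a BitSet, then walk the set bits.
    // In the real filter, each set bit i would be used to seek the term for
    // _id "i" in the segment and mark the matching internal Lucene doc.
    public static List<Integer> decodeIds(String payload) {
        BitSet ids = BitSet.valueOf(Base64.getDecoder().decode(payload));
        List<Integer> out = new ArrayList<>();
        for (int i = ids.nextSetBit(0); i >= 0; i = ids.nextSetBit(i + 1)) {
            out.add(i);  // bit index == numeric doc id, by convention
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet ids = new BitSet();
        ids.set(3);
        ids.set(42);
        ids.set(1_000_000);
        String payload = Base64.getEncoder().encodeToString(ids.toByteArray());
        System.out.println(decodeIds(payload));  // [3, 42, 1000000]
    }
}
```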
Thanks,
Shantanu Sen
On Wednesday, June 11, 2014 9:26:27 PM UTC-7, Shantanu Sen wrote:
Hi,
[quoted text of the original message above]
We ended up using the parent-child relationship, with the content defined
on the parent and the filters/facets defined on the child, and used that as
the filter instead of sending out the list of ids.