Custom Plugin for specifying custom filter attributes at query time

Hi All,

I am new to ES and I have the following requirement:
I need to specify a list of strings as a filter that applies to a specific
field in the document. This is what a filter does, but instead of sending
the strings with the query, I would like them to be populated from an
external source, such as a DB. Can you please guide me to relevant examples
or references for achieving this on v1.1.2?

Thanks,
Sandeep

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0093d97d-0f47-48e9-ba19-85b0850eda89%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Maybe I do not fully understand, but in a client, you can fetch the
required filter terms from any external source before a JSON query is
constructed?
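
A minimal sketch of that client-side approach, assuming the ES 1.x
"filtered" query with a "terms" filter. The class name, the field name
"field1", and fetchTermsFromExternalSource() are hypothetical stand-ins;
a real client would run a DB query there and would normally use a JSON
library or the Java API instead of string concatenation.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: fill a terms filter from an external source before the query is sent. */
final class TermsFilterQuery {

    // Hypothetical stand-in for a DB lookup; a real client would run a SELECT here.
    static List<String> fetchTermsFromExternalSource() {
        return Arrays.asList("alpha", "beta");
    }

    // Minimal JSON string escaping: quotes, backslashes, and control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    // Wraps a match_all query in a terms filter on the given field.
    static String build(String field, List<String> terms) {
        StringBuilder values = new StringBuilder();
        for (String term : terms) {
            if (values.length() > 0) {
                values.append(',');
            }
            values.append(quote(term));
        }
        return "{\"query\":{\"filtered\":{\"query\":{\"match_all\":{}},"
             + "\"filter\":{\"terms\":{" + quote(field) + ":[" + values + "]}}}}}";
    }

    public static void main(String[] args) {
        // The resulting JSON would be sent as the body of a search request.
        System.out.println(build("field1", fetchTermsFromExternalSource()));
    }
}
```

The point of the sketch is only the order of the steps: the terms are
fetched first, and ES sees an ordinary terms filter.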

Can you give an example of what you want to achieve?

Jörg


Hi,

Just to give some background: I will have a large-ish corpus of more than
100M documents indexed. The filters that I want to apply will be on a field
that is not indexed. That is, I prefer not to have it indexed in ES/Lucene,
since its values will change frequently. Instead, I will maintain them
elsewhere, such as in a DB.

Every time I have a query, I would want to filter the results by those
fields that are not indexed in Lucene. And I am guessing that number may
well be more than 1M. In that case, since we would have to maintain some
sort of TermsFilter, I think it may not scale linearly. What I would prefer
is a hook inside the ES query, so that I can inject the required filter
values at query time. Since the filter values have to be recognized by
Lucene, and I will not be indexing them, I will need some quick mapping
from those values to some field in Lucene that I can use in the filter. I
am not sure whether we can access and set Lucene DocIDs in the filter, or
whether they are even exposed in ES.

Please assist with this query.

Thanks,
Sandeep


What I understand is that a TermsFilter (see
reference/current/query-dsl-terms-filter.html in the Elasticsearch
documentation) is required, and that the source of the terms is a DB. That
is no problem. The plan is: fetch the terms from the DB, build the query
(with either the Java API or JSON), and execute it.

What I don't understand is the part about the "quick mapping", Lucene, and
the doc IDs. Lucene doc IDs are not reliable and are not exposed by
Elasticsearch. Elasticsearch uses its own document identifiers, which are
stable and augmented with info about the index type they belong to, in
order to make them unique. But I do not understand why this is important in
this context.

The Elasticsearch API uses query builders and filter builders to build
search requests. A "quick mapping" is just fetching the terms from the DB
as a string array before this API is called.

I also do not understand the role of the number "1M". Is this the number of
fields or the number of terms? Is it a total number or a number per query?

Did I misunderstand anything else? I am not really sure what the challenge
is...

Jörg


Hi,

Appreciate your continued assistance. :)

Disclaimer: I do not yet understand the ES sources well enough to describe
my scenario completely. Some of the info below may be conjecture.

I would have a corpus of 50M docs (actually a lot more, but this is for
testing now), out of which I would have, say, up to 1M DocIds to be used as
a filter. This set of 1M docs can differ between use cases; the point is
that up to 1M docIds can form one logical set of documents for filtering
results. If I use a simple IdsFilter from the ES Java API, I would have to
keep adding these 1M docs to its internal List implementation, and I have a
feeling it may not scale very well, as the set may change per use case and
also per combination within a single use case.

As I debug the code, the IdsFilter will be converted to a Lucene filter.
Lucene filters, on the other hand, operate on a docId bitset type. That
gels very well with my requirement, since I can scale with BitSets (I
assume).
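
To make the bit set idea concrete, here is a rough sketch that uses
java.util.BitSet as a stand-in for Lucene's own bit set classes. The class
name and the numbers in main() are illustrative only, and the bit
positions are assumed to be stable document ordinals, which is an
assumption of the sketch and not something ES provides.

```java
import java.util.BitSet;

/** Sketch: a filter as a bit set over document ordinals, intersected with query matches. */
final class BitSetFilterSketch {

    // Keeps only the matches that are also set in the allowed set.
    // Both inputs are left unmodified.
    static BitSet applyFilter(BitSet matches, BitSet allowed) {
        BitSet result = (BitSet) matches.clone();
        result.and(allowed);
        return result;
    }

    public static void main(String[] args) {
        int corpusSize = 50_000_000;          // corpus size taken from the discussion

        BitSet allowed = new BitSet(corpusSize);
        allowed.set(0, 1_000_000);            // illustrative: the first 1M ordinals pass the filter

        BitSet matches = new BitSet(corpusSize);
        matches.set(999_990, 1_000_010);      // illustrative: 20 documents match the query

        BitSet filtered = applyFilter(matches, allowed);
        System.out.println("documents after filtering: " + filtered.cardinality());

        // One bit per document, so a full-width filter costs about corpusSize / 8 bytes.
        System.out.println("approx. bytes per filter: " + (corpusSize / 8));
    }
}
```

At one bit per document, a filter over 50M documents needs about 6.25 MB
regardless of how many of its bits are set, which is why the approach
looks attractive for sets of up to 1M documents.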

If I can find a way to plug this BitSet directly as a Lucene Filter into
the Lucene search() call, bypassing the ES filters, perhaps through some
sort of plugin, I believe that would serve my purpose. I assume I may not
get to use the ES filter cache, but I can probably cache these BitSets
myself for subsequent use.

Please let me know. And thanks!

Thanks,
Sandeep


Thanks for being so patient with me :)

I understand now the following: there are 50m documents in an external DB,
from which up to 1m are to be exported in the form of document identifiers
to work as a filter in ES. The idea is to use internal mechanisms like bit
sets. There is no API for manipulating filters in ES at that level; ES
receives the terms and passes them into the Lucene TermFilter class
according to the type of the filter.

What is a bit unclear to me: how is the filter set constructed? I assume it
would be a select statement on the database?

Next, if you have this large set of document identifiers selected, I do not
understand what base query you want to apply the filter to. Is there a
user-given query for ES? What does such a query look like? Is it assumed
that there are other documents in ES that are somehow related to the 50m
documents? An illustrative example of the steps in the scenario would
really help in understanding the data model.

Just some food for thought: it is close to impossible to filter in ES on 1m
unique terms in a single step. The default maximum number of clauses in a
Lucene query is, for good reason, limited to 1024. A workaround would be to
iterate over the 1m terms, execute about 1000 filter queries, and add up
the results. This takes a long time and may not be the desired solution.

Fortunately, in most situations it is possible to find a more concise
grouping that reduces the 1m document identifiers to fewer ones for more
efficient filtering.

Jörg


Hi,

A little clarification:

Assume a sample data set of 50M documents. The documents need to be
filtered by a field, Field1. However, at indexing time, this field is NOT
written to the document in Lucene through ES. Field1 changes frequently,
and hence we would like to maintain it outside.

(The following paragraph can be skipped.)
Now assume that there are a few such fields, Field1, ..., FieldN. For every
document in the corpus, the value of Field1 comes from a pool of 100-odd
values. Thus, for example, one of those 100-odd values of Field1 may
correspond to as many as 1M documents, while at the low end another value
may correspond to only 10 documents.

(Continue reading) :)
At system startup, I would make sure that all the BitSets I plan to use for
any Filters are loaded in memory, so that my cache framework is warm and I
can look up the relevant filter values for a particular query from this
cache at query run time. The loading mechanism is still unknown, but please
assume that this BitSet will be readily available at query time.
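
As a rough illustration of the warm cache described above, the following
sketch keeps one BitSet per Field1 value in memory. Everything in it is
hypothetical (the class name, the method names, and the use of
java.util.BitSet with stable document ordinals); it is not an ES or Lucene
API.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: in-memory cache of filter bit sets, keyed by the value of Field1. */
final class FilterBitSetCache {

    private final Map<String, BitSet> byValue = new ConcurrentHashMap<>();

    // Called at startup: registers the documents (by ordinal) that carry the given value.
    void warm(String fieldValue, int[] docOrdinals) {
        BitSet bits = new BitSet();
        for (int ordinal : docOrdinals) {
            bits.set(ordinal);
        }
        byValue.put(fieldValue, bits);
    }

    // Called at query time: returns a copy, or an empty set for an unknown value.
    BitSet lookup(String fieldValue) {
        BitSet bits = byValue.get(fieldValue);
        return bits == null ? new BitSet() : (BitSet) bits.clone();
    }
}
```

The copy in lookup() protects the cached bits from callers that modify the
result; with bit sets of several megabytes, a read-only view would be the
cheaper choice.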

This BitSet will correspond to the DocIDs in Lucene for the particular
value of Field1 that I want to filter on. I plan to create a Filter
subclass in Lucene that accepts this DocIdSet.

What I am unable to understand is how I can achieve this in ES. I have been
exploring different mail threads on this forum, and it seems that certain
plugins can achieve this. Please see the list below of what I could find on
this forum.

Can you please tell me how an IndexQueryParserModule would serve my use
case? If you can provide some pointers on writing a plugin that can
leverage a CustomFilter, that would be immensely helpful. Thanks,

1. https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/IndexQueryParserModule$20Plugin/elasticsearch/5Gqxx3UvN2s/FL4Lb2RxQt0J
2. Redirecting to Google Groups
3. Plugins: Allow to easily plug a custom DSL query/filter parsers · Issue #208 · elastic/elasticsearch · GitHub
4. http://elasticsearch-users.115913.n3.nabble.com/custom-filter-handler-plugin-td4051973.html

Thanks,
Sandeep

On Mon, Jul 7, 2014 at 2:17 AM, joergprante@gmail.com <joergprante@gmail.com

wrote:

Thanks for being so patient with me :slight_smile:

I understand now the following: there are 50m of documents in an external
DB, from which up to 1m is to be exported in form of document identifiers
to work as a filter in ES. The idea is to use internal mechanisms like bit
sets. There is no API for manipulating filters in ES on that level, ES
receives the terms and passes them into Lucene TermFilter class according
to the type of the filter.

What is a bit unclear to me: how is the filter set constructed? I assume
it should be a select statement on the database?

Next, if you have this large set of document identifiers selected, I do
not understand what is the base query you want to apply the filter on? Is
there a user given query for ES? How does such query looks like? Is it
assumed there are other documents in ES that are related somehow to the 50m
documents? An illustrative example of the steps in the scenario would
really help to understand the data model.

Just some food for thought: it is close to impossible to filter in ES on
1m unique terms with a single step - the default setting of maximum clauses
in a Lucene Query is for good reason limited to 1024 terms. A workaround
would be iterating over 1m terms and execute 1000 filter queries and add up
the results. This takes a long time and may not be the desired solution.

Fortunately, in most situations, it is possible to find more concise
grouping to reduce the 1m document identifiers into fewer ones for more
efficient filtering.

Jörg

On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via
elasticsearch elasticsearch@googlegroups.com wrote:

Hi,

Appreciate your continued assistance. :slight_smile: Thanks,

Disclaimer: I am yet to sufficiently understand ES sources so as to
depict my scenario completely. Some info' below may be conjecture.

I would have a corpus of 50M docs (actually lot more, but for testing
now) out of which I would have say, upto, 1M DocIds to be used as a filter.
This set of 1M docs can be different for different use cases, the point
being, upto 1M docIds can form one logical set of documents for filtering
results. If I use a simple IdsFilter from ES Java API, I would have to keep
adding these 1M docs to the List implementation internally, and I have a
feeling it may not scale very well as they may change per use case and per
some combinations internal to a single use case also.

As I debug the code, the IdsFilter will be converted to a Lucene filter.
Lucene filters, on the other hand, operate on a docId bitset type. That
gels very well with my requirement, since I can scale with BitSets (I
assume).

If I can find a way to directly plug this BitSet as a Lucene Filter to
the Lucene search() call bypassing the ES filters using, I dont know, may
some sort of a plugin, I believe that may support my cause. I assume I may
not get to use the Filter cache from ES but probably I can cache these
BitSets for subsequent use.

Please let me know. And thanks!

Thanks,
Sandeep

On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:

What I understand is a TermsFilter is required

Elasticsearch Platform — Find real-time answers at scale | Elastic
reference/current/query-dsl-terms-filter.html

and the source of the terms is a DB. That is no problem. The plan is:
fetch the terms from the DB, build the query (either Java API or JSON) and
execute it.

What I don't understand is the part with the "quick mapping", Lucene,
and the doc ids. Lucene doc IDs are not reliable and are not exposed by
Elasticsearch, Elasticsearch uses it's own document identifiers which are
stable and augmented with info about the index type they belong to, in
order to make them unique. But I do not understand why this is important in
this context.

Elasticsearch API uses query builders and filter builders to build
search requests . A "quick mapping" is just fetching the terms from the DB
as a string array before this API is called.

I also do not understand the role of the number "1M", is this the number
of fields, or the number of terms? Is it a total number or a number per
query?

Did I misunderstand anything more? I am not really sure what is the
challenge...

Jörg

On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via
elasticsearch elasti...@googlegroups.com wrote:

Hi,

Just to give some background. I will have a large-ish corpus of more
than 100M documents indexed. The filters that I want to apply will be on a
field that is not indexed. I mean, I prefer to not have them indexed in
ES/Lucene since they will be frequently changing. So, for that, I will be
maintaining them elsewhere, like a DB etc.

Everytime I have a query, I would want to filter the results by those
fields that are not indexed in Lucene. And I am guessing that number may
well be more than 1M. In that case, I think, since we will maintain some
sort of TermsFilter, it may not scale linearly. What I would want to do,
preferably, is to have a hook inside the ES query, so that I can, at query
time, inject the required filter values. Since the filter values have to be
recognized by Lucene, and I will not be indexing them, I will need to do
some quick mapping to get those fields and map them quickly to some field
in Lucene that I can save in the filter. I am not sure whether we can
access and set Lucene DocIDs in the filter or whether they are even exposed
in ES.

Please assist with this query. Thanks,

Thanks,
Sandeep


In Elasticsearch, you can extend the existing queries and filters with a
plugin, with the help of addQuery/addFilter at IndexQueryParserModule.

Each query or filter comes in a pair of classes, a builder and a parser.

A filter builder manages the syntax: it serializes the filter
specification, in its inner/outer representation, with the help of the
XContent classes.

A filter parser parses such a structure and turns it into a Lucene Filter
for internal processing.

So one approach would be to look at how your bit set implementation can be
turned into a Lucene Filter. An instructive example to start from is
org.elasticsearch.index.query.TermsFilterParser/TermsFilterBuilder

An example where terms from fielddata cache are read and turned into a
filter is org.elasticsearch.index.search.FielddataTermsFilter

A key line is the method

    public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs)
            throws IOException
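
To make the role of this method a bit more concrete: Lucene calls it once per index segment, and the doc IDs in the returned set are relative to that segment, while acceptDocs marks the documents that may still match. Below is a minimal sketch of that slicing step using only java.util.BitSet as a stand-in. SegmentSlicer, sliceForSegment and its parameters are hypothetical names for illustration, not Lucene or Elasticsearch classes, and the sketch assumes a bit set addressed by top-level doc IDs is already at hand (keeping in mind that Lucene doc IDs are not stable, as noted earlier in this thread).

```java
import java.util.BitSet;

final class SegmentSlicer {
    // Copies the bits of one segment out of a bit set addressed by top-level doc IDs.
    // docBase:    first top-level doc ID of the segment
    // maxDoc:     number of documents in the segment
    // acceptDocs: segment-relative bits of documents that may match (null means all)
    static BitSet sliceForSegment(BitSet global, int docBase, int maxDoc, BitSet acceptDocs) {
        // BitSet.get(from, to) returns a copy re-based so that 'from' becomes bit 0.
        BitSet local = global.get(docBase, docBase + maxDoc);
        if (acceptDocs != null) {
            local.and(acceptDocs); // drop documents that are not accepted
        }
        return local;
    }
}
```

In a real filter the result would have to be wrapped in a DocIdSet; the helper classes mentioned below are a place to look for that.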

An example for caching filters
is org.elasticsearch.indices.cache.filter.terms.IndicesTermsFilterCache
(the caching of filters in ES is done with Guava's cache classes)

Also, it could be helpful to study helper classes in this context like in
package org.elasticsearch.common.lucene.docset

I am not aware of a filter plugin yet, but I could possibly sketch the
source code of a demo filter plugin on GitHub.

Jörg

On Mon, Jul 7, 2014 at 3:49 PM, Sandeep Ramesh Khanzode <
k.sandeep.r@gmail.com> wrote:

Hi,

A little clarification:

Assume a sample data set of 50M documents. The documents need to be filtered
by a field, Field1. However, at indexing time, this field is NOT written to
the document in Lucene through ES. Field1 is a frequently changing field
and hence, we would like to maintain it outside ES.

(The following paragraph can be skipped.)
Now assume that there are a few such fields, Field1, ..., FieldN. For
every document in the corpus, the value of Field1 comes from a pool of
100-odd values. Thus, for example, one of those 100-odd values may
correspond to as many as 1M documents, while at the tail end a value may
correspond to as few as 10 documents.

(Continue reading) :)
I would, at system startup time, make sure that I have loaded into memory
all relevant BitSets that I plan to use for any Filters, so that my cache
framework is warm and I can look up the relevant filter values for a
particular query from this cache at query run time. The mechanisms for this
loading are still unknown, but please assume that this BitSet will be
readily available at query time.

This BitSet will correspond to the DocIDs in Lucene for a particular value
of Field1 that I want to filter on. I plan to write a subclass of Lucene's
Filter that will accept this DocIdSet.

What I am unable to understand is how I can achieve this in ES? Now, I
have been exploring the different mail threads on this forum, and it seems
that certain plugins can achieve this. Please see the list below that I
could find on this forum.

Can you please tell me how an IndexQueryParserModule will serve my use
case? If you can provide some pointers on writing a plugin that can
leverage a CustomFilter, that will be immensely helpful. Thanks,

1. Redirecting to Google Groups
2. Redirecting to Google Groups
3. Plugins: Allow to easily plug a custom DSL query/filter parsers · Issue #208 · elastic/elasticsearch · GitHub
4. http://elasticsearch-users.115913.n3.nabble.com/custom-filter-handler-plugin-td4051973.html

Thanks,
Sandeep

On Mon, Jul 7, 2014 at 2:17 AM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

Thanks for being so patient with me :)

I now understand the following: there are 50M documents in an external
DB, from which up to 1M are to be exported in the form of document
identifiers to work as a filter in ES. The idea is to use internal
mechanisms like bit sets. There is no API for manipulating filters in ES on
that level. ES receives the terms and passes them into the Lucene TermFilter
class according to the type of the filter.

What is a bit unclear to me: how is the filter set constructed? I assume
it should be a select statement on the database?

Next, if you have this large set of document identifiers selected, I do
not understand what the base query is that you want to apply the filter on.
Is there a user-given query for ES? What does such a query look like? Is it
assumed there are other documents in ES that are related somehow to the 50M
documents? An illustrative example of the steps in the scenario would
really help to understand the data model.

Just some food for thought: it is close to impossible to filter in ES on
1M unique terms in a single step. The default maximum number of clauses
in a Lucene query is, for good reason, limited to 1024. A workaround
would be to iterate over the 1M terms, execute about 1000 filter queries,
and add up the results. This takes a long time and may not be the desired
solution.
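
To illustrate the arithmetic of that workaround, here is a minimal sketch that splits a term list into batches no larger than the clause limit. TermBatcher and its method are hypothetical names, and executing one filter query per batch and merging the hits is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class TermBatcher {
    // Clause limit quoted in the message above.
    static final int MAX_CLAUSES = 1024;

    // Splits 'terms' into consecutive batches of at most 'batchSize' elements.
    static <T> List<List<T>> batches(List<T> terms, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < terms.size(); i += batchSize) {
            int end = Math.min(i + batchSize, terms.size());
            out.add(new ArrayList<>(terms.subList(i, end)));
        }
        return out;
    }
}
```

With 1,000,000 terms and batches of 1024 this yields 977 batches, which is where the figure of roughly 1000 queries comes from.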

Fortunately, in most situations, it is possible to find more concise
grouping to reduce the 1m document identifiers into fewer ones for more
efficient filtering.

Jörg

On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via
elasticsearch elasticsearch@googlegroups.com wrote:

Hi,

Appreciate your continued assistance. :) Thanks,

Disclaimer: I am yet to sufficiently understand ES sources so as to
depict my scenario completely. Some info below may be conjecture.

I would have a corpus of 50M docs (actually a lot more, but for testing
now) out of which I would have, say, up to 1M DocIds to be used as a filter.
This set of 1M docs can be different for different use cases, the point
being, up to 1M docIds can form one logical set of documents for filtering
results. If I use a simple IdsFilter from the ES Java API, I would have to
keep adding these 1M docs to the List implementation internally, and I have
a feeling it may not scale very well, as they may change per use case and
per some combinations internal to a single use case also.

As I debug the code, the IdsFilter will be converted to a Lucene filter.
Lucene filters, on the other hand, operate on a docId bitset type. That
gels very well with my requirement, since I can scale with BitSets (I
assume).

If I can find a way to directly plug this BitSet as a Lucene Filter into
the Lucene search() call, bypassing the ES filters using, I don't know,
maybe some sort of plugin, I believe that may support my cause. I assume I
may not get to use the Filter cache from ES, but probably I can cache these
BitSets for subsequent use.

Please let me know. And thanks!

Thanks,
Sandeep


Hi,

Sure. Thanks a lot for the helpful pointers. I will take a look at the
classes and create a plugin. If there are any gotchas or certain ways of
doing things in this plugin, please tell me so that I can take note.

It seems that the plugin would be small, with just the Parser/Builder and a
simple plugin that uses the IndexQueryParserModule to add a process for an
XContentFilterParser.

I have not checked the Cache classes, and will take a look there as well.

I do have a follow-up question:
Is it possible to make my filter implementation aware of the shards? I mean,
if during indexing there is a routing policy (not sure if there is something
like that) that directs documents to a particular shard based on a certain
range, function, or hash (please let me know if there is something like
that, as I need to check that implementation as well), then at query time I
would want the filter to be created only for the DocIds that correspond to
the shard where the query will execute. This does not seem like an unusual
problem.
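
As far as I can tell from the documentation, Elasticsearch does route each document by a hash of a routing value (the document id by default) modulo the number of primary shards. A rough sketch of grouping ids per shard on that basis is below. ShardGrouper and its methods are hypothetical names, and String.hashCode is only a stand-in, NOT the hash function Elasticsearch uses, so the shard numbers will not match a real cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShardGrouper {
    // shard = hash(routing) % number_of_primary_shards
    // Stand-in hash only: String.hashCode is an assumption for illustration.
    static int shardFor(String routing, int numPrimaryShards) {
        return Math.floorMod(routing.hashCode(), numPrimaryShards);
    }

    // Groups document ids by the shard they would be routed to.
    static Map<Integer, List<String>> groupByShard(List<String> ids, int numPrimaryShards) {
        Map<Integer, List<String>> byShard = new TreeMap<>();
        for (String id : ids) {
            int shard = shardFor(id, numPrimaryShards);
            byShard.computeIfAbsent(shard, k -> new ArrayList<>()).add(id);
        }
        return byShard;
    }
}
```

The idea would be to build one filter set per shard from such a grouping, instead of one set for the whole index.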

Please tell me if this is possible. Thanks again,

Thanks,
Sandeep
