Hi,
A little clarification:
Assume a sample data set of 50M documents. The documents need to be filtered
by a field, Field1. However, at indexing time, this field is NOT written to
the document in Lucene through ES. Field1 is a frequently changing field
and hence, we would like to maintain it outside the index.
(The following paragraph can be skipped.)
Now assume that there are a few such fields, Field1, ..., FieldN. For every
document in the corpus, the value for Field1 comes from a pool of 100-odd
values. Thus, for example, at most, one of the 100-odd values of Field1 may
correspond to 1M documents, while at the tail end a value may correspond to
only 10 documents or so.
(Continue reading)
I would, at system startup time, make sure that I have loaded all relevant
BitSets that I plan to use for any Filters in memory, so that my cache
framework is warm and I can lookup the relevant filter values for a
particular query from this cache at query run time. The mechanisms for this
loading are still unknown, but please assume that this BitSet will be
available readily during query time.
This BitSet will correspond to the DocIDs in Lucene for a particular value
of Field1 that I want to filter on. I plan to write a custom Lucene Filter
subclass that will accept this DocIdSet.
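A minimal sketch of the warm-up step described above, written in Python purely for illustration (the real thing would use Lucene's Java classes). It assumes the external store can be read as (doc_id, field_value) pairs; the names `build_bitset_cache`, `matches` and `rows` are hypothetical, and each bitset is represented as a plain Python int.

```python
from collections import defaultdict


def build_bitset_cache(rows):
    """Group (doc_id, field_value) pairs into one bitset per field value.

    Each bitset is a Python int: bit i is set when document i carries
    that value. The shape of `rows` is an assumption about the external
    store, not something ES or Lucene provides.
    """
    cache = defaultdict(int)
    for doc_id, value in rows:
        cache[value] |= 1 << doc_id
    return dict(cache)


def matches(bitset, doc_id):
    """Return True when doc_id is set in the bitset."""
    return (bitset >> doc_id) & 1 == 1


# Toy data standing in for the external store (hypothetical values).
rows = [(0, "red"), (3, "red"), (1, "blue"), (4, "blue"), (5, "red")]
cache = build_bitset_cache(rows)
```

At query time the lookup is then just `cache["red"]` followed by a per-document bit test, which is roughly what a filter accepting a DocIdSet would do internally.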
What I am unable to understand is how I can achieve this in ES. I have
been exploring the different mail threads on this forum, and it seems that
certain plugins can achieve this. Please see the list below of what I could
find on this forum.
Can you please tell me how an IndexQueryParserModule would serve my use
case? If you can provide some pointers on writing a plugin that can
leverage a CustomFilter, that would be immensely helpful. Thanks,
1. https://groups.google.com/forum/?fromgroups=#!searchin/elasticsearch/IndexQueryParserModule$20Plugin/elasticsearch/5Gqxx3UvN2s/FL4Lb2RxQt0J
2. Redirecting to Google Groups
3. Plugins: Allow to easily plug a custom DSL query/filter parsers · Issue #208 · elastic/elasticsearch · GitHub
4. http://elasticsearch-users.115913.n3.nabble.com/custom-filter-handler-plugin-td4051973.html
Thanks,
Sandeep
On Mon, Jul 7, 2014 at 2:17 AM, joergprante@gmail.com <joergprante@gmail.com> wrote:
Thanks for being so patient with me.
I understand now the following: there are 50m documents in an external
DB, from which up to 1m are to be exported in the form of document identifiers
to work as a filter in ES. The idea is to use internal mechanisms like bit
sets. There is no API for manipulating filters in ES at that level; ES
receives the terms and passes them to the Lucene TermFilter class according
to the type of the filter.
What is a bit unclear to me: how is the filter set constructed? I assume
it would be a select statement on the database?
Next, once you have this large set of document identifiers selected, I do
not understand what base query you want to apply the filter to. Is
there a user-given query for ES? What does such a query look like? Is it
assumed there are other documents in ES that are related somehow to the 50m
documents? An illustrative example of the steps in the scenario would
really help to understand the data model.
Just some food for thought: it is close to impossible to filter in ES on
1m unique terms in a single step. The default maximum number of clauses
in a Lucene query is, for good reason, limited to 1024. A workaround
would be to iterate over the 1m terms, execute roughly 1000 filter queries,
and add up the results. This takes a long time and may not be the desired
solution. Fortunately, in most situations, it is possible to find a more
concise grouping that reduces the 1m document identifiers to fewer ones for
more efficient filtering.
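The workaround above can be sketched roughly as follows. This is only an illustration in Python: it splits a large term list into batches no larger than the 1024 limit mentioned above and builds one ES 1.x style `filtered` query body with a `terms` filter per batch. The field name `doc_key`, the helper names and the `match_all` base query are assumptions; sending the requests and adding up the hits is left out.

```python
MAX_CLAUSES = 1024  # the default clause limit mentioned above


def chunk_terms(terms, size=MAX_CLAUSES):
    """Yield consecutive slices of `terms`, each at most `size` long."""
    for start in range(0, len(terms), size):
        yield terms[start:start + size]


def terms_filter_bodies(field, terms, size=MAX_CLAUSES):
    """Build one search body per batch of terms.

    `field` is a hypothetical indexed field holding the identifier;
    match_all stands in for whatever the real base query is.
    """
    return [
        {
            "query": {
                "filtered": {
                    "query": {"match_all": {}},
                    "filter": {"terms": {field: chunk}},
                }
            }
        }
        for chunk in chunk_terms(terms, size)
    ]


# One million made-up identifiers, as in the scenario discussed here.
terms = ["id-%d" % i for i in range(1000000)]
bodies = terms_filter_bodies("doc_key", terms)
print(len(bodies))  # number of separate requests needed
```

Each body would have to be executed separately and the results merged on the client side, which is why this approach is slow for sets of this size.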
Jörg
On Sun, Jul 6, 2014 at 9:39 PM, 'Sandeep Ramesh Khanzode' via
elasticsearch elasticsearch@googlegroups.com wrote:
Hi,
Appreciate your continued assistance. Thanks,
Disclaimer: I am yet to sufficiently understand the ES sources so as to
depict my scenario completely. Some info below may be conjecture.
I would have a corpus of 50M docs (actually a lot more, but for testing
now) out of which I would have, say, up to 1M DocIds to be used as a filter.
This set of 1M docs can be different for different use cases, the point
being, up to 1M docIds can form one logical set of documents for filtering
results. If I use a simple IdsFilter from the ES Java API, I would have to keep
adding these 1M docs to the List implementation internally, and I have a
feeling it may not scale very well, as they may change per use case and per
some combinations internal to a single use case also.
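To make the scaling concern concrete, here is a rough Python sketch of what an `ids` filter request body looks like in the ES 1.x query DSL when every identifier is sent with the query. The helper name `ids_filter_body`, the type name `doc` and the `match_all` base query are assumptions for illustration only.

```python
import json


def ids_filter_body(doc_ids, doc_type=None):
    """Build a filtered query whose filter lists every document id.

    `doc_type` is optional; "doc" below is a made-up type name.
    """
    ids = {"values": [str(d) for d in doc_ids]}
    if doc_type is not None:
        ids["type"] = doc_type
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"ids": ids},
            }
        }
    }


small = ids_filter_body(range(3), doc_type="doc")
large = ids_filter_body(range(1000000))

# Size of the serialized request when 1M ids are sent with the query.
payload_bytes = len(json.dumps(large))
print(payload_bytes)
```

The serialized body grows linearly with the number of ids, so a 1M-id filter means several megabytes per request before any searching happens.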
As I debug the code, the IdsFilter will be converted to a Lucene filter.
Lucene filters, on the other hand, operate on a docId bitset type. That
gels very well with my requirement, since I can scale with BitSets (I
assume).
If I can find a way to directly plug this BitSet as a Lucene Filter into
the Lucene search() call, bypassing the ES filters using, I don't know, maybe
some sort of plugin, I believe that may support my cause. I assume I may
not get to use the Filter cache from ES, but I can probably cache these
BitSets for subsequent use.
Please let me know. And thanks!
Thanks,
Sandeep
On Saturday, 5 July 2014 01:40:55 UTC+5:30, Jörg Prante wrote:
What I understand is that a TermsFilter is required
(reference/current/query-dsl-terms-filter.html)
and the source of the terms is a DB. That is no problem. The plan is:
fetch the terms from the DB, build the query (either Java API or JSON) and
execute it.
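The plan above could look roughly like this. It is a sketch only, in Python, with an in-memory SQLite table standing in for the external DB; the table `doc_fields`, the columns `doc_key` and `field1`, the indexed field name and the sample `match` query are all assumptions. The resulting body uses the ES 1.x `filtered` query with a `terms` filter and would be sent to the search endpoint by whatever client is in use.

```python
import json
import sqlite3


def fetch_terms(conn, field1_value):
    """Select the identifiers of all documents with the given Field1 value."""
    cur = conn.execute(
        "SELECT doc_key FROM doc_fields WHERE field1 = ? ORDER BY doc_key",
        (field1_value,),
    )
    return [row[0] for row in cur]


def build_search_body(user_query, field, terms):
    """Wrap the user's query in a filtered query with a terms filter."""
    return {
        "query": {
            "filtered": {
                "query": user_query,
                "filter": {"terms": {field: terms}},
            }
        }
    }


# Stand-in for the external DB (hypothetical schema and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_fields (doc_key TEXT, field1 TEXT)")
conn.executemany(
    "INSERT INTO doc_fields VALUES (?, ?)",
    [("a1", "x"), ("a2", "y"), ("a3", "x")],
)

terms = fetch_terms(conn, "x")
body = build_search_body({"match": {"title": "lucene"}}, "doc_key", terms)
print(json.dumps(body, indent=2))
```

This only works when the identifier used in the DB is also indexed as a field in ES, since the terms filter matches on indexed values.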
What I don't understand is the part with the "quick mapping", Lucene,
and the doc ids. Lucene doc IDs are not reliable and are not exposed by
Elasticsearch. Elasticsearch uses its own document identifiers, which are
stable and augmented with info about the index type they belong to, in
order to make them unique. But I do not understand why this is important in
this context.
The Elasticsearch API uses query builders and filter builders to build
search requests. A "quick mapping" is just fetching the terms from the DB
as a string array before this API is called.
I also do not understand the role of the number "1M", is this the number
of fields, or the number of terms? Is it a total number or a number per
query?
Did I misunderstand anything more? I am not really sure what the
challenge is...
Jörg
On Fri, Jul 4, 2014 at 8:55 PM, 'Sandeep Ramesh Khanzode' via
elasticsearch elasti...@googlegroups.com wrote:
Hi,
Just to give some background. I will have a large-ish corpus of more
than 100M documents indexed. The filters that I want to apply will be on a
field that is not indexed. I mean, I prefer to not have them indexed in
ES/Lucene since they will be frequently changing. So, for that, I will be
maintaining them elsewhere, like a DB etc.
Every time I have a query, I would want to filter the results by those
fields that are not indexed in Lucene. And I am guessing that number may
well be more than 1M. In that case, I think, since we will maintain some
sort of TermsFilter, it may not scale linearly. What I would want to do,
preferably, is to have a hook inside the ES query, so that I can, at query
time, inject the required filter values. Since the filter values have to be
recognized by Lucene, and I will not be indexing them, I will need to do
some quick mapping to get those fields and map them quickly to some field
in Lucene that I can save in the filter. I am not sure whether we can
access and set Lucene DocIDs in the filter or whether they are even exposed
in ES.
Please assist with this query. Thanks,
Thanks,
Sandeep
On Thursday, 3 July 2014 21:33:45 UTC+5:30, Jörg Prante wrote:
Maybe I do not fully understand, but in a client, can't you fetch the
required filter terms from any external source before the JSON query is
constructed?
Can you give an example of what you want to achieve?
Jörg
On Thu, Jul 3, 2014 at 3:34 PM, 'Sandeep Ramesh Khanzode' via
elasticsearch elasti...@googlegroups.com wrote:
Hi All,
I am new to ES and I have the following requirement:
I need to specify a list of strings as a filter that applies to a
specific field in the document. Like what a filter does, but instead of
sending them in the query, I would like them to be populated from an
external source, like a DB or something. Can you please guide me to the
relevant examples or references to achieve this on v1.1.2?
Thanks,
Sandeep
--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/0093d97d-0f47-48e9-ba19-85b0850eda89%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.