Elasticsearch Development: Subsets of Documents

Brook_Miller · November 4, 2014, 6:58pm

We are pretty comfortable with Lucene and have used it extensively, but we'd
like to move from our existing proprietary application layer to
Elasticsearch (for its support for merging results across shards and for
replication). We have one key challenge, which I'm doing my best to describe
below. I'd love to get some thoughts from the Elasticsearch team on what
direction we should take.

Imagine we have an index of 100M tweets. We can run a query to get a set of
document ids that contain the word "apple". Now we farm out to a bunch of
worker processes to analyze and return document ids of tweets that have
positive sentiment towards apple. Assume we now have 1 million document
ids that we know have positive sentiment towards the word apple.

Is there a way to tag these 1,000,000 document ids so that subsequent
Elasticsearch filters / queries would be restricted to this set of
documents without re-indexing?

Alternatively, can we somehow write an add-on that lets us upload and name
this set of document ids for future queries / filters?

There may be other options that we don't yet know about, but we would love
some input on what strategy to take.

Thanks,
Brook

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/252707f4-e525-48e7-8928-72b4f1db26fb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

nik9000 · November 4, 2014, 7:04pm

If you implement your "tweets that mention apple" as a filter then it can
be cached. Elasticsearch's cache is per segment so it should stay sane as
you add more documents. That might be enough to make that fast.
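
A minimal sketch of what such a request body might look like, assuming the 1.x-era `filtered` query with a `term` filter and its `_cache` flag. The field name `text`, the example `match` query, and the helper `apple_filter_query` are hypothetical, and later Elasticsearch versions use a different syntax for filters.

```python
import json


def apple_filter_query(user_query, field="text", term="apple"):
    """Wrap a query in a 1.x-style `filtered` query with a cacheable term filter.

    The field name and term are illustrative assumptions, not taken from the thread.
    """
    return {
        "query": {
            "filtered": {
                # The scoring part of the request (the "slice and dice" query).
                "query": user_query,
                # The reusable part: a term filter that Elasticsearch may cache.
                "filter": {"term": {field: term, "_cache": True}},
            }
        }
    }


body = apple_filter_query({"match": {"text": "iphone"}})
print(json.dumps(body, indent=2))
```

The point of keeping the "mentions apple" condition in the filter clause is that the filter is the same for every follow-up request, so only the query part changes.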

The other option is to walk those 1,000,000 documents with a scan/scroll
query all at once, but that is really only good if you want to list them or
do something with them, not if you want to slice and dice them.

Nik


Brook_Miller · November 4, 2014, 8:00pm

Thanks for the quick response.

Just to be clear, I want to be able to query (slice and dice) against the
set of documents, defined by id, that a process external to Elasticsearch
has determined to have positive sentiment towards apple.

So for subsequent queries against the result set (1 million document ids
that we know have positive sentiment about Apple), we'd use a filter with
the 1M document ids in it (as in this).

It seems like that would work using the boolean, but it is a huge chunk of
data to send for all the subsequent queries and a huge workload to parse. Is
there some way to send the doc ids once, have Elasticsearch calculate the
bitset, and thereafter alias/name that bitset for slicing and dicing, as you
put it?
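
To make the size concern concrete, here is a rough sketch that builds one such request body and measures it. It assumes the 1.x-era `ids` filter inside a `filtered` query as one way to express "these document ids"; the helper `ids_filter_body` and the 18-digit ids are made up for illustration.

```python
import json


def ids_filter_body(doc_ids, user_query=None):
    """Build a search body restricted to the given document ids (1.x-style syntax)."""
    return {
        "query": {
            "filtered": {
                "query": user_query or {"match_all": {}},
                "filter": {"ids": {"values": list(doc_ids)}},
            }
        }
    }


# Hypothetical ids: 18-digit numeric strings, roughly the size of a tweet id.
doc_ids = [str(10**17 + i) for i in range(1_000_000)]

# Serialize exactly what would have to be sent with every follow-up query.
payload = json.dumps(ids_filter_body(doc_ids)).encode("utf-8")
print(f"{len(payload) / 1e6:.1f} MB per request")
```

With ids of this length the body comes to roughly 20 MB, and it would have to be sent and parsed again for each subsequent query.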

-Brook


nik9000 · November 4, 2014, 8:25pm

Yeah - what I was suggesting was that if your filter was actually inside of
Elasticsearch (as a plugin or whatever) then you could get Elasticsearch to
automatically cache it. If the process is external then it wouldn't work.

There isn't anything that lets you build a huge id filter by sending the
ids once and caching it forever. There is a stalled pull request for a
join filter which could do the trick:

github.com/elastic/elasticsearch: Terms Lookup by Query/Filter (aka. Join Filter)
elastic:master ← mattweber:terms_lookup_by_query, opened 05:20PM - 01 Jul 13 UTC by mattweber (+4543 -187)

This PR adds support for generating a terms filter based on the field values
of documents matching a specified lookup query/filter. The value of the
configurable "path" field is collected from the field data cache for each
document matching the lookup query/filter and is then used to filter the
main query. This can also be called a join filter.

This PR abstracts the TermsLookup functionality in order to support multiple
lookup methods. The existing functionality is moved into FieldTermsLookup
and the new query based lookup is in QueryTermsLookup. All existing caching
functionality works with the new query based lookup for increased
performance.

During testing I found that one of the performance bottlenecks was
generating the Lucene TermsFilter on large sets of terms (probably since it
sorts the terms). I have created a FieldDataTermsFilter that uses the field
data cache to look up the value of the field being filtered and compare it to
the set of gathered terms. This significantly increased performance at the
cost of higher memory usage. Currently a TermsFilter is used when the number
of filtering terms is less than 1024 and the FieldDataTermsFilter is used
for everything else. This should eventually be configurable or we need to
perform some tests to find the optimal value.

Examples:

Replicate a has_child query by joining on the child's "pid" field to the
parent's "id" field for each child that has the tag "something".

```
curl -XPOST 'http://localhost:9200/parentIndex/_search' -d '{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "id": {
            "index": "childIndex",
            "type": "childType",
            "path": "pid",
            "query": {
              "term": { "tag": "something" }
            }
          }
        }
      }
    }
  }
}'
```

Lookup companies that offer products or services mentioning elasticsearch.
Notice that products and services are kept in their own indices.

```
curl -XPOST 'http://localhost:9200/companies/_search' -d '{
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "company_id": {
            "indices": ["products", "services"],
            "path": "company_id",
            "filter": {
              "term": { "description": "elasticsearch" }
            }
          }
        }
      }
    }
  }
}'
```

but it stalled and has been for a while. I honestly haven't been following
it closely, so I'm not sure it'd even be the right fit. Beyond that I don't
have an answer for you.
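
For illustration only, this is roughly how the request shape proposed in that pull request could be applied to the "named set of tweets" case. The syntax is the unmerged proposal from the PR description, not something a released Elasticsearch is known to accept, and the index `sentiment_sets`, type `member`, and fields `tweet_id` and `set_name` are all hypothetical.

```python
import json


def join_filter_body(lookup_index, lookup_type, path, lookup_query, target_field):
    """Build a body in the shape proposed by the stalled join-filter PR.

    Terms are gathered from `path` in documents of the lookup index that match
    `lookup_query`, then used to filter `target_field` in the searched index.
    """
    return {
        "query": {
            "constant_score": {
                "filter": {
                    "terms": {
                        target_field: {
                            "index": lookup_index,
                            "type": lookup_type,
                            "path": path,
                            "query": lookup_query,
                        }
                    }
                }
            }
        }
    }


# Hypothetical layout: one small document per (set_name, tweet_id) pair,
# written once by the external sentiment workers.
body = join_filter_body(
    lookup_index="sentiment_sets",
    lookup_type="member",
    path="tweet_id",
    lookup_query={"term": {"set_name": "apple_positive"}},
    target_field="id",
)
print(json.dumps(body, indent=2))
```

Under that assumption the million ids would be stored once in a side index and referred to by name, instead of being shipped with every request.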

For those following along at home: you don't want to perform an update to
the document because that operation is very costly - it's basically a delete
and a reindex. The deleted documents have to be merged out, and the update
has to be merged up the segments and gets the same write amplification as a
new document.
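
A toy model of why tagging by update is expensive. This is not Lucene's actual data structure, just an append-only list with hypothetical names (`ToyIndex`, `index`, `update`) that mimics the "delete plus reindex" behaviour described above.

```python
class ToyIndex:
    """Toy append-only index: a document is never modified in place."""

    def __init__(self):
        self.docs = []        # every version ever written, in write order
        self.live = {}        # doc id -> position of the current version
        self.deleted = set()  # positions of superseded versions awaiting merge

    def index(self, doc_id, source):
        # Writing an existing id marks the old version deleted, then appends.
        if doc_id in self.live:
            self.deleted.add(self.live[doc_id])
        self.docs.append((doc_id, source))
        self.live[doc_id] = len(self.docs) - 1

    def update(self, doc_id, **fields):
        # An "update" is a read of the old source plus a full re-index.
        _, old = self.docs[self.live[doc_id]]
        self.index(doc_id, {**old, **fields})


idx = ToyIndex()
for i in range(5):
    idx.index(i, {"text": f"tweet {i} about apple"})
for i in (1, 3):
    idx.update(i, sentiment="positive")

print(len(idx.docs), len(idx.deleted))  # prints "7 2"
```

Tagging two of five documents leaves seven stored versions and two dead ones to merge away; tagging a million tweets would scale the same way.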

Nik


Brook_Miller · November 4, 2014, 8:45pm

Thanks!
