Elasticsearch Development: Subsets of Documents

We are pretty comfortable with Lucene and have used it extensively, but we'd
like to move from our existing proprietary application layer to
Elasticsearch (for its support for merging results across shards and for
replication). We have one key challenge, which I've done my best to describe
below. I'd love to get some thoughts from the Elasticsearch team on what
direction we should take.

Imagine we have an index of 100M tweets. We can run a query to get a set of
document ids that contain the word "apple". Now we farm out to a bunch of
worker processes to analyze and return document ids of tweets that have
positive sentiment towards apple. Assume we now have 1 million document
ids that we know have positive sentiment towards the word apple.

Is there a way to tag these 1,000,000 document ids so that subsequent
Elasticsearch filters / queries would be restricted to this set of
documents without re-indexing?

Alternatively, can we somehow write an add-on that lets us upload and name
this set of document ids for future queries / filters?

There may be other options that we don't yet know of, but we would love some
input on what strategy to take.

Thanks,
Brook

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/252707f4-e525-48e7-8928-72b4f1db26fb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

If you implement your "tweets that mention apple" as a filter then it can
be cached. Elasticsearch's cache is per segment so it should stay sane as
you add more documents. That might be enough to make that fast.
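A minimal sketch of what such a filter request could look like, assuming the 1.x-era query DSL that was current when this thread was written (a `filtered` query wrapping a `term` filter). The field name `text` and the helper name are hypothetical, and the `_cache` flag is the per-filter option from that era; later versions dropped it and decide caching on their own. The code only builds the request body and does not talk to a cluster.

```python
import json


def cached_term_filter_query(field, term, query=None):
    """Build a 1.x-style search body that restricts `query` to docs matching a term filter.

    `field` and `term` are whatever the index actually uses; "text" below is
    only an assumed field name for the tweet body.
    """
    # Fall back to match_all when the caller only wants the filtered set.
    inner_query = query if query is not None else {"match_all": {}}
    return {
        "query": {
            "filtered": {
                "query": inner_query,
                # Term filters are cacheable; `_cache` makes the intent explicit
                # in the 1.x DSL (assumption: not valid on newer versions).
                "filter": {"term": {field: term, "_cache": True}},
            }
        }
    }


body = cached_term_filter_query("text", "apple", {"match": {"text": "battery"}})
print(json.dumps(body, indent=2))
```

Because the filter is evaluated inside Elasticsearch, the cached result can be reused by later requests that send the same filter, which is the point of the suggestion above.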

The other option is to walk those 1,000,000 documents with a
scan/scroll query all at once, but that is really only good if you want to
list them or do something with them, not if you want to slice and dice them.

Nik

Thanks for the quick response.

Just to be clear, I want to be able to query (slice and dice) against the
set of documents, defined by ID, that a process external to Elasticsearch
has determined to have positive sentiment towards apple.

So for subsequent queries against the result set (1 million document ids
that we know have positive sentiment about Apple), we'd use a filter with
the 1M document ids in it (as in this)

It seems like that would work using the boolean, but it is a huge chunk of
data to send for all the subsequent queries and a huge workload to parse. Is
there some way to send the doc ids once, have Elasticsearch calculate the
bitset, and thereafter alias/name that bitset for slicing and dicing, as you
put it?
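To make the size concern concrete, here is a rough sketch that builds such a request body and measures it. It assumes the 1.x-era `filtered` query with an `ids` filter; the helper name, the `text` field, and the numeric ids are all made up for illustration, and real ids may be longer. Nothing is sent to a cluster.

```python
import json


def ids_filtered_query(doc_ids, query):
    """Wrap `query` so it only matches documents whose _id is in `doc_ids`."""
    return {
        "query": {
            "filtered": {
                "query": query,
                # Every id has to travel with every request.
                "filter": {"ids": {"values": list(doc_ids)}},
            }
        }
    }


# Hypothetical ids standing in for the externally computed sentiment set.
doc_ids = [str(n) for n in range(1_000_000)]

# Hypothetical follow-up query against the "positive about apple" subset.
body = ids_filtered_query(doc_ids, {"match": {"text": "iphone"}})

# Size of the serialized JSON body that would be posted on each search.
payload_bytes = len(json.dumps(body).encode("utf-8"))
print(payload_bytes)
```

Even with short numeric ids the serialized body comes to several megabytes, and it would have to be sent and parsed again for each subsequent query.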

-Brook

Yeah - what I was suggesting was that if your filter was actually inside of
Elasticsearch (as a plugin or whatever) then you could get Elasticsearch to
automatically cache it. If the process is external then it wouldn't work.

There isn't anything that lets you build a huge id filter by sending the
ids once and caching it forever. There is a pull request for a join filter
which could do the trick, but it stalled and has been for a while. I
honestly haven't been following it closely so I'm not sure it'd even be the
right fit. Beyond that I don't have an answer for you.

For those following along at home: you don't want to perform an update to
the document because that operation is very costly. It's basically a delete
and a reindex. The deleted documents have to be merged out, and the update
has to be merged up the segments and gets the same write amplification as a
new document.

Nik

Thanks!
