On Tuesday, November 4, 2014 10:58:44 AM UTC-8, Brook Miller wrote:
We are pretty comfortable with Lucene and have used it extensively, but we'd
like to move from our existing proprietary application layer to using
Elasticsearch (support for merging results across shards, replication). We
have one key challenge, which I'm doing my best to describe below. I'd love
to get some thoughts from the Elasticsearch team on what direction we
should take.
Imagine we have an index of 100M tweets. We can run a query to get a set of
document ids that contain the word "apple". Now we farm out to a bunch of
worker processes to analyze them and return document ids of tweets that have
positive sentiment towards apple. Assume we now have 1 million document
ids that we know have positive sentiment towards the word apple.
Is there a way to tag these 1,000,000 document ids so that subsequent
Elasticsearch filters / queries would be restricted to this set of
documents without re-indexing?
Alternatively, can we somehow write an add-on that lets us upload and name
this set of document ids for future queries / filters?
There may be other options that we don't yet know of, but we would love some
input on what strategy to take.
On Tuesday, November 4, 2014 11:04:28 AM UTC-8, Nikolas Everett wrote:
If you implement your "tweets that mention apple" as a filter then it can
be cached. Elasticsearch's cache is per segment, so it should stay sane as
you add more documents. That might be enough to make it fast.
The other option is to walk those 1,000,000 documents with a
scan/scroll query all at once, but that is really only good if you want to
list them or do something with them, not if you want to slice and dice them.
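
To make the filter suggestion concrete, here is a minimal sketch of a request body that puts the "mentions apple" condition in a filter clause rather than in the scored query. It assumes the Elasticsearch 1.x `filtered` query syntax that was current when this thread was written (later releases use a `bool` query with a `filter` clause instead). The field name `text` and the helper name `apple_filter_query` are made up for illustration and do not come from the thread.

```python
import json


def apple_filter_query(extra_query=None):
    """Build an Elasticsearch 1.x-style `filtered` request body.

    The "mentions apple" condition sits in the filter clause, which is the
    part Elasticsearch can cache and reuse across requests.
    """
    return {
        "query": {
            "filtered": {
                # Scored part of the request; match_all when nothing is given.
                "query": extra_query or {"match_all": {}},
                # Non-scored part; `text` is an assumed field name.
                "filter": {"term": {"text": "apple"}},
            }
        }
    }


# Slice the apple tweets further with an ordinary query on top of the filter.
body = apple_filter_query({"match": {"text": "iphone"}})
print(json.dumps(body, indent=2))
```

The body would be sent to the index's `_search` endpoint. Only the query part changes from request to request, so the filter part stays identical and is a candidate for reuse.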
Just to be clear, I want to be able to query (slice and dice) against the
set of documents defined by ID that I have computed, through a process
external to Elasticsearch, to have positive sentiment towards apple.
So for subsequent queries against the result set (1 million document ids
that we know have positive sentiment about Apple), we'd use a filter with
the 1M document ids in it (as in this)
It seems like that would work using the boolean, but it is a huge chunk of
data to send with every subsequent query and a huge workload to parse. Is
there some way to send the doc ids once, have Elasticsearch calculate the
bitset, and thereafter alias/name that bitset for slicing and dicing, as you
put it?
-Brook
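
The payload concern above can be made concrete with a rough sketch. The filter linked in the message is not preserved, so this assumes an Elasticsearch 1.x `ids` filter inside a `filtered` query. The 18-digit ids are generated placeholders, and `ids_filter_query` and `payload_bytes` are hypothetical helper names.

```python
import json


def ids_filter_query(doc_ids, query=None):
    """Build a 1.x-style request body restricted to the given document ids."""
    return {
        "query": {
            "filtered": {
                "query": query or {"match_all": {}},
                # Every id travels with every request that uses this filter.
                "filter": {"ids": {"values": list(doc_ids)}},
            }
        }
    }


def payload_bytes(body):
    """Size in bytes of the request body once serialized to JSON."""
    return len(json.dumps(body).encode("utf-8"))


# Placeholder ids, generated only to estimate the request size.
sample_ids = [str(500000000000000000 + i) for i in range(1000)]
per_thousand = payload_bytes(ids_filter_query(sample_ids))

print("1,000 ids:", per_thousand, "bytes")
# Scale the measured size up linearly rather than building the full list.
print("1,000,000 ids: roughly", per_thousand * 1000 // (1024 * 1024), "MiB")
```

With ids of this length the body comes to roughly 20 MB per request for a million ids, which is the cost being asked about: the same list would have to be serialized, sent and parsed again for every follow-up query.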
Yeah - what I was suggesting was that if your filter was actually inside of
Elasticsearch (as a plugin or whatever) then you could get Elasticsearch to
automatically cache it. If the process is external then it wouldn't work.
There isn't anything that lets you build a huge id filter by sending the
ids once and caching it forever. There is a pull request for a join filter
which could do the trick, but it stalled and has been for a while. I
honestly haven't been following it closely so I'm not sure it'd even be the
right fit. Beyond that I don't have an answer for you.
For those following along at home: you don't want to perform an update to
the document because that operation is very costly - it's basically a delete
and a reindex. The deleted documents have to be merged out, and the update
has to be merged up the segments and gets the same write amplification as a
new document.
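
The cost described above can be illustrated with a toy model. This is not how Lucene or Elasticsearch are implemented. It is only a sketch, under the assumption of immutable append-only segments, of why an update behaves like a delete plus a reindex. `ToyIndex` and all of its methods are hypothetical.

```python
class ToyIndex:
    """Toy append-only index: each segment is an immutable list of docs."""

    def __init__(self):
        self.segments = []    # each segment: list of (doc_id, source) tuples
        self.deleted = set()  # (segment_no, position) pairs marked deleted

    def index(self, doc_id, source):
        # In this toy model every write lands in a brand-new segment.
        self.segments.append([(doc_id, source)])

    def update(self, doc_id, changes):
        # Find the live copy, mark it deleted, then index a full new copy.
        for seg_no, segment in enumerate(self.segments):
            for pos, (found_id, source) in enumerate(segment):
                if found_id == doc_id and (seg_no, pos) not in self.deleted:
                    self.deleted.add((seg_no, pos))
                    self.index(doc_id, {**source, **changes})
                    return
        raise KeyError(doc_id)

    def stored_copies(self):
        # Copies still held in segments, including ones marked deleted.
        return sum(len(segment) for segment in self.segments)

    def merge(self):
        # Rewrite all live docs into one segment, dropping deleted copies.
        live = [
            entry
            for seg_no, segment in enumerate(self.segments)
            for pos, entry in enumerate(segment)
            if (seg_no, pos) not in self.deleted
        ]
        self.segments = [live]
        self.deleted.clear()


idx = ToyIndex()
idx.index("tweet-1", {"text": "love my apple"})
idx.update("tweet-1", {"sentiment": "positive"})
print(idx.stored_copies(), "copies stored,", len(idx.deleted), "marked deleted")
idx.merge()
print(idx.stored_copies(), "copy stored after merge")
```

In this model, tagging one field on a document writes the whole document again and leaves the old copy behind until a merge removes it. Doing that for a million documents means a million full rewrites plus the merge work that follows.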