Is there a way to issue a query to Elasticsearch using the sort parameter
that limits the number of results that are sorted either per shard or
during the merge phase? I don't want to be able to accidentally load all
the documents into memory but I'm ok with returning less accurate results.
Indeed, Elasticsearch builds priority queues of size from+size on each
shard in order to find the top hits, which are then merged to get the
collection-wide top hits. My understanding is that you would like to be
able to configure an upper limit for the size of this priority queue, is
that correct? I think this would be a great addition!
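In plain Java, that strategy looks roughly like this (a sketch only, not
the actual ES internals; all names here are illustrative):

    import java.util.List;
    import java.util.PriorityQueue;

    // Sketch of per-shard top-hits collection and the merge step: each
    // shard keeps a min-heap of its best from+size hits, and the
    // coordinating node merges the per-shard heaps the same way.
    final class TopHitsSketch {
        // Keep only the best (from + size) scores from one shard's matches.
        static PriorityQueue<Double> shardTopHits(double[] scores, int fromPlusSize) {
            PriorityQueue<Double> heap = new PriorityQueue<>(); // head = worst kept hit
            for (double score : scores) {
                if (heap.size() < fromPlusSize) {
                    heap.add(score);
                } else if (score > heap.peek()) {
                    heap.poll(); // evict the worst
                    heap.add(score);
                }
            }
            return heap;
        }

        // Merge the per-shard heaps into the collection-wide top hits.
        static PriorityQueue<Double> merge(List<PriorityQueue<Double>> shards, int fromPlusSize) {
            PriorityQueue<Double> merged = new PriorityQueue<>();
            for (PriorityQueue<Double> shard : shards) {
                for (double score : shard) {
                    if (merged.size() < fromPlusSize) {
                        merged.add(score);
                    } else if (score > merged.peek()) {
                        merged.poll();
                        merged.add(score);
                    }
                }
            }
            return merged; // iterate and sort descending for the final response
        }
    }

The memory held on each shard is bounded by from+size, which is exactly
why an unbounded from or size can get expensive.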
Yeah, sorry, I was tired when I wrote that. I've seen the window that is
sent to each shard, and now that I think about it, being able to set my own
max value for that might be neat. It wasn't what I was thinking about at
the time, though.
I just happened upon something that needs the sort parameter, figured I
should look it up, and saw that the field is loaded into memory. My
concern was that it'd be possible (but useless) to construct a query that
matches all documents and then ask Elasticsearch to sort them all, in
effect pulling that particular field into memory. So my question was: is
there a way to limit the number of documents that need that field pulled
into memory?
Suppose I have a million documents per shard and the field I'm sorting on
takes an average of a hundred bytes; that means I'm having to slurp 100MB
of stuff into memory (1,000,000 documents x 100 bytes is about 100MB,
or 1/300th of a ~30GB heap) on the node just for one shard, and that isn't
quick. In my case I'd prefer to just sort the first ten thousand documents
and warn the user that the sorting wasn't wholly accurate. I suppose I
could execute a count first and refuse to do the search at all if it comes
back too high, but that seems less pleasant. I have the same feeling about
faceting as well. And, yeah, I'm not being clear about what "the first"
really means because I haven't really thought that part through.
I did poke around the implementation and I saw that it loads the terms into
memory for each segment. I didn't see where it unpins the loaded terms,
though. Does it unpin them when it is done with the segment?
Sorry for the rambling email, I guess I'm still tired.
I added an external post-query sorting facility that works very well,
especially for a scan query, which cannot be sorted by ES. For bounded but
accurate sorts, I implement a class that uses a TreeSort: when each search
hit is added, I build the sort keys and then add the document (which
implements Comparable, of course) into the TreeSort, but with a couple of
optimizations:

1. If the document is greater than the last entry in the TreeSort, it is
   discarded. One compare, and it's gone without touching anything.
2. Otherwise, I add it to the TreeSort and then remove the last element of
   the TreeSort.
This is, I'm told, basically how Oracle handles sorting: for instance,
with several thousand matching records and the actual response limited to
only 200 records, it still sorts accurately, so that the 200 returned
represent the top 200 of all of the several thousand records.
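Roughly, this bounded top-N idea looks like the following sketch (using
java.util.TreeSet here; my actual class differs, and the names are
illustrative):

    import java.util.TreeSet;

    // Sketch of a bounded, always-sorted top-N collector in the spirit
    // of the description above. Note that TreeSet drops elements that
    // compare equal, so a real implementation needs a tie-breaker
    // (e.g. the document id) in compareTo.
    final class BoundedTopN<T extends Comparable<T>> {
        private final TreeSet<T> tree = new TreeSet<>();
        private final int limit;

        BoundedTopN(int limit) {
            this.limit = limit;
        }

        void add(T hit) {
            if (tree.size() < limit) {
                tree.add(hit);
                return;
            }
            // One compare: if the hit sorts at or after the current
            // worst entry, discard it without touching anything else.
            if (hit.compareTo(tree.last()) >= 0) {
                return;
            }
            if (tree.add(hit)) {  // false for duplicates: TreeSet keeps one copy
                tree.pollLast();  // evict the now-worst entry
            }
        }

        Iterable<T> results() {
            return tree; // ascending order: the accurate top N
        }
    }

Feeding every scan-query hit through add() keeps the working set at limit
entries while still producing an accurately sorted top N.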
For my ES-based solution, I implement a server in front of ES that handles
all my business rules. It also handles my post-query sorting and
combinatorial facet hierarchies, but it's close to the ES server (same
machine, typically) so there is little network overhead. Then the true
clients can be far-flung but they don't see the intermediate overhead.
Anyway, this took a bit to write, but it wasn't too bad. And the TreeSort
is really fast: I can issue a scan query for all 56,000 city documents,
sort them by name, and return the top 10 matches accurately sorted across
all 56,000. It's surprisingly quick and doesn't consume any extra
overhead: I'm not asking ES or Lucene to try to sort the entire mess at
once, and the TreeSort keeps my working set in the JVM as tiny as my final
response limit.
OK, I see what you mean. For reference, if the field is 100 bytes and there
are 1M documents, this will not necessarily require 100MB, as values are
deduplicated in field data. Additionally, there is an FST-based field data
implementation that can help save even more memory when values share
prefixes and/or suffixes.
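To illustrate why deduplication helps (this is not ES's actual field data
classes, just the general shape):

    // Illustrative only: field data stores each unique value once plus
    // a small per-document ordinal, so a million documents over a few
    // unique values stay cheap, and sort comparisons only touch ordinals.
    final class OrdinalFieldDataSketch {
        public static void main(String[] args) {
            String[] uniqueValues = { "Berlin", "London", "Paris" }; // stored once, sorted
            int[] ordinalByDoc = new int[1_000_000];                 // ~4 bytes per document
            ordinalByDoc[42] = 2;                                    // doc 42's value is "Paris"
            System.out.println(uniqueValues[ordinalByDoc[42]]);
        }
    }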
I think your idea is doable; we could trade accuracy for memory in the case
of sorting. We actually have something similar for faceting: it is possible
to only load into memory values that have high-enough frequencies. I guess
there could be something similar for sorting, by only loading into memory
the first n bytes of the field that is used for sorting (potentially with
a few variants, e.g. being able to load the exact terms for the
lowest-sorting values for better accuracy).
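As a hypothetical illustration of the first-n-bytes idea (the names are
mine, not an existing ES option):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    // Sort on a truncated prefix of each term, trading accuracy for
    // memory: terms that tie on the prefix may be ordered inaccurately
    // relative to each other.
    final class PrefixSortKeySketch {
        static byte[] sortKey(String term, int n) {
            byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
            return bytes.length <= n ? bytes : Arrays.copyOf(bytes, n);
        }

        public static void main(String[] args) {
            // With n = 4, "tokyo" and "tokyo-west" share the key "toky"
            // and may sort in either order relative to each other.
            System.out.println(Arrays.toString(sortKey("tokyo-west", 4)));
        }
    }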
> I did poke around the implementation and I saw that it loads the terms
> into memory for each segment. I didn't see where it unpins the loaded
> terms, though. Does it unpin them when it is done with the segment?
If you look into IndexFieldDataCache, there is a call to
SegmentReaderUtils.registerCoreListener. This will cause field data to be
unloaded when the segment is closed.
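The general pattern is roughly the following (illustrative only; not the
actual ES/Lucene listener API):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of the core-listener pattern: field data is cached per
    // segment core, and a listener registered at load time evicts the
    // entry when that core is closed.
    final class FieldDataCacheSketch {
        interface SegmentCore {
            Object cacheKey();
            void addClosedListener(Runnable listener); // fires when the core closes
        }

        private final Map<Object, long[]> cache = new ConcurrentHashMap<>();

        long[] getOrLoad(SegmentCore core) {
            return cache.computeIfAbsent(core.cacheKey(), key -> {
                core.addClosedListener(() -> cache.remove(key)); // unpin on close
                return loadFieldData(core);
            });
        }

        private long[] loadFieldData(SegmentCore core) {
            return new long[0]; // stand-in for the real per-segment load
        }
    }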