Sorry for the delay between my Twitter response and my reply here.
Basically, sorting first and then performing query/filter matches is not
really a tenable solution, due to memory constraints. If you were to sort
first, you would need to sort the documents (which may be very expensive
over say 5bn docs), and then maintain that sorted order in memory so you
can perform the next query. The memory overhead is the real reason why it
won't work: maintaining that sort in memory is just not acceptable,
especially if you consider fifty or a hundred concurrent search requests
all trying to maintain the sort in memory.
It would just fall apart because there is no way you can guarantee enough
memory to satisfy the operation. With the current arrangement, the query
latency may increase as load increases, but you won't OOM when the number
of queries hits a critical point.
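To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 12 bytes per entry (an 8-byte timestamp plus a 4-byte doc ID) is purely my assumption for illustration, as are the function names; real per-document overhead will differ. It compares holding a full sorted list per request against holding only a top-10 queue per request.

```python
# Back-of-the-envelope only: the bytes-per-entry figure is an assumption.
DOCS = 5_000_000_000          # "5bn docs" from the example above
BYTES_PER_ENTRY = 8 + 4       # assumed: 8-byte timestamp + 4-byte doc ID
CONCURRENT_REQUESTS = 100
TOP_N = 10

def sorted_list_bytes(docs, requests):
    """Memory if every request keeps a full sorted list of all docs."""
    return docs * BYTES_PER_ENTRY * requests

def priority_queue_bytes(top_n, requests):
    """Memory if every request keeps only a top-N priority queue."""
    return top_n * BYTES_PER_ENTRY * requests

full = sorted_list_bytes(DOCS, CONCURRENT_REQUESTS)
pq = priority_queue_bytes(TOP_N, CONCURRENT_REQUESTS)
print(f"sorted lists: {full / 1e12:.1f} TB, priority queues: {pq / 1e3:.1f} KB")
```

Under those assumptions the sorted lists come to about 6 TB, against roughly 12 KB for the queues (per shard), which is the gap I'm describing.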
The way Elasticsearch executes queries is basically like this:
- Filters are executed and "mask" the index. Only documents that match
the set of filters will be evaluated by the query. Filter evaluation is
extremely fast, much faster than performing a sort. Once the filter is
cached, it is basically just bitwise operations.
- The query evaluates the documents that match the filter and generates
a score for each one.
- This score is placed into a priority queue of size "from" + "size". If
you request "from:0" and "size:10", each shard maintains a priority queue
of size 10. When a document is offered to the priority queue, the PQ
checks whether its score is greater than the least value in the queue. If
it is, the value is inserted and the least value is evicted. PQs guarantee
the top N results based on the score. So you can see that ES isn't really
"sorting" the results, it is just generating a score and seeing if it is
in the top N results. This is why it can scale to billions of documents.
- Since you are scoring by time, the score value returned for each
document is basically the timestamp.
- These PQs are merged on the coordinating node.
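
The steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch or Lucene is actually implemented: the filter is modelled as an integer bitmask, each "shard" is a plain list of timestamps, and the names `matching_docs`, `shard_top_n` and `coordinate` are hypothetical.

```python
import heapq

def matching_docs(filter_bits, timestamps):
    """Yield (doc_id, score) for docs whose bit is set in the filter mask."""
    for doc_id, ts in enumerate(timestamps):
        if (filter_bits >> doc_id) & 1:   # cached filter: a bitwise check
            yield doc_id, ts              # scoring by time: score = timestamp

def shard_top_n(filter_bits, timestamps, n):
    """Keep only the n highest-scoring docs in a bounded min-heap."""
    heap = []                             # least score is always heap[0]
    for doc_id, score in matching_docs(filter_bits, timestamps):
        if len(heap) < n:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:          # beats the current least value
            heapq.heapreplace(heap, (score, doc_id))  # evict the least
    return heap

def coordinate(shard_heaps, n):
    """Merge per-shard queues; return the global top n, best first."""
    merged = [(score, shard, doc_id)
              for shard, heap in shard_heaps.items()
              for score, doc_id in heap]
    return heapq.nlargest(n, merged)

# Two toy shards; doc IDs are local to each shard.
shard_heaps = {
    "a": shard_top_n(0b1011, [100, 250, 50, 400], n=2),  # docs 0, 1, 3 match
    "b": shard_top_n(0b110, [300, 20, 500], n=2),        # docs 1, 2 match
}
top = coordinate(shard_heaps, n=2)
print(top)
```

Note that each shard never holds more than `n` entries, no matter how many documents pass the filter, and nothing is ever fully sorted until the small merged list on the coordinating side.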
Could you post your query? We may be able to help with optimizations, or
suggest alternatives to speed it up, such as rescoring. What query latency
are you seeing, and what would you like it to be? What do your system load
and cluster look like?
As to your question about...we are investigating ways to change how data is
stored in segments. Currently the storage order is effectively random,
because this is the most performant way to merge segments (since you don't
need to care about order). An alternative is to merge segments in some
order, such as timestamp. This would considerably slow down merging, but
would speed up operations like time-series analysis. We're looking into
it, but nothing firm yet.
On Wednesday, March 19, 2014 6:45:43 AM UTC-5, David Pfeffer wrote:
I have an index that contains 30 GB worth of news stories. I want to
return the stories that contain a particular name in their text, sorted
chronologically. I only want the first 100 stories.
ElasticSearch seems to approach this problem by filtering every story to
just those that match, then sorting those results and returning the top
100. This uses a reasonably large amount of resources to filter every
Can I get ElasticSearch to instead sort first, and then filter in order
until it reaches the maximum (100). Granted that this would be 100 per
shard, but then the final step would be to take each shard's 100, sort them
all together, and take the top 100 of that result set. This should, at
least in my mind, use significantly less resources, as it would only need
to go through maybe 5000 or 10000 items to find a match, as opposed to the
entirety of the index.
because I didn't get an answer there for 2 days.)