Sort before filter?

David_Pfeffer · March 19, 2014, 11:45am

I have an index that contains 30 GB worth of news stories. I want to
return the stories that contain a particular name in their text, sorted
chronologically. I only want the first 100 stories.

Elasticsearch seems to approach this problem by filtering every story down
to just those that match, then sorting those results and returning the top
100. This uses a fairly large amount of resources, because every single
story has to be filtered.

Can I get Elasticsearch to instead sort first, and then filter in order
until it reaches the maximum (100)? Granted, this would be 100 per
shard, but then the final step would be to take each shard's 100, sort them
all together, and take the top 100 of that result set. This should, at
least in my mind, use significantly fewer resources, as it would only need
to go through maybe 5000 or 10000 items to find enough matches, as opposed
to the entirety of the index.

(Cross-posted from
http://stackoverflow.com/questions/22467585/sort-before-filters-in-elasticsearch,
because I didn't get an answer there for 2 days.)

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/176b2d65-25af-4416-8cc3-0b82b71ad311%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Georgi_Ivanov · March 19, 2014, 9:02pm

I think sorting first will be bad if you have more data.
Sorting is not exactly the fastest thing.
It may sound good for a small amount of data, but what if we have 10 billion
documents? Should ES go through all the documents just to sort them?

I don't think this will be good.

David_Pfeffer · March 19, 2014, 9:20pm

I guess I should have said instead, can I store it in sorted order? From
what I've been told the answer is no, so I'm not sure what other solution I
can take here other than more nodes.

Georgi_Ivanov · March 19, 2014, 10:09pm

I don't know what kind of problems you have.

You could post your mappings, number of documents, index count, server
count, server configuration (memory?), etc. here and we can try to think of
something.

30 GB doesn't sound like much for ES.

polyfractal · March 20, 2014, 1:45am

Sorry for the delay between my twitter response and my reply here.

Basically, sorting first and then performing query/filter matches is not
really a tenable solution, due to memory constraints. If you were to sort
first, you would need to sort the documents (which may be very expensive
over say 5bn docs), and then maintain that sorted order in memory so you
can perform the next query. The memory overhead is the real reason why it
won't work - maintaining that sort in memory is just not
acceptable...especially if you consider fifty or a hundred concurrent
search requests all trying to maintain the sort in memory.

It would just fall apart because there is no way you can guarantee enough
memory to satisfy the operation. With the current arrangement, the query
latency may increase as load increases, but you won't OOM when the number
of queries hits a critical point.

The way Elasticsearch executes queries is basically like this:

1. Filters are executed and "mask" the index. Only documents that match
   the set of filters will be evaluated by the query. Filter evaluation is
   extremely fast, much faster than performing a sort. Especially once the
   filter is cached, it is basically bitwise operations.
2. The query evaluates the documents that match the filter and generates
   a score.
3. This score is placed into a priority queue of size "from" +
   "size". If you request "from:0" and "size:10", each shard maintains a
   priority queue of size 10. When documents are added to the priority queue,
   the PQ will see if the score is greater than the least value in the queue.
   If it is, the value is inserted and the least value is evicted. PQs
   guarantee the top N results based on the score. So you can see that ES
   isn't really "sorting" the results, it is just generating a score and
   seeing if it is in the top N results. This is why it can scale to billions
   of docs.
4. Since you are scoring by time, the score value returned for each
   document is basically the timestamp.
5. These PQs are merged on the coordinating node.
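
The priority-queue steps described above can be sketched roughly as follows. This is only an illustration in plain Python, not Elasticsearch code: the names `shard_top_n` and `merge_shards`, the `(doc_id, timestamp, text)` tuple layout, and the substring check standing in for the filter are all assumptions made for the example, and "top" is taken here to mean the largest timestamps.

```python
import heapq

def shard_top_n(docs, term, n):
    """Return the n matching docs with the largest timestamps on one shard.

    docs is an iterable of (doc_id, timestamp, text) tuples (hypothetical layout).
    """
    heap = []  # min-heap of (timestamp, doc_id); heap[0] is the least value kept
    for doc_id, timestamp, text in docs:
        if term not in text:  # stand-in for the filter "mask"
            continue
        entry = (timestamp, doc_id)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Better than the current least value: insert it, evict the least.
            heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)

def merge_shards(shard_results, n):
    """Coordinating-node step: merge every shard's top n into a global top n."""
    return heapq.nlargest(n, (hit for hits in shard_results for hit in hits))

shard_a = [(1, 100, "alice met bob"), (2, 300, "bob spoke"), (3, 200, "carol left")]
shard_b = [(4, 400, "bob again"), (5, 50, "bob early"), (6, 500, "dave")]

top = merge_shards([shard_top_n(s, "bob", 2) for s in (shard_a, shard_b)], 2)
print(top)
```

Each shard never holds more than n entries in memory, regardless of how many documents match, which is the point being made about memory use.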

Could you post your query? We may be able to help with optimizations, or
suggest alternatives to speed it up like rescoring. What query latency are
you seeing, and what would you like it to be? What does your system load
and cluster look like?

As to your question about...we are investigating ways to change how data is
stored in segments. Currently the storage order is effectively random,
because this is the most performant way to merge segments (since you don't
need to care about order). An alternative is to merge segments in some
order, such as timestamp. This would considerably slow down merging, but
would speed up operations like time-series analysis. We're looking into
it, but nothing firm yet.
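
To show why a timestamp-ordered layout would help the original use case, here is a small hypothetical sketch in plain Python. It does not describe any existing Elasticsearch behaviour: the function name `first_n_matches_sorted` and the `(doc_id, timestamp, text)` layout are made up for the example, and it simply assumes the documents are already stored in timestamp order.

```python
def first_n_matches_sorted(docs_by_time, term, n):
    """Scan docs that are assumed to be stored in timestamp order.

    Stops as soon as n matches are found and reports how many docs
    had to be examined to get there.
    """
    hits = []
    examined = 0
    for doc_id, timestamp, text in docs_by_time:
        examined += 1
        if term in text:
            hits.append((timestamp, doc_id))
            if len(hits) == n:
                break  # early termination: later docs cannot sort earlier
    return hits, examined

# 30 hypothetical docs in timestamp order; every third one mentions "bob".
docs = [(i, 1000 + i, "bob" if i % 3 == 0 else "other") for i in range(30)]
hits, examined = first_n_matches_sorted(docs, "bob", 2)
print(hits, examined)
```

With unordered storage the same request has to look at every matching document, because any one of them might belong in the top N.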

-Z
