When you sort on a field, Elasticsearch loads the values for that field
into memory, but it doesn't load the values just for the 10,000 matching
results. It loads that field for ALL documents in your index. The logic is:
you may need these 10,000 now, but you'll probably need a different 10,000
on another request.
So actually, rerunning the query isn't as costly as you think. ES has
already done the hard work for all your docs anyway.
Sorting large numbers of docs is expensive. Let's say you have 5 shards,
and you want the top 10 records. Each shard has to return the timestamps
for its own 10 best results. So the receiving node gets 50 results, and
then sorts them into the final list of the overall top 10. It requests
these top 10 docs from the relevant shards and returns them to you,
discarding the other 40.
Now if you ask for the top 10,000 results, the receiving node has to sort
through 50,000 docs before discarding 40,000 of them. You see how quickly
it can get out of control.
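To make the arithmetic concrete, here is a rough Python simulation of that gather step. It is not how Elasticsearch is implemented internally; the function names and the random timestamps are made up for illustration, and it only counts how many candidates the receiving node has to sort.

```python
import heapq
import random


def top_k_from_shard(timestamps, k):
    """Each shard returns its own k best (here: largest) timestamps."""
    return heapq.nlargest(k, timestamps)


def scatter_gather(shards, k):
    """Collect k candidates per shard, merge them, keep the overall top k.

    Returns (top_k, candidates_seen, candidates_discarded).
    """
    candidates = []
    for shard in shards:
        candidates.extend(top_k_from_shard(shard, k))
    top_k = heapq.nlargest(k, candidates)
    return top_k, len(candidates), len(candidates) - len(top_k)


# Hypothetical data: 5 shards holding 20,000 random timestamps each.
random.seed(0)
shards = [[random.randrange(10**9) for _ in range(20_000)] for _ in range(5)]

for k in (10, 10_000):
    top, seen, discarded = scatter_gather(shards, k)
    print(f"top {k}: receiving node sorts {seen} candidates, discards {discarded}")
```

With 5 shards, asking for 10 results means merging 50 candidates, and asking for 10,000 means merging 50,000, which matches the numbers above.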
On 16 May 2013 19:53, vinod eligeti veligeti999@gmail.com wrote:
Well, I thought of the second option already and it seems that's the only
viable way. However, I do have a question about the approach.
For example, I have a predefined range, let's say the last 15 minutes, and
within that time frame I get 10,000 messages which match my query. I sort by
timestamp and return the first 500. What I read in the guide is that ES loads
the timestamps of all 10,000 messages, as it has to do the sorting, and then
returns only, let's say, 500, since that is the limit I set on the query.
When the user goes to the next page, I again have to take the last
timestamp of those 500 records and issue a query which sorts (10,000 - 500)
records. Isn't it much better to get just the IDs of the 10,000 and show only
500 per page, and whenever the user clicks next, instead of issuing another
query, I just return the next 500? Of course I need to have an upper cap of,
let's say, 10,000, otherwise there will be a huge number of records for a
wider time range.
Sorry if my questions are too naive.
On Thu, May 16, 2013 at 10:39 AM, Clinton Gormley clint@traveljury.com wrote:
On 16 May 2013 19:04, vinod eligeti veligeti999@gmail.com wrote:
Well, I cannot control the user's behavior. If he wants to view the logs
for a machine from 2 days ago, then I have to design the query with sorting
based on timestamp and show the first 500 results. So that's the nature of
the requirement I have.
You can decide how you're going to implement it. You have two choices:
- take the naive approach of paging, the cost of which grows with every
page you go deeper, especially when you're talking about lots of logging data
- be a bit cleverer about it and use ranges, which will perform well on
every page, ie:
you request the first page (eg 500 results sorted by timestamp desc)
when you want the second page, take the earliest timestamp that you have
from the first page, and add a range filter that says: timestamp < $val
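As a rough sketch of the range approach in Python: the helper below only builds request bodies as plain dicts and does not talk to a cluster. The field name `timestamp`, the `host` match query and the helper names are assumptions for illustration. The `bool`/`filter` form is the query DSL of later Elasticsearch releases; on the versions current at the time of this thread the same range would be expressed as a filter on a filtered query.

```python
import json

TIMESTAMP_FIELD = "timestamp"  # assumed field name


def build_page_query(base_query, page_size=500, before=None):
    """Build a search body for one page, newest first.

    `before` is the earliest timestamp seen on the previous page;
    None means this is the first page.
    """
    bool_query = {"must": [base_query]}
    if before is not None:
        # timestamp < $val: only docs older than the previous page
        bool_query["filter"] = [{"range": {TIMESTAMP_FIELD: {"lt": before}}}]
    return {
        "size": page_size,
        "sort": [{TIMESTAMP_FIELD: {"order": "desc"}}],
        "query": {"bool": bool_query},
    }


def earliest_timestamp(hits):
    """Earliest timestamp among the hits of a page (hits carry '_source')."""
    return min(hit["_source"][TIMESTAMP_FIELD] for hit in hits)


# Hypothetical usage: first page, then second page from the first page's hits.
base = {"match": {"host": "web-01"}}
first_page = build_page_query(base)

fake_hits = [  # stand-in for the hits returned for the first page
    {"_source": {"timestamp": "2013-05-16T19:52:10"}},
    {"_source": {"timestamp": "2013-05-16T19:51:03"}},
]
second_page = build_page_query(base, before=earliest_timestamp(fake_hits))
print(json.dumps(second_page, indent=2))
```

Every page is then a fresh "top 500" request, so no page is more expensive than the first. One thing to watch with a strict `<`: if several docs share the boundary timestamp, some of them can be skipped, so you may want a tie-breaker or to handle duplicates yourself.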
--
You received this message because you are subscribed to a topic in the
Google Groups "elasticsearch" group.
To unsubscribe from this topic, visit
https://groups.google.com/d/topic/elasticsearch/TJKfuEv9AbU/unsubscribe?hl=en-US
.
To unsubscribe from this group and all its topics, send an email to
elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.