How to fetch time series messages

Hi,

I have a general question about how to page through time series data. What I
am trying to do is store logs that are stamped with a time. The doc type
looks like

log
{
machine : String
timestamp : date
message : String

... bunch of other fields
}

machine is the host name of the machine from which the logs are retrieved
and stored in ES. I have a bunch of other custom fields to store, which is
the reason why I am not using Logstash.

So my question: if I want to fetch data for a particular machine within a
given time range, I can create a filter for that time range, sort the
results, and fetch the data. I planned to use scroll to fetch the data;
however, scroll does not support sorting, which invariably leads to the
question: how do I fetch the data sorted by timestamp?
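
For concreteness, the kind of request I have in mind is roughly the following
(the index name "logs" and the host name are made up, and I am assuming the
machine field is mapped not_analyzed so the term query matches exactly):

curl -XGET 'localhost:9200/logs/log/_search' -d '{
  "query": {
    "filtered": {
      "query": { "term": { "machine": "host-01" } },
      "filter": {
        "range": {
          "timestamp": { "gte": "2013-05-13T00:00:00", "lt": "2013-05-13T01:00:00" }
        }
      }
    }
  },
  "sort": [ { "timestamp": { "order": "desc" } } ],
  "size": 100
}'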

I apologize if this question is already answered.

thanks

It appears you would have to stick with standard queries instead of scroll.

One of the main issues with using queries to page through documents is that
order is not preserved, causing documents to be duplicated or missed.
However, your use case is ordering historical documents, so the issue is
alleviated. Are you worried about performance?

Cheers,

Ivan

My concern is that if I have to fetch, for example, 10,000 documents for the
last 15 minutes all at once, it will be a huge list, and assuming I have to do
some sort of pagination, say 100 documents at a time in the GUI, then I have
to implement pagination on the server side.

So with an ES query I would first get just the ids of the 10,000 docs (not the
_source field, since that is a lot of data and hurts performance), maintain
state between the GUI and the server with some sort of query id, similar to a
scroll id, fetch the latest 100 documents with _source, and whenever the user
pages next/back, fetch the corresponding documents with _source. If my
understanding is correct, is that what you are referring to, Ivan?
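
Concretely, I was thinking of something roughly like this (same made-up index
and host names as in my first mail; I still need to verify that an empty
fields list really suppresses _source in our ES version, and the ids in the
second request are just placeholders):

curl -XGET 'localhost:9200/logs/log/_search' -d '{
  "query": {
    "filtered": {
      "query": { "term": { "machine": "host-01" } },
      "filter": { "range": { "timestamp": { "gte": "now-15m" } } }
    }
  },
  "sort": [ { "timestamp": { "order": "desc" } } ],
  "fields": [],
  "size": 10000
}'

and then, for each GUI page, fetch the actual documents for that slice of ids:

curl -XGET 'localhost:9200/logs/log/_mget' -d '{
  "ids": ["id-1", "id-2", "id-3"]
}'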

thanks

Why do you need the list of ids for all 10K docs? A query returns the count of
matching documents but only a subset of those documents. How often do you
envision users paging far into the results?
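
With plain from/size paging, each page is simply another search request, and
the total count comes back in hits.total. Roughly (sizes are arbitrary):

curl -XGET 'localhost:9200/logs/log/_search' -d '{
  "query": { "match_all": {} },
  "sort": [ { "timestamp": { "order": "desc" } } ],
  "from": 100,
  "size": 100
}'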

--
Ivan

The query itself matches 10K docs, Ivan. These are logs, so a user goes and
browses the last 30 minutes or a few hours, which will fetch 10K or more
documents. This is a pretty common usage pattern.

Logs are particularly easy to query as they have timestamps. Instead of asking
for "page 10", you could just ask for the "most recent 100 entries before
timestamp X".

clint

I mention that because deep paging in a distributed environment can become
very costly. The effort required to sort docs grows exponentially the deeper
you go.

clint

Well, I cannot control the user's behavior. If they want to view the logs for
a machine from 2 days ago, then I have to design the query with sorting based
on timestamp and show the first 500 results. That's the nature of the
requirement I have.

On 16 May 2013 19:04, vinod eligeti veligeti999@gmail.com wrote:

Well, I cannot control the user's behavior. If they want to view the logs for
a machine from 2 days ago, then I have to design the query with sorting based
on timestamp and show the first 500 results. That's the nature of the
requirement I have.

You can decide how you're going to implement it. You have two choices:

  1. take the naive approach of paging, the cost of which grows
     exponentially, especially when you're talking about lots of logging data
  2. be a bit cleverer about it and use ranges, which will perform well on
     every page, i.e.:

you request the first page (e.g. 500 results sorted by timestamp desc)

when you want the second page, take the earliest timestamp that you have from
the first page, and add a range filter that says: timestamp < $val
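
As a rough sketch (index, type and host names are taken from the earlier
examples in this thread, and the timestamps are only illustrative):

# first page: the 500 most recent entries in the window
curl -XGET 'localhost:9200/logs/log/_search' -d '{
  "query": {
    "filtered": {
      "query": { "term": { "machine": "host-01" } },
      "filter": { "range": { "timestamp": { "gte": "now-2d" } } }
    }
  },
  "sort": [ { "timestamp": { "order": "desc" } } ],
  "size": 500
}'

# second page: identical, except the range is bounded above by the earliest
# timestamp returned on the first page
curl -XGET 'localhost:9200/logs/log/_search' -d '{
  "query": {
    "filtered": {
      "query": { "term": { "machine": "host-01" } },
      "filter": {
        "range": { "timestamp": { "gte": "now-2d", "lt": "2013-05-14T10:03:22" } }
      }
    }
  },
  "sort": [ { "timestamp": { "order": "desc" } } ],
  "size": 500
}'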

Well, I thought of the second option already, and it seems that's the only
viable way; however, I do have a question about the approach.

For example, I have a predefined range, let's say the last 15 minutes, and
within that time frame I get 10,000 messages that match my query; I sort by
timestamp and return the first 500. What I read in the guide is that ES loads
all the timestamps of the 10,000 messages, as it has to do the sorting, and
then returns only, let's say, 500, since that is the limit I set on the query.
When the user goes to the next page, I then have to take the earliest
timestamp of those 500 records and issue a query that sorts the remaining
(10,000 - 500) records. Isn't it much better to get just the ids of the
10,000, show only 500 per page, and whenever the user goes next, return the
next 500 instead of hitting another query? Of course, I would need an upper
cap of, let's say, 10,000; otherwise there would be a huge number of records
for a wider time range.

Sorry if my questions are too naive.

When you sort on a field, Elasticsearch loads the values for that field into
memory, but it doesn't load the values just for the 10,000 matching results.
It loads that field for ALL documents in your index. The logic is: you may
need these 10,000 now, but you'll probably need a different 10,000 on another
request.

So actually, rerunning the query isn't as costly as you think. ES has already
done the hard work for all your docs anyway.

Sorting large numbers of docs is expensive. Let's say you have 5 shards, and
you want the top 10 records. Each shard has to return the timestamps for its
own 10 best results. So the receiving node gets 50 results and sorts them into
the final list of the overall top 10. It then requests those top 10 docs from
the relevant shards and returns them to you, discarding the other 40.

Now if you ask for the top 10,000 results, the receiving node has to sort
through 50,000 docs before discarding 40,000 of them. You see how quickly it
can get out of control.
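
(To put rough numbers on it: with 5 shards, pages of 500 and from/size paging,
asking for page 20 means from=9500 and size=500, so each shard has to return
its top 10,000 sorted entries and the receiving node has to merge 5 x 10,000 =
50,000 of them, only to hand back 500.)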

Thanks for taking the time to explain the behavior; however, I do have a few
follow-up questions.

Let's assume that in a given index the number of records matching a given
query is 10K, but there are 10 million messages in that index. Then, according
to you, sorting is done over 10 million messages, which is REALLY huge. Isn't
there a better way to accomplish this, sorting only the 10K matching messages
rather than all of them? Also, in our environment there is a high chance that
at least 10 billion messages are inserted on any given day, as we plan to
divert a lot of machines' logs to ES and partition on a per-day basis. So
effectively you are saying that all these 10 billion messages will be sorted.

{{So actually, rerunning the query isn't as costly as you think. ES has
already done the hard work for all your docs anyway.}}

Are you saying that this sorted order is somehow stored once the user performs
the initial sort?

On 16 May 2013 20:21, vinod eligeti veligeti999@gmail.com wrote:

Let's assume that in a given index the number of records matching a given
query is 10K, but there are 10 million messages in that index. Then, according
to you, sorting is done over 10 million messages, which is REALLY huge. Isn't
there a better way to accomplish this, sorting only the 10K matching messages
rather than all of them?

No. I'm not saying that it sorts all 10M docs. It just sorts the matching
docs. But in order to sort, it has to have the field values loaded into
memory, and it loads the value of that field for ALL docs, not just for the
matching docs.

{{So actually, rerunning the query isn't as costly as you think. ES has
already done the hard work for all your docs anyway.}}

Are you saying that this sorted order is somehow stored once the user performs
the initial sort?

Not that the order is stored, but that the field values are already loaded.
The I/O is the slow part, not the sorting.

One last question: when you say ES loads the field for all messages, what
strategy does it use to load them? Does it load in batches? If so, are there
settings for that? Because the queries that I am suggesting have some serious
memory impact.

It loads the values for each segment, and keeps them in memory. As you
index, new segments are created and have their values loaded. As old
segments are merged into new segments, the values from the old segments are
removed from memory.

You can use the warmers API to ensure that the values from new segments are
preloaded, so that the user won't experience a slow response when running a
query for the first time.
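
Registering a warmer for this use case would look roughly like this (index and
warmer names are just examples; check the warmers documentation for the exact
syntax in your version):

curl -XPUT 'localhost:9200/logs/_warmer/timestamp_sort' -d '{
  "query": { "match_all": {} },
  "sort": [ { "timestamp": { "order": "desc" } } ]
}'

Any search request can be used as the warmer body; the point is that it sorts
on timestamp, so the field values get loaded for each new segment before real
queries hit it.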

clint
