Elasticsearch & timeseries

Hi,

I'm trying to investigate whether Elasticsearch would be an acceptable
replacement for a specific type of message queue.
I know that this is not its intended use case, but I believe it makes sense
to explore this possibility and to treat it as a special
case of segment merge policy.

The use case I'm having is an append-only index:

  • lots of data comes in and CREATEs many ES documents (i.e. time series)
  • there is no document UPDATE & no document DELETE
  • each document has a timestamp that will be indexed and other fields
    that will just be stored (and the timestamp is the only field that will be
    searched for)
  • there is only one type of consumer (search query) : all documents with
    a timestamp more recent than X
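
The single consumer query described above can be sketched as a search body. This is only an illustration: the field name `timestamp`, the page size, and the helper name `newer_than_query` are assumptions, not something taken from the actual setup. It builds a standard `range` query with `gt` and sorts ascending so a consumer can resume from the last timestamp it saw.

```python
import json
from datetime import datetime, timezone


def newer_than_query(since, field="timestamp", size=100):
    """Build a search body for documents whose `field` is strictly after `since`.

    `field` and `size` are illustrative defaults, not values from the thread.
    """
    return {
        "query": {"range": {field: {"gt": since.isoformat()}}},
        # Oldest first, so the consumer can use the last hit as the next X.
        "sort": [{field: {"order": "asc"}}],
        "size": size,
    }


if __name__ == "__main__":
    x = datetime(2013, 6, 24, 15, 21, tzinfo=timezone.utc)
    print(json.dumps(newer_than_query(x), indent=2))
```

The body would be sent to the index's `_search` endpoint; the consumer then remembers the largest timestamp it received and uses it as the next X.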

The reasons I believe ES might be an acceptable (even if not perfect) fit
for this :

  • the underlying Lucene segments are append only (as is the message queue)
  • the search is very simple and as such the indexing of documents would
    have minimal performance penalty

From what I see now the challenge is :

  1. to configure the ES segment merge policy so that it only creates
    fixed-size Lucene index segments (so no segment merges, only new segments)
  2. to configure ES to keep open only the latest X Lucene index segments,
    and so avoid having too many open file descriptors
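
To see why the two points are linked, here is a toy model of how the number of live segments grows. It is an assumption-laden sketch, not Lucene's actual merge policy: every flush creates one segment, and (optionally) whenever `merge_factor` segments of the same level exist they are merged into one segment of the next level.

```python
def segment_counts(flushes, merge_factor=None):
    """Return the number of live segments after each flush.

    merge_factor=None models "never merge". Otherwise a simplified
    level-based policy is used (hypothetical, for illustration only).
    """
    levels = {}   # level -> number of segments currently at that level
    history = []
    for _ in range(flushes):
        levels[0] = levels.get(0, 0) + 1
        if merge_factor is not None:
            level = 0
            # Cascade merges upward while a level is full.
            while levels.get(level, 0) >= merge_factor:
                levels[level] -= merge_factor
                levels[level + 1] = levels.get(level + 1, 0) + 1
                level += 1
        history.append(sum(levels.values()))
    return history


if __name__ == "__main__":
    print(segment_counts(1000)[-1])        # 1000 segments without merging
    print(max(segment_counts(1000, 10)))   # 27 at most with merge_factor=10
```

Without merging, the segment count (and with it the number of files a searcher has to hold open) grows linearly with the number of flushes, which is why point 2 becomes necessary as soon as point 1 is achieved.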

So, does this make sense, and if so, how could it be done? I went through the
existing merge policies
(http://www.elasticsearch.org/guide/reference/index-modules/merge/) and
none seems a good fit.

Cheers
Paul.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

About 2), you can use rolling indexes with an alias on top of them.
So create a new index every day, modify the alias, remove (or close) the oldest index.

A closed index does not use resources anymore (only disk space). If you remove it, you will get back your disk space.
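
A minimal sketch of that daily rollover, assuming one index per day named `<prefix>-YYYY.MM.DD`. The prefix `events`, the alias name, and the retention period are made up for illustration. The `aliases` value follows the add/remove actions format of the `_aliases` endpoint; the outer keys `create` and `close_or_delete` are just labels used by this sketch, not part of any API.

```python
from datetime import date, timedelta


def daily_index(prefix, day):
    """Name of the index holding documents for `day` (naming is an assumption)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def rollover_plan(prefix, alias, today, keep_days):
    """Describe the steps for one daily rollover.

    The alias ends up covering the most recent `keep_days` daily indexes.
    """
    new_index = daily_index(prefix, today)
    expired = daily_index(prefix, today - timedelta(days=keep_days))
    return {
        "create": new_index,
        "aliases": {
            "actions": [
                {"add": {"index": new_index, "alias": alias}},
                {"remove": {"index": expired, "alias": alias}},
            ]
        },
        "close_or_delete": expired,
    }


if __name__ == "__main__":
    plan = rollover_plan("events", "events-recent", date(2013, 6, 24), keep_days=7)
    print(plan["create"])            # events-2013.06.24
    print(plan["close_or_delete"])   # events-2013.06.17
```

Consumers always query the alias, so they never need to know which daily indexes currently sit behind it.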

Does that answer your needs?

See: Elasticsearch Platform — Find real-time answers at scale | Elastic

My 0.01 cents

--
David Pilato | Technical Advocate | Elasticsearch.com
@dadoonet | @elasticsearchfr | @scrutmydocs

Hi David,

Thanks for the answer. I agree with you that rolling indexes would improve
the overall architecture, but I believe this would cover only
part of the problem : How to handle a lot of data over time. This indeed is
one part of my problem.

My intention with the initial question was in fact to address the second
problem : write performance. That's the reason why I wanted
to improve things at index level. At index level, there are 3 things that
affect performance : indexing data, segment merging & queries.
My solution tries to keep indexing & queries to a minimum. But I still
haven't found a good solution for the segment merging policy.

To make my point more focused: I'm hoping to get write performance for
the Lucene segments as close as possible to a native
FS file write. That's why I want to eliminate any segment merging inside
the index, as this would affect write performance.

So I hope that somebody might know a way to configure the segment merge
policy in this way : no segment merging, all segments have the
same size.
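
To put a rough number on the cost of merging, here is a back-of-the-envelope model. It is a simplification and not how Lucene's merge policies actually schedule work: each flush writes one unit of data, and each merge rewrites the full size of the segment it produces.

```python
def units_written(flushes, merge_factor=None):
    """Total data units written to disk, counting flushes and merge rewrites.

    merge_factor=None models "never merge". Otherwise `merge_factor`
    equal-sized segments are merged into one (toy policy, an assumption).
    """
    counts = {}   # level -> number of segments at that level
    written = 0
    for _ in range(flushes):
        counts[0] = counts.get(0, 0) + 1
        written += 1  # the flush itself
        level = 0
        while merge_factor is not None and counts.get(level, 0) >= merge_factor:
            counts[level] -= merge_factor
            counts[level + 1] = counts.get(level + 1, 0) + 1
            # The merged segment is written out in full.
            written += merge_factor ** (level + 1)
            level += 1
    return written


if __name__ == "__main__":
    print(units_written(1000))       # 1000: every unit written exactly once
    print(units_written(1000, 10))   # 4000: each unit rewritten three more times
```

In this model merging multiplies the bytes written by roughly 1 + log(flushes) in base `merge_factor`, which is the overhead the "no merging" idea is trying to avoid.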

Cheers
Paul.

Hi Paul,

Have you considered Kafka for a message queue?
People have also written message queues on top of HBase and Cassandra, both
having great write performance.

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

Hi Otis,

I agree with you that Apache Kafka is the best option for this. Thanks for
suggesting this.

But I believe that in some cases it's good to reduce your tech stack if
possible.
So if you're already using ES everywhere in your system then sometimes it
could make sense to implement a message queue on top of it
just to get rid of the operational complexity of having another parallel
distributed system in your backend.

But I suppose if you cannot disable those segment merges & control the open
segments (not the open indexes), then implementing a queue
on top of an ES index is much too expensive. I believe that the
optimisation should always start at index level. If you can't get this
right, then it won't help to move the problem up by using multiple indexes.

Cheers
Paul.

I'm not associated with Kafka in any way, but I can tell you it's a
pleasure to run - you never really see it. Very low CPU and memory
footprint, very fast, very simple to operate, very few moving pieces - it
just works. We use it in both services mentioned in my signature below.

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

Thanks for all the help, Otis. I saw that you work for Sematext. Cool
company btw.
