Sharding by time

Hello,

I've got a question about the design of the Elasticsearch deployment in the
project I'm currently working on. We want to divide the data by date because of
the amount of data we have. We need to keep the data 'alive' for 30 days;
after that we don't need it and we can discard it. So I was thinking about
creating an index (with multiple shards) for each day. All indexes older
than 30 days would be deleted. I was also thinking about using index
aliasing to hide the details of the multiple indexes.

The question is: is this the right approach, or is there a better way to
shard the data by time?

Thanks in advance,
Rafał Kuć

I think it's probably the most efficient way to do it.
Dropping an index is easy and quick.

BTW, you can have a look at the new TTL field:
http://www.elasticsearch.org/guide/reference/mapping/ttl-field.html

David.

--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet

Hi,

On Dec 5, 4:56 pm, "da...@pilato.fr" da...@pilato.fr wrote:

I think it's probably the most efficient way to do it.
Dropping an index is easy and quick.

The thing is, if you shard by day, then in 30 days you have 30
indices. Furthermore, if you want to index 3 different document types
and keep them separate, you have 30x3=90 indices. That is before any
per-index sharding.

On the other hand, if one were able to create 3 indices in an ES
cluster and then tell ES "Hey, create date shards automatically for
me", that would be awesome.

In terms of resources, though, is 90 indices == 90 shards?

BTW, you can have a look at the new TTL field: http://www.elasticsearch.org/guide/reference/mapping/ttl-field.html

Yeah, cool, but this is for purging documents older than TTL, if I
understand correctly.

What might be nice to add to ES is the ability to say:

  • "Hey, I'm interested in searching only the last N shards, so please
    manage the index alias for me to point to only last N shards, so I
    don't have to manage this from my app"
    or
  • "Hey, I'm interested in searching only shards that contain documents
    added in the last N days, so please manage the index alias for me to
    point to only last N shards"

I think neither is doable today, other than managing the index alias
and its mapping to appropriate indices or shards manually, with an
external app that talks to ES to add/remove indices from the alias?

Thanks,
Otis

Sematext is Hiring World-Wide -- http://sematext.com/about/jobs.html
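The "manage the alias to point at the last N indices" idea above is exactly what the external app would have to do today. A minimal sketch, assuming a made-up `logs-YYYY.MM.DD` naming pattern: compute the wanted set, diff it against the alias's current members, and emit the add/remove actions the app would send to the aliases endpoint.

```python
from datetime import date, timedelta

def last_n_indices(today, n, prefix="logs"):
    """Names the 'last N days' alias should point at (pattern is made up)."""
    return {f"{prefix}-{today - timedelta(days=i):%Y.%m.%d}" for i in range(n)}

def alias_actions(current, today, n, alias, prefix="logs"):
    """Actions an external app would send to the aliases API to roll the
    alias forward: add today's index, remove the ones that fell out of
    the window. `current` is the set of indices the alias points at now."""
    wanted = last_n_indices(today, n, prefix)
    add = [{"add": {"index": i, "alias": alias}} for i in sorted(wanted - current)]
    rm = [{"remove": {"index": i, "alias": alias}} for i in sorted(current - wanted)]
    return add + rm
```

Run once a day, this keeps the alias pointed at exactly the last N daily indices without the search clients knowing anything about the rotation.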


Hi,

I've had a similar conversation some time ago which should help you
figure this out:
https://mail.google.com/mail/?ui=2&shva=1#search/label%3Alist%3Aelasticsearch+rolling+window/12ad3d89e58fdd84

Here we index between 10M and 20M documents per day, and we've used both
weekly and daily index strategies (with various sharding/replica
settings depending on cluster size). We are currently using daily indices
because we need the finer granularity of choosing the required indices
at the day level. We also keep a bit more than a month's worth of live
indices, but rarely span our search requests over more than a few days.

Since our search requests are usually bounded by a date range which
can vary from request to request, aliasing was not very useful for us.
Instead, I decided to name the daily indices using a pattern that
includes the day-of-year, and to create an index selector which outputs
the required indices for a given date range.

Hope this helps,
Colin
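Colin's index-selector approach can be sketched as follows. He mentions a day-of-year pattern; for readability this sketch assumes a hypothetical `logs-YYYY.MM.DD` pattern instead, but the idea is the same: expand the query's date range into the index names and put them in the request path rather than using an alias.

```python
from datetime import date, timedelta

def select_indices(start, end, prefix="logs"):
    """Expand an inclusive [start, end] date range into the daily index
    names to query; joined with commas they form the request path, e.g.
    GET /logs-2011.12.05,logs-2011.12.06/_search."""
    return [f"{prefix}-{start + timedelta(days=i):%Y.%m.%d}"
            for i in range((end - start).days + 1)]
```

Because the range varies per request, the selector runs in the app for every search instead of a cluster-side alias being updated once a day.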


Hi,

On Dec 6, 1:29 pm, Colin Surprenant colin.surpren...@gmail.com
wrote:

Hi,

I've had a similar conversation some time ago which should help you
figure this out: https://mail.google.com/mail/?ui=2&shva=1#search/label%3Alist%3Aelast...

Thanks Colin.
The URL is: http://groups.google.com/a/elasticsearch.com/group/users/browse_frm/thread/faf866dc58875bea?tvc=1&q=surprenant+rolling

Unfortunately, this doesn't really touch on the questions I asked,
except that it confirms that 90 indices is the same as 90 shards (assuming
no replication) as far as resources go.

Thanks anyway!

Otis

Sematext is Hiring World-Wide -- http://sematext.com/about/jobs.html


Colin++,

So is the right thing here some type of index manager plugin that takes a
set of rules/thresholds and does the daily upkeep of an index? This seems
to be a common use case.

--Mike


Heya,

Yeah, that sounds like the right approach. You will probably need only 1
shard for each daily index (the lower the number of shards per daily index,
the fewer resources / nodes you will need). As people suggested on a thread
that spawned a bit from answering your question, deleting an index is a much
more lightweight operation than deleting documents.

Things that can simplify usage are aliases and index templates
(which let you define custom configuration / mapping templates that
apply to newly created indices).

-shay.banon
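The index-template idea Shay mentions means each new daily index picks up its settings automatically when it is created. A minimal sketch of what the template body might look like for one-shard daily indices; the template name, pattern, and values are illustrative, not taken from the thread.

```python
def daily_index_template(pattern="logs-*", shards=1, replicas=0):
    """Body for registering an index template: every index whose name
    matches `pattern` is created with these settings, so each new day's
    index needs no explicit setup. Names/values here are illustrative."""
    return {
        "template": pattern,  # name pattern the template applies to
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
    }
```

The external app would register this once (e.g. via a PUT to the template endpoint) and then simply index into `logs-2011.12.08` and so on, letting auto-creation plus the template do the rest.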


I think you're right. ES does not do this and it has to be done in an
external app, but it's probably better to do it that way.
ES provides the APIs, and the external app can do all the index management,
from creating/assigning aliases to
creating/opening/closing/deleting indices. The external app can
even go further and keep some metadata about the indices. For example, in
the case of time-based indices (an index per day, etc.), the
external app can track the start/end date and time of the docs, so that it can
determine which indices to run a query against, which index to reopen if
it was closed, etc.

I'd imagine there are many different use cases, hence it's probably
better to keep these types of capabilities out of ES.

Berkay


Hello,

Thanks for all the answers :)

Regards
Rafał Kuć

Anyone interested in kicking around some requirements/thoughts on IRC in
the coming days on what's needed here? I have to build some of this and would
be happy to write it as a plugin.


Here is one approach:

https://github.com/elasticsearch/elasticsearch/issues/1500

Would be nice as a plugin though ...

Regards,
Peter.


We have inserted 2 billion documents: one index per day, one server, no
replicas, 500 days (= 500 indexes = 500 shards). The average document size is
600 bytes (source and all key/value pairs), with compression enabled when the
source is > 500 bytes. The only issues we had were memory consumption, which
we resolved by adjusting the max segment size, and disk usage, since the data
is not stored compressed the way it is in some commercial solutions.

I am currently testing 20 billion documents. This is all on a single
server.

Tom,

Interesting profile. What are your server specs? What's your query load like?


Interesting profile. What are your server specs?

The 2 billion rows were tested on: a 2U server, 4 x 7200 RPM disks in RAID-5,
2 quad-core CPUs, 16 GB memory (8 GB heap size).

The 20 billion row test will be on slightly beefier hardware: a 2U server,
5 x 7200 RPM disks in RAID-5, 2 quad-core CPUs, 24 GB memory (12 or 16 GB heap size).

What's your query load like?

Inserting at a sustained 3000 docs/sec, with bursts up to 6000 docs/sec.
It takes between 7 and 10 days to load 2 billion docs. I am testing multiple
indexers to see if I can increase the indexing rate to 20k/sec so a large test
will be easier.

This is all done on a single server rather than a distributed model, which of
course would allow even better performance. I haven't done the RAID-10 vs.
RAID-5 tests yet. We need to maximize disk capacity in a single physical
enclosure, so we use the biggest disks available, which means dealing
with the slower speeds of 7200 RPM and RAID-5.

Our environment is write intensive. Queries are relatively infrequent,
perhaps 1000 per day maximum. This probably means we should not be using
RAID-5, which has a write penalty but no read penalty, but we want the disk
capacity. Since 6000 msgs/sec is sufficient for us at present, I haven't
done much testing on disk write performance.

For the queries themselves: like with any indexing app, queries spanning
lots of data always take a long time. It was discussed on the mailing list
how a query cannot be "cancelled"; for example, a user makes a mistake and
queries data for the last 10 months instead of the last 10 days, and wants to
cancel the query. In our app, having lots of key/value pairs helps make
the queries much faster. We can also tolerate a lag of at least 30 seconds
from when documents are inserted to when they are available in a query,
perhaps even longer, but we flush at 30 seconds.

I'm working on another project using 100 Amazon EC2 micro instances, but
first I need to build the automation layer, as I don't want to administer 100
instances by hand. In this case I need the distributed read performance and will
be testing HBase, MongoDB, and ES in separate trials. I am not sure the
micro instances give me enough memory, but I am trying to build something
relatively inexpensive. Someone suggested IRC chat; we should have a
weekly one-hour chat session.


Hi Tom,

Thanks for all of the detail. I have some questions that you and Shay could
shed some light on.

On Thu, Dec 8, 2011 at 4:52 AM, Tom Le dottom@gmail.com wrote:

Interesting profile. What are your server specs?

The 2 billion rows was tested on: 2U server, 4 7200RPM disks RAID-5, 2
CPU's (quad-core), 16-gb memory (8-gb heap size).

In this config or the one below, how do you get around having a single
node? Is your data stored elsewhere? Even so, a 7-day window to recreate
would be a pain.

The 20 billion row test will be on slightly beefier: 2U server, 5 7200RPM
disks RAID-5, 2 CPU's (quad-core), 24-gb memory (12gb or 16-gb heap size)

What's your query load like?

Inserting at a sustained 3000 docs/sec, with bursts up to 6000 docs/sec.
Takes between 7-10 days to load 2 billion docs. I am testing multiple
indexers to see if I can increase indexing rate to 20k/sec so a large test
will be easier.

This is all done on a single server rather than distributed model, which
of course would allow even better performance. I haven't done the RAID-10
vs. RAID-5 tests yet. We need to maximize disk capacity in a single
physical enclosure so use the biggest disks available, which means you have
to deal with slower speeds at 7200RPM and RAID-5.

Our environment is write intensive. Queries are relatively infrequent,

perhaps 1000 per day maximum. This probably means we should not be using
RAID-5 which has write penalty but no read penalty, but we want the disk
capacity. Since 6000 msgs/sec is sufficient for us at present, I haven't
done much testing on disk write performance.

Do you know what's limiting your write speed (CPU, I/O, ...)? Do you think
you'd get better I/O using JBOD vs. RAID?

Shay - even on a single machine, will ES ensure replicas are on different
disks if given multiple paths for the index?

For the queries themselves, Like with any indexing app, queries spanning
lots of data always take a long time. On the mailing list was discussed
how a query could not be "cancelled", for example, user makes a mistake and
queries data for last 10 months instead of last 10 days, and wants to
cancel the query. In our app, having lots of key/value pairs helps makes
the queries much faster. We can also tolerate a lag of at least 30 seconds
from when documents are inserted to when they are available in a query,
perhaps even longer but we flush at 30-seconds.

I'm working on another project using 100 Amazon EC2 micro instances, but
first need to build the automation layer as I don't want to administer 100
instances. In this case I need the distributed read performance and will
be testing HBase, MongoDB, and ES in separate trials. I am not sure the
micro instances give me enough memory, but am trying to build something
relatively inexpensively. Someone suggested IRC chat, we should have a
weekly hour chat session.

Well, you can start with micro instances and see how the various configurations
work for you. I've been testing Cassandra vs. HBase and find it competitive as
well. Basically I'm going to be writing to either HBase or Cassandra if a
client wants a write-behind data store, and then syncing to ES. There are
many good things about HBase, but I would not make the claim that it's easy
to administer.


Hi Tom,

I wanted to know how you created the daily indices. I would also like to keep
only 30 days of data; how can I go about that? I do not want
to manually delete the old indices after 30 days.
