Interesting profile. What are your server specs?
The 2 billion row test ran on: 2U server, 4 7200RPM disks in RAID-5, 2 quad-core CPUs, 16 GB memory (8 GB heap).
The 20 billion row test will run on something slightly beefier: 2U server, 5 7200RPM disks in RAID-5, 2 quad-core CPUs, 24 GB memory (12 GB or 16 GB heap).
What's your query load like?
Inserting at a sustained 3000 docs/sec, with bursts up to 6000 docs/sec. It takes 7-10 days to load 2 billion docs. I am testing multiple indexers to see if I can increase the indexing rate to 20k docs/sec so that a larger test will be easier.
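As a side note, the usual way to push indexing rates higher is the bulk API with several writers running in parallel. A minimal sketch of the idea (my illustration, not our actual indexer; host, index, type, and field names are made up):

    import json
    import requests

    # Batch documents into a single _bulk request; run several of these
    # loops in parallel processes to multiply throughput.
    def bulk_index(docs, index="logs-2011.12.07", host="http://localhost:9200"):
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": index, "_type": "doc"}}))
            lines.append(json.dumps(doc))
        resp = requests.post(host + "/_bulk", data="\n".join(lines) + "\n")
        resp.raise_for_status()

    bulk_index([{"msg": "example", "level": "INFO"}] * 1000)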
This is all done on a single server rather than a distributed model, which of course would allow even better performance. I haven't done the RAID-10 vs. RAID-5 tests yet. We need to maximize disk capacity in a single physical enclosure, so we use the biggest disks available, which means dealing with the slower speeds of 7200RPM drives and RAID-5.
Our environment is write-intensive. Queries are relatively infrequent, perhaps 1000 per day at most. This probably means we should not be using RAID-5, which carries a write penalty but no read penalty, but we want the disk capacity. Since 6000 msgs/sec is sufficient for us at present, I haven't done much testing on disk write performance.
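For a rough sense of that penalty, a back-of-the-envelope calculation (my assumptions, not measurements: ~75 random IOPS per 7200RPM disk; 4 I/Os per small RAID-5 write vs. 2 for RAID-10):

    # Classic RAID write-penalty arithmetic: RAID-5 turns one small random
    # write into 4 I/Os (read data, read parity, write data, write parity);
    # RAID-10 needs 2 (write both mirrors).
    DISK_IOPS = 75   # assumed per-disk random IOPS at 7200RPM
    DISKS = 4

    raw = DISK_IOPS * DISKS
    print("RAID-5  random-write IOPS ~", raw / 4)   # ~75
    print("RAID-10 random-write IOPS ~", raw / 2)   # ~150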
For the queries themselves, as with any indexing app, queries spanning lots of data always take a long time. It was discussed on the mailing list how a query cannot be "cancelled", for example when a user makes a mistake, queries data for the last 10 months instead of the last 10 days, and wants to cancel the query. In our app, having lots of key/value pairs helps make the queries much faster. We can also tolerate a lag of at least 30 seconds from when documents are inserted to when they are available in a query, perhaps even longer, but we flush every 30 seconds.
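That 30-second window corresponds to relaxing the index refresh. A minimal sketch (my illustration; the index name is made up, and I'm using refresh_interval, which is the setting that governs when docs become searchable):

    import requests

    # Raise the refresh interval from the default 1s to 30s so newly
    # indexed docs become searchable within ~30 seconds.
    resp = requests.put(
        "http://localhost:9200/logs-2011.12.07/_settings",
        data='{"index": {"refresh_interval": "30s"}}',
    )
    print(resp.status_code, resp.text)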
I'm working on another project using 100 Amazon EC2 micro instances, but first I need to build the automation layer, as I don't want to administer 100 instances by hand. In this case I need the distributed read performance and will be testing HBase, MongoDB, and ES in separate trials. I am not sure the micro instances give me enough memory, but I am trying to build something relatively inexpensive. Someone suggested IRC chat; we should have a weekly one-hour chat session.
On Wed, Dec 7, 2011 at 5:12 PM, Michael Sick <michael.sick@serenesoftware.com> wrote:
Tom,
Interesting profile. What are your server specs? What's your query load like?
On Wed, Dec 7, 2011 at 8:09 PM, Tom Le dottom@gmail.com wrote:
We have inserted 2 billion documents: one index per day, one server, no replicas, 500 days (= 500 indexes = 500 shards). Average document size is 600 bytes (source and all key/value pairs), with compression enabled when the source > 500 bytes. The only issues we had were memory consumption, which was resolved by adjusting the max segment size, and disk usage, since data is not stored compressed the way it is in some commercial solutions.
Am currently testing 20 billion documents. This is all on a single
server.
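For anyone wanting to reproduce the per-day layout, a minimal sketch (my illustration, not our exact setup; index and type names are made up, and the _source compress options are the 0.x-era mapping settings):

    import json
    import requests

    # One index per day, one shard, no replicas, with _source compressed
    # only when it exceeds 500 bytes.
    body = {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "doc": {
                "_source": {"compress": True, "compress_threshold": "500b"}
            }
        },
    }
    resp = requests.put("http://localhost:9200/logs-2011.12.07",
                        data=json.dumps(body))
    print(resp.status_code, resp.text)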
On Dec 7, 2011 3:42 PM, "Karussell" tableyourtime@googlemail.com wrote:
Here is one approach: "Convenient rolling index method" (elastic/elasticsearch issue #1500 on GitHub).
Would be nice as a plugin though ...
Regards,
Peter.
On 7 Dec., 23:34, Michael Sick michael.s...@serenesoftware.com wrote:
Anyone interested in kicking around some requirements/thoughts on IRC in the coming days on what's needed here? I have to build some of this and would be happy to write it as a plugin.
On Wed, Dec 7, 2011 at 11:56 AM, Berkay Mollamustafaoglu mber...@gmail.com wrote:
I think you're right. ES does not do this and it has to be done in the external app, but it's probably better to do it that way. ES provides the APIs and the external app can do all the index management, from creating/assigning aliases to creating/opening/closing/deleting indices. I think the external app can even go further and keep some metadata about the indices. For example, in the case of time-based indices like an index per day, the external app can track the start/end date and time of the docs so that it can determine which indices to run a query against, which index to reopen if it was closed, etc.
I'd imagine that there are many different use cases, hence it's probably better to keep these types of capabilities out of ES.
Berkay
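A minimal sketch of the bookkeeping described above (my illustration, not Berkay's code; assumes daily indices named logs-YYYY.MM.DD):

    from datetime import date, timedelta

    # Given the date range a query covers, derive the daily index names to
    # search, so the query hits 10 indices instead of all 500.
    def indices_for_range(start, end, prefix="logs-"):
        names, day = [], start
        while day <= end:
            names.append(prefix + day.strftime("%Y.%m.%d"))
            day += timedelta(days=1)
        return names

    names = indices_for_range(date(2011, 11, 27), date(2011, 12, 7))
    print(",".join(names))  # pass as the comma-separated index list in the search URL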
What might be nice to add to ES is the ability to say:
- "Hey, I'm interested in searching only the last N shards, so please manage the index alias for me to point to only the last N shards, so I don't have to manage this from my app"
or
- "Hey, I'm interested in searching only shards that contain documents added in the last N days, so please manage the index alias for me to point to only the last N shards"
I think neither is doable today, other than managing the index alias and its mapping to appropriate indices or shards manually, with an external app that talks to ES to add/remove indices from the alias?
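That manual alias management can be scripted today. A minimal sketch using the _aliases endpoint (the endpoint is real; the index and alias names here are made up):

    import json
    import requests

    # Atomically repoint a "last-10-days" alias at the most recent daily
    # indices: drop the day that aged out, add the new day.
    def repoint_alias(alias, drop, add, host="http://localhost:9200"):
        actions = [{"remove": {"index": i, "alias": alias}} for i in drop]
        actions += [{"add": {"index": i, "alias": alias}} for i in add]
        resp = requests.post(host + "/_aliases",
                             data=json.dumps({"actions": actions}))
        resp.raise_for_status()

    repoint_alias("last-10-days",
                  drop=["logs-2011.11.27"],
                  add=["logs-2011.12.07"])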