RFE: Making Rivers more useful

derrickburns · January 18, 2012, 2:37am

My understanding of an Elasticsearch River is that it simply pulls data into
an Elasticsearch index, but does nothing to manage the storage of the index
that it is filling.

I propose that a River model an ever growing index with elastic storage and
elastic compute when deployed in a cloud environment. As the River
receives data, it puts the data into the current index, until the current
index reaches a certain size. When the index reaches a certain size, a new
index is opened.
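
To make the size-based rollover concrete, here is a minimal sketch in plain Java. It makes no Elasticsearch calls; the class `SizeRollover`, its methods, the byte threshold, and the `name_N` naming scheme are all hypothetical and only model the bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of size-based rollover; not an Elasticsearch API. */
class SizeRollover {
    private final String baseName;
    private final long maxBytes;                 // assumed meaning of "a certain size"
    private final List<String> indices = new ArrayList<>();
    private long currentBytes = 0;

    SizeRollover(String baseName, long maxBytes) {
        this.baseName = baseName;
        this.maxBytes = maxBytes;
        openNext();
    }

    /** Records a document of the given size and returns the index that receives it. */
    String append(long docBytes) {
        // Once the current index has reached the threshold, open a new one first.
        if (currentBytes >= maxBytes) {
            openNext();
        }
        currentBytes += docBytes;
        return current();
    }

    /** The index currently being filled. */
    String current() {
        return indices.get(indices.size() - 1);
    }

    /** Every index opened so far, oldest first. */
    List<String> allIndices() {
        return new ArrayList<>(indices);
    }

    private void openNext() {
        indices.add(baseName + "_" + (indices.size() + 1));
        currentBytes = 0;
    }
}
```

In a real River the `openNext()` step is where the index would be created and, on a cloud, where new nodes or storage might be requested.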

A new index, like any index, can have a given fixed number of shards. When
a River allocates a new index, it also distributes the new shards to nodes.
A River could be configured to allocate new nodes and new storage, when on
a cloud. On AWS, the opening of a new index might be preceded by the
allocation of new EC2 nodes and new EBS volumes.

Finally, a River has a current index alias that lists the indices that are
searched when a query is received. When a new index is added to a river, it
is appended to the current index alias.

The current alias could be parameterized by time, meaning its content (the
list of indices) could be filtered by a "current alias filter." The
current alias filter would specify a set of constraints, say a "maximum
age" and a date field to use to ascertain document age. If an index has
no documents that are younger than the given maximum age, the index
is taken off the alias list and, optionally, closed.
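
A minimal sketch of such a filter, again in plain Java with no Elasticsearch calls. `AliasAgeFilter` and `searchable` are hypothetical names, and the sketch assumes the timestamp of the youngest document in each index is already known (for example, from the configured date field).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical "current alias filter"; not an Elasticsearch API. */
class AliasAgeFilter {
    /**
     * Returns the indices that should stay on the alias.
     *
     * @param newestDocPerIndex index name -> timestamp of its youngest document
     * @param maxAge            the "maximum age" constraint
     * @param now               the reference time
     */
    static List<String> searchable(Map<String, Instant> newestDocPerIndex,
                                   Duration maxAge, Instant now) {
        Instant cutoff = now.minus(maxAge);
        List<String> keep = new ArrayList<>();
        for (Map.Entry<String, Instant> e : newestDocPerIndex.entrySet()) {
            // Keep the index if at least its youngest document is within maxAge.
            if (!e.getValue().isBefore(cutoff)) {
                keep.add(e.getKey());
            }
        }
        return keep;
    }
}
```

Indices missing from the returned list are the ones to take off the alias and, optionally, close.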

Perhaps the River already does this, or is already conceived in this
manner.

Thoughts, Shay?

Karussell1 · January 18, 2012, 9:17am

Here is some code for the rolling index part:

"Convenient rolling index method" (github.com/elastic/elasticsearch, opened by karussell on 25 Nov 2011, 09:24PM UTC; closed 26 May 2014, 03:51PM UTC):

Here is some code that implements a rolling index pattern. Imagine you have a logical index named 'tweets', and you want to create a new index every day to keep the indices small (it's a more scalable form of 'sharding', but only if you have time-dependent data). With the proposed code you have to call rollIndex(maximumIndices) once a day. The new indices are all 'tagged' with a 'tweets_roll' alias (for later retrieval -> improvable?), and there is one group of indices for searching (tweets_search) and one for feeding (tweets_feed). By default it creates a search alias on all indices and a feed alias only on the very latest. It separates the search and the roll alias because one might want to keep some older indices without searching them.

What do you think? It's rather simple but it works. A simple test is below the code.

```
private static final String simpleDateString = "yyyy-MM-dd-HH-mm-ss";

public String rollIndex(int maxRollIndices) {
    return rollIndex(getIndexName(), maxRollIndices, maxRollIndices);
}

public String rollIndex(String indexName, int maxRollIndices, int maxSearchIndices) {
    String rollAlias = indexName + "_roll";
    SimpleDateFormat formatter = new SimpleDateFormat(simpleDateString);
    if (maxRollIndices < 1 || maxSearchIndices < 1)
        throw new RuntimeException("remaining indices, search indices and feeding indices must be at least 1");

    // get old aliases
    Map<String, AliasMetaData> allRollingAliases = getAliases(rollAlias);

    // always create new index and append aliases
    String searchAlias = getSearchIndexName();
    String feedAlias = getFeedIndexName();
    String newIndexName = indexName + "_" + formatter.format(new Date());
    createIndex(newIndexName);
    addAlias(newIndexName, searchAlias);
    addAlias(newIndexName, rollAlias);

    String oldFeedIndexName = null;
    if (allRollingAliases.isEmpty()) {
        // do nothing for now
    } else {
        // sort the existing indices by the date in their name, newest first
        TreeMap<Long, String> sortedIndices = new TreeMap<Long, String>(reverseSorter);
        String[] concreteIndices = getConcreteIndices(allRollingAliases.keySet());
        //logger.info("aliases:" + allRollingAliases + ", indices:" + Arrays.toString(concreteIndices));
        for (String index : concreteIndices) {
            int pos = index.indexOf("_");
            if (pos < 0)
                throw new IllegalStateException("index " + index + " is not in the format " + simpleDateString);

            String indexDateStr = index.substring(pos + 1);
            Long timeLong;
            try {
                timeLong = formatter.parse(indexDateStr).getTime();
            } catch (Exception ex) {
                throw new IllegalStateException("index " + index + " is not in the format "
                        + simpleDateString + " error:" + ex.getMessage());
            }
            String old = sortedIndices.put(timeLong, index);
            if (old != null)
                throw new IllegalStateException("indices with the identical date are not supported "
                        + old + " vs. " + index);
        }

        int counter = 1;
        Iterator<String> indexIter = sortedIndices.values().iterator();
        while (indexIter.hasNext()) {
            String currentIndexName = indexIter.next();
            if (counter >= maxRollIndices) {
                deleteIndex(currentIndexName);
                // delete all the older indices
                continue;
            }

            if (counter == 1)
                oldFeedIndexName = currentIndexName;

            if (counter >= maxSearchIndices)
                removeAlias(currentIndexName, searchAlias);

            counter++;
        }
    }
    if (oldFeedIndexName != null)
        moveAlias(oldFeedIndexName, newIndexName, feedAlias);
    else
        addAlias(newIndexName, feedAlias);

    return newIndexName;
}

public String getSearchIndexName() {
    return getIndexName() + "_search";
}

public String getFeedIndexName() {
    return getIndexName() + "_feed";
}

public void createIndex(String indexName) {
    client.admin().indices().create(new CreateIndexRequest(indexName).settings(createIndexSettings())).actionGet(timeout);
}

public XContentBuilder createIndexSettings() {
    if (createIndexSettings == null) {
        try {
            createIndexSettings = JsonXContent.contentBuilder().startObject().
                    field("index.number_of_shards", createIndexShards).
                    field("index.number_of_replicas", createIndexReplicas).
                    field("index.refresh_interval", "10s").
                    field("index.merge.policy.merge_factor", 10).endObject();
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
    }
    return createIndexSettings;
}

public void deleteIndex(String indexName) {
    client.admin().indices().delete(new DeleteIndexRequest(indexName)).actionGet();
}

public void addAlias(String indexName, String alias) {
    client.admin().indices().aliases(new IndicesAliasesRequest().addAlias(indexName, alias)).actionGet();
}

public void removeAlias(String indexName, String alias) {
    client.admin().indices().aliases(new IndicesAliasesRequest().removeAlias(indexName, alias)).actionGet();
}

public void moveAlias(String oldIndexName, String newIndexName, String alias) {
    client.admin().indices().aliases(new IndicesAliasesRequest().addAlias(newIndexName, alias).
            removeAlias(oldIndexName, alias)).actionGet();
}

public Map<String, AliasMetaData> getAliases(String index) {
    Map<String, AliasMetaData> md = client.admin().cluster().state(new ClusterStateRequest()).
            actionGet().getState().getMetaData().aliases().get(index);
    if (md == null)
        return Collections.emptyMap();

    return md;
}

private static Comparator<Long> reverseSorter = new Comparator<Long>() {
    @Override
    public int compare(Long o1, Long o2) {
        return -o1.compareTo(o2);
    }
};

public String[] getConcreteIndices(Set<String> set) {
    return client.admin().cluster().state(new ClusterStateRequest()).actionGet().getState().
            getMetaData().concreteIndices(set.toArray(new String[set.size()]));
}
```

TEST

```
@Test
public void rollingIndex() throws Exception {
    search.setClient(createTestClient());
    search.setIndexName("tweets");
    String rollIndexTag = search.getIndexName() + "_roll";
    String searchIndex = search.getIndexName() + "_search";
    String feedIndex = search.getIndexName() + "_feed";

    search.rollIndex(4);
    assertEquals(1, search.getAliases(rollIndexTag).size());
    assertEquals(1, search.getAliases(searchIndex).size());
    assertEquals(1, search.getAliases(feedIndex).size());

    // sleep so the next index gets a different second-resolution timestamp
    Thread.sleep(1000);
    search.rollIndex(4);
    assertEquals(2, search.getAliases(rollIndexTag).size());
    assertEquals(2, search.getAliases(searchIndex).size());
    assertEquals(1, search.getAliases(feedIndex).size());

    Thread.sleep(1000);
    search.rollIndex(4);
    assertEquals(3, search.getAliases(rollIndexTag).size());
    assertEquals(3, search.getAliases(searchIndex).size());
    assertEquals(1, search.getAliases(feedIndex).size());

    Thread.sleep(1000);
    search.rollIndex(4);
    assertEquals(4, search.getAliases(rollIndexTag).size());
    assertEquals(4, search.getAliases(searchIndex).size());
    assertEquals(1, search.getAliases(feedIndex).size());

    Thread.sleep(1000);
    search.rollIndex(4);
    assertEquals(4, search.getAliases(rollIndexTag).size());
    assertEquals(4, search.getAliases(searchIndex).size());
    assertEquals(1, search.getAliases(feedIndex).size());

    Thread.sleep(1000);
    search.rollIndex(search.getIndexName(), 4, 3);
    assertEquals(4, search.getAliases(rollIndexTag).size());
    assertEquals(3, search.getAliases(searchIndex).size());
    assertEquals(1, search.getAliases(feedIndex).size());
}
```

Peter.
