RFE: Making Rivers more useful


(derrickburns) #1

My understanding of an Elasticsearch River is that it simply pulls data into
an Elasticsearch index, but does nothing to manage the storage of the index
that it is filling.

I propose that a River model an ever-growing index with elastic storage and
elastic compute when deployed in a cloud environment. As the River
receives data, it puts the data into the current index, until the current
index reaches a certain size. When the index reaches a certain size, a new
index is opened.
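
To make the rollover rule above concrete, here is a minimal in-memory sketch. It does not talk to Elasticsearch at all; the class name RollingIndex, the "prefix-000001" naming scheme, and the use of a document count as the "certain size" are all assumptions made for illustration, not part of any existing River API.

```python
from dataclasses import dataclass, field


@dataclass
class RollingIndex:
    """In-memory model of a River that rolls to a new index at a size limit."""

    prefix: str    # hypothetical index name prefix
    max_docs: int  # hypothetical size threshold, counted in documents
    indices: dict = field(default_factory=dict)  # index name -> list of docs

    def __post_init__(self):
        self._open_new_index()

    def _open_new_index(self):
        # Name indices prefix-000001, prefix-000002, ... in creation order.
        name = f"{self.prefix}-{len(self.indices) + 1:06d}"
        self.indices[name] = []
        self.current = name

    def add(self, doc):
        # Roll over before writing if the current index is already full.
        if len(self.indices[self.current]) >= self.max_docs:
            self._open_new_index()
        self.indices[self.current].append(doc)
        return self.current
```

A real implementation would more likely measure size in bytes on disk, and this is also the point where new nodes or volumes could be requested before the new index is opened.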

A new index, like any index, can have a given fixed number of shards. When
a River allocates a new index, it also distributes the new shards to nodes.
A River could be configured to allocate new nodes and new storage, when on
a cloud. On AWS, the opening of a new index might be preceded by the
allocation of new EC2 nodes and new EBS volumes.

Finally, a River has a current index alias that lists the indices that are
searched when a query is received. When a new index is added to a River, it
is appended to the current index alias.
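
As a rough sketch of the alias step, the function below only builds the JSON body that the index aliases endpoint (POST /_aliases) accepts for an "add" action; it does not send anything. The function name and the index and alias names are hypothetical.

```python
import json


def add_to_alias_body(index, alias):
    """Build the request body that adds one index to an alias."""
    return {"actions": [{"add": {"index": index, "alias": alias}}]}


if __name__ == "__main__":
    # Hypothetical names; print the body a River could send on rollover.
    print(json.dumps(add_to_alias_body("logs-000002", "logs-current"), indent=2))
```

Since the actions list can hold several entries, adding the new index and removing an old one could go in a single request.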

The current alias could be parameterized by time, meaning its content, or
list of indices, could be filtered by the "current alias filter." The
current alias filter would specify a set of constraints, say a "maximum
age" and a date field to use to ascertain document age. If an index has
no documents that are younger than the given maximum age, the index
is taken off the alias list and, optionally, closed.
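
The age rule can be sketched as follows, assuming the River tracks, per index, the timestamp of its newest document (taken from the configured date field). An index has no documents younger than the maximum age exactly when its newest document is older than that age. The function name and the input mapping are hypothetical.

```python
from datetime import datetime, timedelta


def split_by_age(newest_doc_time, max_age, now):
    """Split index names into (keep, drop) lists for the current alias.

    newest_doc_time maps index name -> datetime of its newest document.
    """
    cutoff = now - max_age
    keep = [name for name, newest in newest_doc_time.items() if newest >= cutoff]
    drop = [name for name, newest in newest_doc_time.items() if newest < cutoff]
    return keep, drop
```

The indices in the drop list would be removed from the alias and, if configured, closed.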

Perhaps the River already does this, or is already conceived in this
manner.

Thoughts, Shay?


(Karussell) #2

Here is some code for the rolling index part:

Peter.

On 18 Jan., 03:37, Derrick derrickrbu...@gmail.com wrote:


