So as Rafal mentioned, this is not about monitoring.
Here's the situation:
We are building a massive ES cluster (hundreds of big servers,
possibly even over 1000 of them on day one). The volume of data is
huge and it just keeps coming. Thus, the cluster just needs to keep
expanding and expanding. We need to be able to search the last N days
(e.g. 30) worth of content, but also the whole index. Because of the
scale of things we want to make sure we maximize hardware utilization
by spreading everything veeeeeeery evenly.
The way to visualize this is as a grid where rows represent days 1-30
and columns represent shards and replicas.
Assuming a single server can hold half a day's worth of data, each day
we'll create a new index with 2 shards.
Each of these shards will live on a separate box.
In addition, we'll have N replicas, say a replica for each of those 2
shards.
You can visualize the above as a row of 4 servers: 2 servers for the 2
shards, and 1 server for the replica of each of those 2 shards - 4 servers
in total.
So on day 1 we'll have a row of 4 boxes.
On day 2 another row of 4 boxes.
and so on...
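The grid above can be sketched in a few lines of Python. This is only an illustration using the numbers from this message (2 shards per day, 1 replica per shard); the function names and the box naming scheme are made up for the example and are not anything ES provides.

```python
def row_for_day(day, shards=2, replicas=1):
    """Return the hypothetical boxes that make up one day's row of the grid.

    Every shard copy (primary or replica) gets its own box, so a row
    holds shards * (1 + replicas) boxes.
    """
    boxes = []
    for shard in range(shards):
        # copy 0 is the primary, copies 1..replicas are its replicas
        for copy in range(1 + replicas):
            role = "primary" if copy == 0 else "replica%d" % copy
            boxes.append("day%02d-shard%d-%s" % (day, shard, role))
    return boxes


def boxes_needed(days, shards=2, replicas=1):
    """Total number of boxes for a window of `days` daily indices."""
    return days * shards * (1 + replicas)
```

With these defaults, row_for_day(1) returns 4 box names and boxes_needed(30) is 120, i.e. 30 rows of 4 boxes.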
Because data is coming in continuously, we thought it may be best to
have a separate "process" (separate from the indexer) that pre-creates
indices on specific machines (using shard/replica include/exclude
allocation tags). For example, some time on Monday it would pre-create
an index on the row of boxes that we want holding Tuesday's
index. This really could be a completely standalone/separate process
running on one of the ES nodes or externally, but then we have a SPOF,
so we thought we'd implement this as a River, so that ES can manage it
and thus eliminate this SPOF.
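To make the pre-creation step concrete, here is a rough sketch of the request such a process might build for tomorrow's index. It assumes the boxes in each row were started with a custom node attribute (for example node.tag: row-2012-02-14 in their config) so that the index-level setting index.routing.allocation.include.tag can restrict the index to that row. The index name pattern, the tag values, and the function name are all hypothetical, and the sketch only builds the request; it does not send it.

```python
import json
from datetime import date, timedelta


def precreate_request(today, shards=2, replicas=1):
    """Build (method, path, body) for creating tomorrow's index.

    The "logs-" prefix and the "row-" tag scheme are assumptions made
    for this example only.
    """
    tomorrow = today + timedelta(days=1)
    index_name = "logs-" + tomorrow.strftime("%Y.%m.%d")
    row_tag = "row-" + tomorrow.isoformat()
    body = {
        "settings": {
            "index.number_of_shards": shards,
            "index.number_of_replicas": replicas,
            # only nodes whose custom "tag" attribute matches may hold this index
            "index.routing.allocation.include.tag": row_tag,
        }
    }
    return "PUT", "/" + index_name, json.dumps(body, sort_keys=True)
```

Run on Monday 2012-02-13, this yields a create-index request for /logs-2012.02.14 pinned to the row tagged row-2012-02-14. Note that the tag limits the index as a whole to that row; which box each shard or replica lands on within the row is still up to ES.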
All the Rivers I see on the ES site are for indexing content.
So I think the question is whether Rivers are purely meant for
indexing or if they can be used for anything one wants to run within ES.
On Feb 9, 3:25 am, Shay Banon kim...@gmail.com wrote:
Can you explain a bit more what it does? You mean cyclical operations on the cluster as in repeating operations, like gathering stats? If so, then I would implement it differently. Node level stats can be reported for each node as a simple standalone (node level) service, and index level stats can be gathered from the master node (you can tell if the node is master or not by registering for cluster events).
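One way to read the master-node suggestion: every node runs the same periodic task, but only the currently elected master actually does the work, which gives singleton behavior without a river. A minimal sketch, assuming a dict shaped like the cluster state response (which carries a master_node id); local_node_id, the function names, and the task callable are hypothetical, and a real plugin would more likely react to cluster events inside ES than inspect state like this.

```python
def is_local_node_master(cluster_state, local_node_id):
    """True if the cluster state names this node as the elected master."""
    return cluster_state.get("master_node") == local_node_id


def run_if_master(cluster_state, local_node_id, task):
    """Run `task` only on the master; every other node does nothing."""
    if is_local_node_master(cluster_state, local_node_id):
        return task()
    return None
```

If the master fails, another node is elected and the same check starts passing there, so the task moves with the master.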
On Wednesday, February 8, 2012 at 10:51 PM, Rafał Kuć wrote:
We developed a custom river plugin to perform some cyclical operations
on the cluster. What I'm worried about is that rivers are designed for
data indexing, or at least that's what the Elasticsearch
documentation says about them. My question is whether the approach we took
is a good one or whether we should change it. One more thing - we need our
plugin to be a singleton inside a cluster (just like a river is), so that
the same things are not done again and again.