For rivers, the singleton property of the plugin helps guarantee that no data is
indexed twice. There is a switchover mechanism for the case where a node fails
while a river is running: the river should switch over to another node, but the
internal river state cannot be detected automatically by ES, and is rather
undefined in such a case.
Each river plugin needs to implement its own rollback mechanism after a
switchover (e.g. via a support index where the last processed state is stored).
So the first challenge for a singleton is to ensure that no event is dropped
while performing, and resuming from, a switchover.
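A rollback mechanism of that kind can be sketched roughly as follows (illustrative Python, not the ES river API; the dict stands in for a support index, the list for the target index, and all names are hypothetical):

```python
class CheckpointedRiver:
    """Sketch of a river that records the last processed sequence number
    in a durable store, so a new singleton instance started after a
    switchover can resume without dropping or re-indexing events."""

    def __init__(self, store, sink):
        self.store = store  # stands in for the "support index"
        self.sink = sink    # stands in for the target index

    def run(self, events):
        # Resume from the last durable checkpoint (-1 on first start).
        last = self.store.get("last_seq", -1)
        for seq, event in events:
            if seq <= last:
                continue  # already indexed before the switchover
            self.sink.append(event)
            self.store["last_seq"] = seq  # checkpoint after each event


# Usage: run, "fail", then resume with a fresh instance.
store, sink = {}, []
CheckpointedRiver(store, sink).run(enumerate(["a", "b"]))
CheckpointedRiver(store, sink).run(enumerate(["a", "b", "c"]))
print(sink)  # no duplicates: ['a', 'b', 'c']
```

The key point is that the checkpoint must be written to durable, cluster-visible storage, because the instance that resumes is not the one that failed.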
Moreover, a singleton will soon become a bottleneck. Such an instance needs
to examine the whole cluster for each and every event that occurs. Most
events appear per index/shard per node, so the growth is more than linear.
A better solution for pushing events would be distributed "event pumps" or
event sources right at the place where the events appear.
The few global cluster events of interest are rare (node join/split,
mapping updates) and appear at the master node. Many events are node-related
(stats and info about indexes and shards); these should be created at each
node and transported, through a suitable channel over the node network, to
exactly the node that has clients subscribed to that kind of event (which is
where websockets come into play).
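One way to picture such an event pump is a per-kind routing table, so that a node-level event travels only to nodes that have subscribed clients. A minimal sketch (plain Python; class, node, and event names are made up for illustration):

```python
from collections import defaultdict

class EventRouter:
    """Routes an event kind only to the nodes that registered
    subscribed clients for it; nothing is broadcast cluster-wide."""

    def __init__(self):
        self.subscriptions = defaultdict(set)  # event kind -> node ids
        self.delivered = defaultdict(list)     # node id -> events (the "channel")

    def subscribe(self, node_id, kind):
        self.subscriptions[kind].add(node_id)

    def publish(self, source_node, kind, payload):
        # Transport the event only to nodes that asked for this kind.
        for node_id in self.subscriptions[kind]:
            self.delivered[node_id].append((source_node, kind, payload))


router = EventRouter()
router.subscribe("node-3", "shard_stats")
router.publish("node-1", "shard_stats", {"shard": 0, "docs": 42})
router.publish("node-2", "mapping_update", {"index": "logs"})  # no subscriber: dropped
print(router.delivered["node-3"])
```

Because events that nobody subscribed to are never transported, the per-event work stays local to the producing node instead of growing with cluster size.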
On Monday, September 10, 2012 12:44:00 AM UTC+2, Vinicius Carvalho wrote:
Hi Jörg, thanks for the help. But as I said, we want the daemon to run from
inside ES; that's why rivers seemed a bit more appropriate to me. Maybe one
day we'll have a cluster-wide singleton plugin and I could move in that
direction. I think I'll follow Otis' advice and consider the river something
more than just a component to index data inside ES. BTW, I'm using your awesome
jdbc river as a template to get mine running.
On Sunday, September 9, 2012 4:58:25 AM UTC-4, Jörg Prante wrote:
rivers are always for indexing from external sources, see
To solve the problem with polling, I implemented a websocket transport
plugin. Websockets are full-duplex communication channels. Such channels
allow pushing events from Elasticsearch to clients without the client
having to initiate a request.
With websockets, distributed asynchronous client/server architectures, also
known as pubsub (publish/subscribe), become possible. See also ticket
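The push model behind pubsub can be sketched in a few lines (plain Python, no real websockets; the callbacks stand in for connected websocket clients, and all names are hypothetical):

```python
class PubSubChannel:
    """Minimal pubsub sketch: clients register callbacks (as a connected
    websocket client would) and the server pushes events to them,
    instead of the clients polling for updates."""

    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Push: the server calls the client, not the other way around.
        for callback in self.subscribers.get(topic, []):
            callback(message)


received = []
channel = PubSubChannel()
channel.subscribe("index_stats", received.append)
channel.publish("index_stats", {"docs": 100})
channel.publish("cluster_state", {"nodes": 3})  # nobody subscribed: no delivery
print(received)  # [{'docs': 100}]
```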
I'm in the process of completing it and adding features; a tutorial is
On Sunday, September 9, 2012 4:18:11 AM UTC+2, Vinicius Carvalho wrote:
My question is simple: Is it ok to run a river to dump information from
ES to other places?
Below is the rationale behind it:
Here's my dilemma: I'm creating a plugin to dump index stats into cube (
http://square.github.com/cube/). But I found out that only rivers are
cluster singletons. Since I don't want each node dumping stats data, a
river seems to be the way to do it.
I know this sounds silly, but it just seems that rivers should only be
used to index data, and I'm trying to do the opposite: get stats data and
send it elsewhere. I know I could create a daemon elsewhere to poll for
this data, but I think it will be simpler to have this built into our ES.
I tried bigdesk (could not get it to work though), but we also need to
have the information persisted; I would gladly integrate with bigdesk if
that's the case.