"Rivers" & Flume

I still need to drill into the sources and docs about the new River stuff,
but I'm very intrigued. I just wanted to sound out whether I'm on the right
track here, and whether Cloudera's Flume may also be an integration point.

From the posts I've read, I can see how rabbitmq et al can be used as a
producer/consumer pipeline pushing updates into ES. For a use case we have
with a tertiary data centre on the other side of the planet, I'm thinking
that instead of writing messages to a queue, we simply write the updates to
a log file, have Flume reliably deliver these from multiple hosts over to
the tertiary, and have something like my elasticflume Sink apply them. Is
this sort of what Rivers are all about? A Source/Sink pattern?

Paul

"Rivers" are elements that are managed within the elasticsearch cluster.
Only a single instance of a river (per "name") runs within the cluster, and
if the node it runs on fails, that river is recreated on another node.
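
To make the "one river per name, with failover" idea concrete, here is a toy
sketch in Python. It is not how elasticsearch allocates rivers internally;
ToyCluster, register_river and fail_node are hypothetical names invented for
this illustration, and "pick the first live node" is an assumed placement
rule.

```python
class ToyCluster:
    """Toy model: each river name is owned by exactly one live node."""

    def __init__(self, nodes):
        self.live_nodes = list(nodes)
        self.river_owner = {}  # river name -> node currently running it

    def register_river(self, name):
        # Only one instance per name: registering again keeps the owner.
        if name not in self.river_owner:
            self.river_owner[name] = self.live_nodes[0]
        return self.river_owner[name]

    def fail_node(self, node):
        # Drop the node, then move every river it owned to a surviving node.
        self.live_nodes.remove(node)
        for name in list(self.river_owner):
            if self.river_owner[name] == node:
                # Assumed rule: first live node, or None if none are left.
                self.river_owner[name] = (
                    self.live_nodes[0] if self.live_nodes else None
                )


cluster = ToyCluster(["node-a", "node-b"])
first_owner = cluster.register_river("my_rabbit_river")
second_owner = cluster.register_river("my_rabbit_river")  # same name, same owner
cluster.fail_node("node-a")
owner_after_failure = cluster.river_owner["my_rabbit_river"]
```

In this sketch the second registration does not start a second instance, and
after node-a fails the river ends up owned by node-b.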

A river makes sense when you need to write the "bridging" code yourself,
that is, code that pulls data (or has data pushed to it) and applies it to
elasticsearch. The best example is rabbitmq (or any messaging system), where
you would need to run something anyway that listens for messages from the
queue and applies them to elasticsearch. In this case, it makes sense to run
this within the elasticsearch cluster, benefiting mainly from the fact that
it will only run when the cluster is up, and from the failover support for
rivers. The Twitter stream is another good example of a river.
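
The kind of "bridging" code described above can be sketched roughly as
follows. This is only an illustration: Python's standard queue.Queue stands
in for rabbitmq, a plain dict stands in for an elasticsearch index, and the
message format (JSON with "op", "id" and "source" keys) is an assumption made
up for this example, not the format of any real river.

```python
import json
import queue


def run_bridge(messages, index):
    """Drain the queue, applying each JSON message to the in-memory index.

    Returns the number of messages applied.
    """
    applied = 0
    while True:
        try:
            raw = messages.get_nowait()
        except queue.Empty:
            return applied
        message = json.loads(raw)
        if message["op"] == "index":
            index[message["id"]] = message["source"]
        elif message["op"] == "delete":
            index.pop(message["id"], None)
        applied += 1


# Stand-in for the message queue, pre-loaded with three updates.
pending = queue.Queue()
pending.put(json.dumps({"op": "index", "id": "1", "source": {"user": "paul"}}))
pending.put(json.dumps({"op": "index", "id": "2", "source": {"user": "shay"}}))
pending.put(json.dumps({"op": "delete", "id": "1"}))

fake_index = {}
applied_count = run_bridge(pending, fake_index)
```

A real bridge would block waiting for new messages rather than return when
the queue is empty; the loop here stops so that the example terminates.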

As for flume, I am not too familiar with it. It looks like it's something
that you manage yourself (flume instances?), and those instances basically
push data into sinks. With this model, it does not make much sense to have
flume as a river.

I think a good example that covers both cases and might explain this is
couchdb. In couchdb, you can register for a stream of _changes happening on
it. In this case, a couchdb river can be written that registers for changes
and applies them. On the other hand, you can write hooks into couchdb that
will be called from within couchdb when things change. In this scenario, a
river does not really make sense, as simply plugging in a hook that calls
elasticsearch is a good solution.
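
The two couchdb cases can be contrasted with a small sketch. No couchdb or
elasticsearch is contacted here: the feed below is a hand-written sample
modeled on the shape of a _changes response requested with
include_docs=true, a dict stands in for the index, sequence numbers are
assumed to be integers, and apply_changes and on_document_changed are
hypothetical names for this illustration.

```python
def apply_changes(feed, index, since=0):
    """Pull model (river-like): apply rows newer than `since`.

    Returns the new checkpoint to resume from next time.
    """
    for row in feed["results"]:
        if row["seq"] <= since:
            continue  # already applied on an earlier pass
        if row.get("deleted"):
            index.pop(row["id"], None)
        else:
            index[row["id"]] = row["doc"]
    return feed["last_seq"]


def on_document_changed(doc_id, doc, index):
    """Push model (hook-like): the source system calls this itself."""
    if doc is None:
        index.pop(doc_id, None)
    else:
        index[doc_id] = doc


sample_feed = {
    "results": [
        {"seq": 1, "id": "a", "doc": {"_id": "a", "title": "first"}},
        {"seq": 2, "id": "b", "doc": {"_id": "b", "title": "second"}},
        {"seq": 3, "id": "a", "deleted": True},
    ],
    "last_seq": 3,
}

pulled_index = {}
checkpoint = apply_changes(sample_feed, pulled_index)

pushed_index = {}
on_document_changed("b", {"_id": "b", "title": "second"}, pushed_index)
```

In the pull model something has to keep running and remember the checkpoint,
which is the job a river takes on. In the push model the source does the
calling, so there is nothing for the cluster to manage.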

It feels like flume fits the second couchdb case.

-shay.banon

On Wed, Sep 22, 2010 at 1:02 AM, Paul Smith tallpsmith@gmail.com wrote: