How to ensure that rivers are equally distributed across nodes in the cluster

Karol_Gwaj · December 26, 2013, 3:35pm

Yep, it sounds great. I can't wait to see a beta version to play with.

I took a quick look at Logstash, but it is not exactly what I want, and it
feels too resource-heavy for me to install an additional framework on
every node (or to have dedicated nodes for it).

It would be nice if the Elasticsearch team could extract the river
functionality into some kind of plugin and contribute it to the community
before deprecating it, so that anyone who still wants to use rivers can.

Cheers,
Karol

On Thursday, December 26, 2013 12:37:36 PM UTC, Jörg Prante wrote:

Rivers were originally introduced for demo purposes, to quickly load some
data into ES and build showcases from Twitter or Wikipedia data.

The Elasticsearch team is now in favor of Logstash.

I am starting this gatherer plugin for my use cases where I am not able to
use Logstash. I have very complex streams, e.g. ISO 2709 record formats with
some hundred custom transformations in the data, that I reduce to primitive
key/value streams and RDF triples. I also plan to build RDF feeds for
semantic web/linked data platforms, where ES is the search engine.

The gatherer "uber" plugin should work like this:

it can be installed on one or more nodes and provides a common bulk
indexing framework

a gatherer plugin registers in the cluster state (on node level)

there are standard capabilities, but a gatherer plugin capability can be
extended in a live cluster by submitting code for inputs, codecs, and
filters, picked up by a custom class loader (for example, JDBC, and a
driver jar, and tabular key/value output)

a gatherer plugin idles and accepts jobs in the form of JSON commands
(defining the selection of inputs, codecs, and filters), for example, an
SQL command

if a gatherer is told to distribute jobs fairly and is too busy (judged
by the length of its active job queue), it forwards them to other
gatherers (other methods include crontab-like scheduling); the results of
the jobs (ok, failed, retry) are also registered in the cluster state (an
internal index may be better, because there can be tens of thousands of
such jobs)

a client can ask for the state of all the gatherers and all the job
results

all jobs can be partitioned and processed in parallel for maximum
throughput

the gatherer also produces metrics/statistics for the jobs that
completed successfully
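
To make the "distribute fairly" idea above concrete, here is a minimal sketch in Python. It is not taken from any actual gatherer code: the Gatherer class, the submit function, the max_queue threshold, and the job fields ("input", "sql", "distribute") are all hypothetical names chosen for illustration. It only shows one plausible rule: a busy gatherer forwards a job to the peer with the shortest queue.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Gatherer:
    """Hypothetical stand-in for one gatherer plugin instance on a node."""
    name: str
    max_queue: int = 2  # assumed threshold for "too busy"
    queue: deque = field(default_factory=deque)

    def is_busy(self) -> bool:
        # "Busy" is judged only by the active job queue length.
        return len(self.queue) >= self.max_queue


def submit(job: dict, target: Gatherer, peers: list) -> str:
    """Queue the job on target, or forward it if target is too busy.

    Returns the name of the gatherer that accepted the job.
    """
    chosen = target
    if job.get("distribute") == "fair" and target.is_busy():
        candidates = [p for p in peers if p is not target]
        if candidates:
            least = min(candidates, key=lambda g: len(g.queue))
            # Forward only if the peer really has a shorter queue.
            if len(least.queue) < len(target.queue):
                chosen = least
    chosen.queue.append(job)
    return chosen.name


# Usage: every job is sent to node-1, which forwards once it is busy.
nodes = [Gatherer("node-1"), Gatherer("node-2"), Gatherer("node-3")]
job = {"input": "jdbc", "sql": "select * from orders", "distribute": "fair"}
accepted = [submit(dict(job, id=i), nodes[0], nodes) for i in range(6)]
queue_lengths = [len(n.queue) for n in nodes]
```

In this sketch all six jobs are submitted to node-1, yet each node ends up with two queued jobs. A real plugin would also need to record job results (ok, failed, retry) somewhere durable, which the sketch leaves out.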

Another thing I find important is to enable scripting for processing the
data streams (JSR 223 scripting, especially Groovy, Jython, JRuby,
Rhino/Nashorn).

Right now there is no repo; I plan to kickstart one in early 2014.

Jörg
