How to ensure that rivers are equally distributed across nodes in the cluster


(Karol Gwaj) #1

Hi,

is there any built-in way to ensure that rivers are equally distributed
across nodes in the Elasticsearch cluster?
(I have around 100 rivers pulling data from different sources, and I would
like to avoid the situation where they all end up running on the same node;
I want to distribute the load across the cluster.)

As I understand it, a river will try to run on the node on which it was
created. Is it possible (from inside a plugin) to force creation of the
river on another node? (The Client object obtained from inside a plugin
only talks to the current, local node.)

Thanks for any help,

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0facc47e-39b3-4b10-a7eb-faf847516693%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Justin Doles) #2

This is as close as you can get:
http://www.elasticsearch.org/guide/en/elasticsearch/rivers/current/index.html#allocation
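For reference, the allocation control described on that page works through per-node settings in elasticsearch.yml, roughly like this (from memory, so double-check the linked docs for your version; the river names are placeholders):

```yaml
# elasticsearch.yml

# keep this node free of rivers entirely
node.river: _none_

# or allow only the named rivers to run on this node
# node.river: my_river_1,my_river_2
```

With a different allow-list on each node, you can spread your 100 rivers across the cluster by hand, but there is no automatic balancing.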



(Alexander Reelsen) #3

Hey,

Rivers are designed to be a cluster singleton, so they only run on one
node. The client object is connected to the local node. It might make
more sense to have another component in front which aggregates your
data: maybe Logstash, maybe self-written components...

--Alex
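For what it's worth, a minimal Logstash (1.x-era) pipeline along the lines Alexander suggests might look like this; the file path and host are placeholders, and the exact output plugin name depends on your Logstash version:

```
input {
  file {
    path => "/var/log/app/*.log"
  }
}
output {
  elasticsearch_http {
    host => "localhost"
  }
}
```

Running one such agent per source machine distributes the ingestion work outside the ES cluster entirely.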



(Jörg Prante) #4

Distributing gatherers across the nodes of a cluster goes well beyond the
original river concept.

FYI, I am working on an uber-plugin which can distribute river-like mini
jobs in a scatter/gather manner. Example: install one plugin, execute 100
SQL statements over the whole cluster, index the resulting docs in ES, and
coordinate the results. This is not really much fun with the JDBC river at
the moment.

My aim is to become independent of the river plugin mechanism, because
rivers are going to be deprecated some day in the near future.

Jörg



(Karol Gwaj) #5

Hi Jörg,

I didn't know that the river plugin will be deprecated; suddenly I'm very
interested in your uber-plugin :slight_smile: and the distributed
gatherers concept is exactly what I need.
Can you tell me more about this plugin?
Will it be distributed with Elasticsearch or as a community contribution?
Do you already have a repository with some beta (or alpha) version that I
can eyeball?

Thanks,
Karol



(Jörg Prante) #6

Rivers were once introduced for demo purposes, to quickly load some data
into ES and build showcases from Twitter or Wikipedia data.

The Elasticsearch team is now in favor of Logstash.

I started this gatherer plugin for my use cases where I am not able to use
Logstash. I have very complex streams, e.g. ISO 2709 record formats with
some hundred custom transformations in the data, which I reduce to primitive
key/value streams and RDF triples. I also plan to build RDF feeds for
semantic web/linked data platforms, with ES as the search engine.

The gatherer "uber" plugin should work like this:

  • it can be installed on one or more nodes and provides a common bulk
    indexing framework

  • a gatherer plugin registers in the cluster state (on node level)

  • there are standard capabilities, but a gatherer plugin's capabilities can
    be extended in a live cluster by submitting code for inputs, codecs, and
    filters, picked up by a custom class loader (for example JDBC, with a
    driver jar and tabular key/value output)

  • a gatherer plugin idles, and accepts jobs in the form of JSON commands
    (defining the selection of inputs, codecs, and filters), for example an
    SQL command

  • if a gatherer is told to distribute jobs fairly and is too busy
    (measured by active job queue length), it forwards them to other gatherers
    (other methods include crontab-like scheduling); the results of the jobs
    (ok, failed, retry) are also registered in the cluster state (maybe an
    internal index is better, because there can be tens of thousands of such
    jobs)

  • a client can ask for the state of all the gatherers and all the job
    results

  • all jobs can be partitioned and processed in parallel for maximum
    throughput

  • the gatherer also creates metrics/statistics of the jobs successfully done

Another thing I find important is to enable scripting for processing the
data streams (JSR 223 scripting, especially Groovy, Jython, JRuby,
Rhino/Nashorn).
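The fair-distribution rule in the bullets above can be sketched in a few lines. This is purely illustrative: `Gatherer`, `submit`, and the queue-length threshold are invented for the sketch and are not part of any plugin API.

```python
# Sketch of the "forward when too busy" rule: a gatherer accepts a job
# while its active queue is below a threshold; otherwise it hands the
# job to the least-busy peer (if that peer really is less loaded).
from dataclasses import dataclass, field


@dataclass
class Gatherer:
    name: str
    max_active: int = 2                      # hypothetical busy threshold
    queue: list = field(default_factory=list)

    def busy(self) -> bool:
        return len(self.queue) >= self.max_active


def submit(job, gatherer, peers):
    """Place `job` on `gatherer`, forwarding to the least-busy peer
    when the local active-job queue is already full."""
    if gatherer.busy() and peers:
        target = min(peers, key=lambda g: len(g.queue))
        if len(target.queue) < len(gatherer.queue):
            target.queue.append(job)
            return target.name
    gatherer.queue.append(job)
    return gatherer.name
```

In a real cluster, the queue lengths would come from the job state registered in the cluster state (or an internal index) rather than from local objects.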

Right now there is no repo; I plan to kickstart one in early 2014.

Jörg



(Justin Doles) #7

Looking forward to seeing what you come up with, Jörg. I can tell you've
been thinking about this for a while.

Justin



(Karol Gwaj) #8

Yep, it sounds great; can't wait to see some beta version to play with.

I had a quick look at Logstash, but it is not exactly what I want, and it
feels too resource-heavy for me to install an additional framework on every
node (or to have dedicated nodes for it).

It would be nice if the Elasticsearch team could extract the river
functionality into some kind of plugin and contribute it to the community
(before deprecating it), so that anyone who still wants to use rivers will
be able to.

Cheers,
Karol



(Jörg Prante) #9

Found the link where you can read about the "deprecation" of the river
concept, and the encouragement to write individual bulk indexing code:

http://www.linkedin.com/groups/Official-guide-writing-ElasticSearch-rivers-3393294.S.268274223

Jörg



(byron.pezan) #10

I use this in my RabbitMQ river setup to ensure that there's always a river
up and consuming messages from RabbitMQ, regardless of whether a RabbitMQ
or Elasticsearch node goes down. On each Elasticsearch node, a separate
river document is created for each RabbitMQ cluster node and restricted,
via allocation, to run only on that node.

So the _river index will end up with multiple documents for each
Elasticsearch node:
logger01-1 -> mq01
logger01-2 -> mq02
logger02-1 -> mq01
logger02-2 -> mq02

It works well in my situation, though I have a small cluster. Having it
Puppet-ized helps deal with scaling out new nodes.

byron
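A river document in this scheme would look roughly as follows. This is a sketch based on the rabbitmq-river `_meta` format as I recall it; the host and queue names are placeholders, and per-node restriction comes from each node's `node.river` allow-list:

```
PUT /_river/logger01-1/_meta
{
  "type" : "rabbitmq",
  "rabbitmq" : {
    "host" : "mq01",
    "queue" : "elasticsearch"
  }
}
```

One such document per (Elasticsearch node, RabbitMQ node) pair yields the mapping listed above.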



(loïc moriamé) #11

Hi Jörg!
Do you have any update on your über-plugin?

I'm really interested, because I want to connect my MSSQL DB with
Elasticsearch.
I can't modify the software, but I want to have a "near real time"
integration between my MSSQL DB and ES.

So I hope I can use your work.



(Jörg Prante) #12

Hi Loïc,

the gatherer plugin is still very early (pre-alpha) and not ready, because
I work on it in my spare time.

Jörg


