Is there any built-in way to ensure that rivers are equally distributed across the nodes of an Elasticsearch cluster? I have around 100 rivers pulling data from different sources, and I would like to avoid the situation where they all run on the same node (I want to distribute the load across the cluster).
As I understand it, a river will try to run on the node on which it was created. Is it possible (from inside a plugin) to force creation of the river on another node? (The Client object obtained from inside a plugin only talks to the current (local) node.)
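For context, a river is registered by indexing a _meta document into the special _river index; the registration request itself does not specify a node. A minimal example, using the built-in dummy river type:

# registering a river: index a _meta document into the _river index
# (this request does not say which node will run the river)
curl -XPUT 'localhost:9200/_river/my_river/_meta' -d '{
  "type" : "dummy"
}'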
Rivers are designed to be a cluster singleton - so each river only runs on one node. The client object is connected to the local node. It might make more sense to have another component in front which aggregates your data - maybe Logstash, maybe self-written components...
--Alex
Distributing gatherers across the nodes of a cluster goes far beyond the original river concept.
FYI, I am working on an uber-plugin which can distribute river-like mini jobs in a scatter/gather manner. Example: install one plugin, execute 100 SQL statements over the whole cluster, index the docs in ES, and coordinate the results. This is not really much fun with the JDBC river at the moment. My aim is to become independent of the river plugin method, because rivers are going to be deprecated some day in the near future.
I didn't know that the river plugin will be deprecated. Suddenly I'm very interested in your uber-plugin, and the distributed gatherers concept is exactly what I need.
Can you tell me more about this plugin? Will it be distributed with Elasticsearch or as a community contribution? Do you already have a repository with some beta (or alpha) version that I can eyeball?
thx,
Karol
Rivers were once introduced for demo purposes, to quickly load some data into ES and build showcases from Twitter or Wikipedia data. The Elasticsearch team is now in favor of Logstash.
I started this gatherer plugin for my use cases where I am not able to use Logstash. I have very complex streams, e.g. ISO 2709 record formats with some hundred custom transformations in the data, which I reduce to primitive key/value streams and RDF triples. I also plan to build RDF feeds for semantic web/linked data platforms, where ES is the search engine.
The gatherer "uber" plugin should work like this:
it can be installed on one or more nodes and provides a common bulk
indexing framework
a gatherer plugin registers in the cluster state (on node level)
there are standard capabilities, but a gatherer plugin capability can be
extended in a live cluster by submitting code for inputs, codecs, and
filters, picked up by a custom class loader (for example, JDBC, and a
driver jar, and tabular key/value output)
a gatherer plugin is idling, and accepts jobs in form of JSON commands
(defining the selection of inputs, codecs, and filters), for example, an
SQL command
if a gatherer is told to distribute the jobs fairly and is too busy
(active job queue length), it forwards them to other gatherers (other
methods are crontab-like scheduling), and the results of the jobs (ok,
failed, retry) are registered also in the cluster state (maybe an internal
index is better because there can be tens of thousands such jobs)
a client can ask for the state of all the gatherers and all the job
results
all jobs can be partitioned and processed in parallel for maximum
throughput
the gatherer also creates metrics/statistics of the jobs successfully done
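To make that concrete, a JSON job command for such a gatherer might look roughly like the sketch below. This is purely hypothetical - the plugin does not exist yet, so the _gatherer endpoint and all field names are invented for illustration:

# hypothetical API: neither the _gatherer endpoint nor these fields exist (yet)
curl -XPOST 'localhost:9200/_gatherer/job' -d '{
  "input" : {
    "jdbc" : {
      "url" : "jdbc:mysql://localhost:3306/sales",
      "sql" : "select * from orders"
    }
  },
  "codec" : "table",
  "filter" : [ "to_keyvalue" ],
  "target" : { "index" : "orders", "type" : "order" },
  "distribute" : true
}'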
Another thing I find important is enabling scripting for processing the data streams (JSR 223 scripting, especially Groovy, Jython, JRuby, Rhino/Nashorn).
Right now there is no repo; I plan to kickstart one in early 2014.
Looking forward to seeing what you come up with, Jörg. I can tell you've been thinking about this for a while.
Justin
Yep, it sounds great; I can't wait to see a beta version to play with.
I took a quick look at Logstash, but it is not exactly what I want, and it feels too resource-heavy to install an additional framework on every node (or to run dedicated nodes for it).
It would be nice if the Elasticsearch team could extract the river functionality into some kind of plugin and contribute it to the community (before deprecating it), so that anyone who still wants to use rivers will be able to.
Cheers,
Karol
I use allocation in my RabbitMQ river setup to ensure that there's always a river up and consuming messages from RabbitMQ, regardless of whether a RabbitMQ or Elasticsearch node goes down. On each Elasticsearch node, a separate river document is created for each RabbitMQ cluster node and restricted to run only on that node via allocation.
So the _river index ends up with multiple documents for each Elasticsearch node:
logger01-1 -> mq01
logger01-2 -> mq02
logger02-1 -> mq01
logger02-2 -> mq02
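For illustration, one of those river documents might look like the following sketch. The host, credentials, and queue name are placeholders, and the allocation restriction is omitted because it depends on the cluster setup:

# one river document per RabbitMQ node (names follow the mapping above)
curl -XPUT 'localhost:9200/_river/logger01-1/_meta' -d '{
  "type" : "rabbitmq",
  "rabbitmq" : {
    "host" : "mq01",
    "port" : 5672,
    "user" : "guest",
    "pass" : "guest",
    "vhost" : "/",
    "queue" : "elasticsearch"
  }
}'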
It works well in my situation, though I have a small cluster. Having it Puppet-ized helps deal with scaling out new nodes.
byron
Hi Jörg!
Do you have any update on your über-plugin?
I'm really interested, because I want to connect my MS SQL database to Elasticsearch.
I can't modify the software, but I want to have a "near real time" integration between my MS SQL database and ES.
So I hope I can use your work.
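In the meantime, the existing JDBC river can already poll a SQL Server database, which may cover the near-real-time case. A minimal sketch, assuming the Microsoft JDBC driver jar has been added to the plugin's classpath; the connection details and SQL are placeholders:

# placeholder connection details; see the JDBC river docs for scheduling options
curl -XPUT 'localhost:9200/_river/my_mssql_river/_meta' -d '{
  "type" : "jdbc",
  "jdbc" : {
    "driver" : "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" : "jdbc:sqlserver://localhost:1433;databaseName=mydb",
    "user" : "es_reader",
    "password" : "secret",
    "sql" : "select * from orders"
  }
}'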