Should rivers only index information?


(Vinicius Carvalho) #1

Hi there!

My question is simple: Is it ok to run a river to dump information from ES
to other places?

Below is the rationale behind it:

Here's my dilemma: I'm creating a plugin to dump index stats into cube
(http://square.github.com/cube/). But I found out that only rivers are
cluster singletons. Since I don't want to have each node dumping stats
data, a river seems to be the way to do it.

I know this sounds silly, but it seems that Rivers should only
be used to index data, and I'm trying to do the opposite :) get stats data
and send it elsewhere. I know I could create a daemon elsewhere to poll for
this data, but I think it would be simpler to have this built into our ES
nodes.
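
For comparison, the external-daemon alternative mentioned above could look roughly like the sketch below. It is only an illustration under stated assumptions: the ES and Cube addresses, the event type name `es_index_stats`, and the selected stats fields are all hypothetical choices, and the layout of the `_stats` response and the Cube collector endpoint should be checked against the versions actually in use.

```python
import json
import urllib.request
from datetime import datetime, timezone

ES_STATS_URL = "http://localhost:9200/_stats"         # assumed ES address
CUBE_PUT_URL = "http://localhost:1080/1.0/event/put"  # assumed Cube collector address


def stats_to_cube_events(stats, now=None):
    """Turn an index-stats response into one Cube-style event per index."""
    now = now or datetime.now(timezone.utc)
    # Assumption: depending on the ES version, per-index stats sit either at
    # the top level or under "_all"; accept both layouts.
    indices = stats.get("indices") or stats.get("_all", {}).get("indices", {})
    events = []
    for name, body in sorted(indices.items()):
        total = body.get("total", {})
        events.append({
            "type": "es_index_stats",  # hypothetical event type name
            "time": now.isoformat(),
            "data": {
                "index": name,
                "docs": total.get("docs", {}).get("count", 0),
                "store_bytes": total.get("store", {}).get("size_in_bytes", 0),
            },
        })
    return events


def poll_once():
    """Fetch stats from ES and forward them to Cube (both must be running)."""
    with urllib.request.urlopen(ES_STATS_URL) as resp:
        stats = json.load(resp)
    payload = json.dumps(stats_to_cube_events(stats)).encode("utf-8")
    req = urllib.request.Request(
        CUBE_PUT_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req).close()
```

Calling `poll_once()` from a scheduler (cron, or a loop with `time.sleep`) on a single machine gives the "only one dumper" behaviour without relying on a river, at the cost of running a separate process.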

I tried bigdesk (could not get it to work, though), but we also need to have
the information persisted. I would gladly integrate with bigdesk if that's
possible.

Regards

--


(Otis Gospodnetić) #2

Hello Vinicius,

I remember looking at Rivers with the same sort of question a while back
and, if I remember correctly, I concluded that Rivers don't actually need to
be used just for indexing. Indeed, at Sematext we've implemented non-indexing
Rivers.

But if you are looking for ElasticSearch stats, you may want to try SPM
(see URL in my sig), which graphs a bunch of ES as well as system and JVM
stats, has filtering, alerts, email subscriptions, etc. A new release is
coming this week, probably Tuesday.

Otis

Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

--


(Jörg Prante) #3

Hi Vinicius,

Rivers are always for indexing from external sources;
see http://www.elasticsearch.org/guide/reference/river/

To solve the problem with polling, I implemented a websocket transport
plugin. Websockets are full-duplex communication channels. Such channels
allow pushing events from Elasticsearch to clients without the client
having initiated the request.

https://github.com/jprante/elasticsearch-transport-websocket

With websockets, distributed asynchronous client/server architectures are
possible, also known as pubsub (publish/subscribe). See also
ticket https://github.com/elasticsearch/elasticsearch/issues/1242

I'm in the process of completing it and adding features; a tutorial is
coming soon.

Best regards,

Jörg

--


(Lukáš Vlček) #4

Hi,

Could you elaborate on the problem with bigdesk? Did you
hit any issues?

Lukas

--


(Vinicius Carvalho) #5

Hi Lukas, yes. One problem we noticed was a lot of jQuery errors related to
missing JSON properties, such as the JVM info. I was using it with ES 0.19.8.
It seems that no JVM info was coming from the node, and because of that
none of the charts were drawn.

We loved the idea of bigdesk; the only problem is that we would like to have
that information stored as history (hence the idea of sending it to cube or
graphite).

Regards


--


(Vinicius Carvalho) #6

Hi Jörg, thanks for the help. But as I said, we want the daemon to run from
inside ES; that's why rivers seemed a bit more appropriate to me. Maybe one
day we will have a cluster-wide singleton plugin and I could move in that
direction. I think I'll follow Otis's advice and consider the river something
more than just a component to index inside ES. BTW, I'm using your awesome
jdbc river as a template to get mine running :)

Regards

--


(Lukáš Vlček) #7

On 10.9.2012 0:41, "Vinicius Carvalho" wrote:

> Hi Lukas, yes, one problem we noticed is a lot of jquery errors (related
> to missing json properties) like jvm info. I was using it with ES 0.19.8.
> And seems that there was no jvm info coming from the node, and because of
> that none of the charts were drawn.

Can you send me those errors, or put them in a gist? Are you missing only
the JVM info, or other data as well? Where is your node running: on AWS or
locally?

> We loved the idea of bigdesk, only problem is that we would like to have
> that information stored as history (hence the idea of sending it to cube or
> graphite)

Bigdesk does not persist data. However, running a river inside the cluster
to pull the stats and store them for persistence (for example, in another
cluster) would be a nice addition. I have been thinking about it.

--


(Jörg Prante) #8

Hi Vinicius,

For rivers, the singleton in the plugin helps guarantee that no data is
indexed twice. There is a switchover mechanism for the case where the node
fails while a river is running. The river should switch over to another
node, but ES cannot automatically detect the internal river state, which is
rather undefined in that case. Each river plugin needs to implement its own
rollback mechanism for after a switchover (for example, a support index
where the last state is stored).
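
As a rough illustration of the "support index where the last state is stored" idea, the sketch below keeps a last-processed sequence number so that a restarted instance can skip what was already handled. It is not taken from any actual river plugin: `SupportIndex` is an in-memory stand-in for a real index, and the class names and the `last_seq` field are hypothetical.

```python
class SupportIndex:
    """In-memory stand-in for an index that stores the river's last state."""

    def __init__(self):
        self._docs = {}

    def get(self, river_name):
        # A river that has never run starts from sequence number 0.
        return self._docs.get(river_name, {"last_seq": 0})

    def put(self, river_name, state):
        self._docs[river_name] = dict(state)


class CheckpointedRiver:
    """Singleton worker that records its progress after every event."""

    def __init__(self, name, support_index, sink):
        self.name = name
        self.support = support_index
        self.sink = sink
        # On start (including after a switchover), resume from the stored state.
        self.last_seq = support_index.get(name)["last_seq"]

    def process(self, events):
        """events: iterable of (seq, payload) with strictly increasing seq."""
        for seq, payload in events:
            if seq <= self.last_seq:
                continue  # already handled before the switchover
            self.sink.append(payload)
            self.last_seq = seq
            self.support.put(self.name, {"last_seq": seq})


support, sink = SupportIndex(), []
first = CheckpointedRiver("stats", support, sink)
first.process([(1, "a"), (2, "b"), (3, "c")])     # node fails after this
second = CheckpointedRiver("stats", support, sink)  # started on another node
second.process([(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")])
```

Because the checkpoint is written after the event is handed to the sink, a failure between those two steps would replay one event after the switchover; the receiving side would have to tolerate that or deduplicate.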

So the first challenge for a singleton is to ensure no event is dropped
while performing and resuming from a switchover.

Moreover, a singleton will soon become a bottleneck. Such an instance needs
to examine the whole cluster for every single event that occurs. Most
events occur per index/shard per node, so the growth is more than linear.

A better solution for pushing events would be distributed "event pumps", or
event sources placed right where the events occur.

The few global cluster events of interest (node join/split, mapping
updates) are rare and appear at the master node. Many events are
node-related (stats and info about indexes and shards). They should be
created at each node and transported through a suitable channel in the node
network to exactly the node that has clients subscribed to that kind of
event (this is where websockets come into play).
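
The routing described above can be sketched in a few lines. This is a toy in-process model only, not the websocket plugin's design: the `Node` class, its methods, and the event kinds are hypothetical, and real nodes would talk over the cluster transport rather than through direct method calls.

```python
class Node:
    """Toy cluster node that both produces events and hosts subscribed clients."""

    def __init__(self, name):
        self.name = name
        self.clients = {}  # event kind -> callbacks of locally connected clients
        self.peers = []    # other nodes in the cluster

    def subscribe(self, kind, callback):
        """Register a client connected to this node for one kind of event."""
        self.clients.setdefault(kind, []).append(callback)

    def deliver(self, kind, event):
        for callback in self.clients.get(kind, []):
            callback(event)

    def emit(self, kind, data):
        """Create an event locally and forward it only to nodes with subscribers."""
        event = {"node": self.name, "kind": kind, "data": data}
        reached = 0
        for node in [self] + self.peers:
            if node.clients.get(kind):
                node.deliver(kind, event)
                reached += 1
        return reached


def connect(nodes):
    """Make every node aware of every other node."""
    for node in nodes:
        node.peers = [other for other in nodes if other is not node]


n1, n2, n3 = Node("n1"), Node("n2"), Node("n3")
connect([n1, n2, n3])
received = []
n3.subscribe("stats", received.append)    # a client connected to n3
sent_stats = n1.emit("stats", {"docs": 42})  # produced on n1, delivered via n3
sent_info = n2.emit("info", {"jvm": "ok"})   # nobody subscribed, goes nowhere
```

Each node acts as its own event source, so no single instance has to scan the whole cluster, and an event travels only to nodes that actually have interested clients.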

Best regards,

Jörg

--

