How to set up an ES-to-ES river?

es_learner · September 28, 2012, 12:49am

Is it supported?

The objective here is to 'tee' new docs into a secondary index. My current implementation writes twice from the client: once to the primary index and once to the secondary. The primary index gets pruned every month; the secondary is never pruned.
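The client-side double write described above can be batched into a single request with the bulk API, so each document is sent once but indexed twice. Here is a minimal sketch, assuming hypothetical index names `events_primary` and `events_archive` and documents that carry their own `id` field. It only builds the newline-delimited request body and does not talk to a cluster. Older Elasticsearch releases also expect a `_type` in each action line, which is left out here.

```python
import json


def tee_bulk_body(docs, indices=("events_primary", "events_archive")):
    """Build a bulk-API body that indexes every doc into each index.

    The index names are placeholders, not names taken from this thread.
    """
    lines = []
    for doc in docs:
        for index in indices:
            # Action line: which index and id the following source line goes to.
            lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
            # Source line: the document itself.
            lines.append(json.dumps(doc))
    # The bulk API requires a newline after the last line as well.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(tee_bulk_body([{"id": "1", "text": "hello"}]), end="")
```

The resulting body would be sent as a POST to `/_bulk`. This saves a round trip, but both writes still originate at the client, so it is a tidier version of the double write rather than a replacement for it.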

dadoonet · September 28, 2012, 1:06am

See here: https://github.com/elasticsearch/elasticsearch/issues/1077 ([Feature Request] Add a river to ElasticSearch instance)

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


James_Boehmer · November 8, 2012, 7:41pm

I'm actually looking for something very similar, which I do not believe is
the same as that _source river request. I need to run two Elasticsearch
stacks separately but simultaneously, to segregate internal traffic from
external traffic. With Solr I would set up a single master and run two
sets of slaves, load-balanced independently. That way the internal slaves
could never be affected by traffic hitting the external slaves, and vice
versa. With ES, is there a way to set up a handful of nodes that are
basically their own cluster, but get their data from a master cluster which
does not store the _source?


jprante · November 9, 2012, 12:53am

Hi,

Can you please elaborate on what kind of "traffic" you mean? Is it data load
for indexing, search requests hitting the cluster, or both?
You can set up data-less ES nodes that absorb all the network
connection load; that is very easy. You can also dedicate data-less nodes
to different ports, if that is what you mean by addressing internal/external
traffic.

Jörg


James_Boehmer · November 9, 2012, 2:15am

It would be solely for querying. For example, we'd like to have a cluster
with 5 shards/1 replica being constantly indexed and queried. Then we'd
like to have a second cluster for serving external query traffic, which would
get its data from the first cluster. The second cluster would have its own
complete set of primary/replica shards, separate from the first cluster.
However, we would like it to be indexed passively from the first cluster
instead of our having to manually index both clusters simultaneously. The
purpose of the second cluster is to be able to scale and absorb traffic
independently of the internal cluster. It's somewhat important that they
not interfere with each other, but I suppose that a single cluster
could scale to handle all of the traffic anyway.


jprante · November 9, 2012, 10:37am

Hi James,

You can choose a setup within a single cluster, where the nodes (the
cluster members) serve different purposes. No need for a second cluster.

ES nodes can be started in a data-only mode, without an HTTP server, so they
never process client requests and only do the heavy lifting.

Proxy nodes can be started without data, but with HTTP, so they only
process client requests and forward them to the data nodes involved in the
queries.

You can start as many proxy nodes and data nodes as you want, so you scale
the nodes in two aspects.
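The two node roles described above come down to a few settings in each node's `elasticsearch.yml`. Here is a minimal sketch, assuming the `node.data` and `http.enabled` settings of the Elasticsearch releases current at the time of this thread. Later versions changed how roles are configured, so check the documentation for your version. The role names `data` and `proxy` and both helper functions are made up for illustration.

```python
def node_settings(role):
    """Return elasticsearch.yml settings for a hypothetical 'data' or 'proxy' role."""
    if role == "data":
        # Holds shards and does the heavy lifting; no HTTP endpoint for clients.
        return {"node.data": True, "http.enabled": False}
    if role == "proxy":
        # Holds no shards; accepts HTTP requests and forwards them to data nodes.
        return {"node.data": False, "http.enabled": True}
    raise ValueError(f"unknown role: {role!r}")


def to_yaml(settings):
    """Render flat settings as elasticsearch.yml lines (booleans in lower case)."""
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in settings.items())


if __name__ == "__main__":
    for role in ("data", "proxy"):
        print(f"# {role} node")
        print(to_yaml(node_settings(role)))
```

Master eligibility (`node.master`) is deliberately left out of the sketch; whether proxy nodes should also be master-eligible is a separate decision that this thread does not cover.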

In my view, separating proxy and data nodes into two clusters brings a lot
of hassle. Nodes cannot talk to each other across cluster boundaries.
You would have to store your data twice using your client tool
alone (while ES can do that for you much more easily with replica levels),
and afterwards you would have to keep the data in sync when nodes fail
(which is tedious with external client tools, while ES does it for you
automatically through replicated shards and allocation control).
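The replica level mentioned above is an index setting, `number_of_replicas`, that can be changed on a live index through the update-settings API. Here is a minimal sketch that builds such a request and works out how many shard copies the cluster then holds. The index name `docs` is hypothetical, and the 5-shard figure is borrowed from the setup described earlier in this thread.

```python
import json


def replica_update_request(index, replicas):
    """Return (method, path, body) for an update-settings call on one index."""
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return "PUT", f"/{index}/_settings", body


def total_shard_copies(primaries, replicas):
    """Count stored shard copies: every primary plus each of its replicas."""
    return primaries * (1 + replicas)


if __name__ == "__main__":
    method, path, body = replica_update_request("docs", 2)
    print(method, path, body)
    # 5 primaries with 1 replica each give 10 copies; with 2 replicas, 15.
    print(total_shard_copies(5, 1), total_shard_copies(5, 2))
```

Raising the replica count adds search capacity inside one cluster, but it does not by itself pin the extra copies to a particular group of nodes.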

Cheers,

Jörg


James_Boehmer · November 10, 2012, 1:29pm

Hi Jörg,

We would like each cluster to do its own heavy lifting. As for HTTP, we
are using a load balancer for insertions and queries, so essentially every
node in the clusters serves both purposes in round-robin fashion. I
do not think separating the HTTP requests from the heavy lifting of
searching is quite what we're looking for in this situation. But what do
you mean by replica levels? Would that imply creating additional replicas
of a shard and assigning them to specific nodes?

-Jim

