Plugin broadcasting/scalability

A while ago I used a plugin on a project to control the migration to, and
the maintenance of, a secondary data source/index in Elasticsearch. It
worked really well and avoided lots of messy dependency management in the
host application, as all the logic was "hidden" behind an ES REST endpoint.

But I was never really sure how scalable this was. I have two questions:

1.) When I register a plugin, is it available on all nodes? (I'm
assuming "yes", which would mean that controlling parallel/overlapping
call-outs is important.)

2.) When I call my endpoint, I have a Client object passed in the
constructor: when I debug this, it is an instance of NodeClient, which
presumably means I am working on a single node. Is it possible to construct
a TransportClient from this, so that I can address more than one node and
take advantage of e.g. bulk imports in parallel?

Regarding 2.), I've had a look at the JDBC River code, and the Feeder mode
(addressing the cluster from a component running in a separate JVM) seems
to be there precisely because of this drawback. The River mode seems to
work off one node, like my plugin did/does.

Is my understanding correct?

Andrew

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/16664d7b-3934-4a90-8b1f-209ede3ff08c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

  1. Plugins only run on the node they are installed on. For this reason, a
    plugin should be installed on all nodes, whether or not it is used on all
    of them. Plugins can implement actions that other nodes can address
    (broadcast operations).

  2. The NodeClient you get sits on the local node but can address all
    nodes. Bulk imports automatically use nodes in parallel, i.e. each
    operation is forwarded to the node that holds the shard of the index it
    addresses.
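
To make point 2 concrete, here is a minimal sketch of how one bulk request can be split by shard and then by node. It is a toy model in plain Java, not Elasticsearch code: the class name, the node names and the shard layout are made up for illustration, and String.hashCode() is only a stand-in for the hash function Elasticsearch really uses. The only thing taken from Elasticsearch is the general rule that a document goes to shard hash(routing) modulo the number of primary shards, where the routing value defaults to the document id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of splitting one bulk request by target node. Not Elasticsearch code. */
final class BulkRoutingSketch {

    /** Stand-in for the routing hash; Elasticsearch uses its own hash function, not hashCode(). */
    static int shardFor(String docId, int primaryShards) {
        return Math.floorMod(docId.hashCode(), primaryShards);
    }

    /** Groups document ids by the (hypothetical) node that holds their primary shard. */
    static Map<String, List<String>> groupByNode(List<String> docIds, String[] shardToNode) {
        Map<String, List<String>> perNode = new TreeMap<>();
        for (String id : docIds) {
            String node = shardToNode[shardFor(id, shardToNode.length)];
            perNode.computeIfAbsent(node, k -> new ArrayList<>()).add(id);
        }
        return perNode;
    }

    public static void main(String[] args) {
        // Assumed layout: 4 primary shards spread over 2 nodes.
        String[] shardToNode = {"node-a", "node-b", "node-a", "node-b"};
        List<String> ids = Arrays.asList("1", "2", "3", "4", "5", "6");
        // Prints:
        // node-a <- [2, 4, 6]
        // node-b <- [1, 3, 5]
        groupByNode(ids, shardToNode)
                .forEach((node, batch) -> System.out.println(node + " <- " + batch));
    }
}
```

Each per-node group can be handed to its node independently of the others, which is why a single client on the local node is not by itself a limit on bulk indexing.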

The JDBC feeder mode runs outside the cluster (in a separate JVM) and uses a
TransportClient. The reason for the feeder mode is that it works without
depending on an ES node life cycle - rivers are flaky when they are forced
to hop to another node, and they are restarted when the river node goes down.

Rivers use NodeClient instances and the bulk indexing scales well. What
does not scale in rivers is the river management by ES and the "pull"
style, i.e. the fetch phase has no distributed architecture. As a result, a
river instance soon becomes a bottleneck or a single point of failure, a
design that does not fit the well-designed architecture of the rest of the
ES system.
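
As a rough illustration of what a distributed fetch phase could look like, in contrast to a single river instance pulling everything, here is a sketch in plain Java. It is not JDBC river or Elasticsearch code: the class and method names are hypothetical, the "rows" are just integer ranges, and the line that would send a bulk request is a placeholder. It assumes workers > 0 and batchSize > 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model of a partitioned, push-style fetch phase. Not river or Elasticsearch code. */
final class PartitionedFeederSketch {

    /** Splits the row range [0, totalRows) into contiguous [start, end) ranges, one per worker. */
    static List<int[]> partition(int totalRows, int workers) {
        List<int[]> ranges = new ArrayList<>();
        int base = totalRows / workers;
        int extra = totalRows % workers; // the first 'extra' workers get one more row
        int start = 0;
        for (int w = 0; w < workers; w++) {
            int size = base + (w < extra ? 1 : 0);
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }

    /** Each worker walks its own range in batches; returns the total number of rows pushed. */
    static int feed(int totalRows, int workers, int batchSize) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int[] range : partition(totalRows, workers)) {
                results.add(pool.submit(() -> {
                    int pushed = 0;
                    for (int from = range[0]; from < range[1]; from += batchSize) {
                        int to = Math.min(from + batchSize, range[1]);
                        pushed += to - from; // placeholder for sending one bulk request
                    }
                    return pushed;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("feeder worker failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, `PartitionedFeederSketch.feed(10, 3, 2)` splits ten rows into ranges of four, three and three rows and returns 10 once all workers have finished. The workers here are threads in one JVM; the same partitioning idea would apply to several feeder processes, so that losing one of them does not stop the others.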

Jörg

(In reply to AndrewK's message of Tue, Dec 23, 2014 at 10:23 AM.)


Thank you for the information - that has clarified things for me!
Andrew

(In reply to Jörg Prante's message of Tuesday, 23 December 2014, 14:38:34 UTC+1.)
