[Hadoop][Pig] Load balancing over multiple servers


(Niels Basjes) #1

Hi,

I have a basic experimental setup where I use the Hadoop + Pig + Elasticsearch
combination to map-reduce data into my 6-node ES cluster.

The command I currently use to store the information is

STORE WantedOrders INTO 'orderitems/orderitem' USING
org.elasticsearch.hadoop.pig.ESStorage('es.host=node10.kluster.basjes.lan;es.mapping.names=date:@timestamp');

The problem is that when I look at the load on the various systems, I
find that this node10 server acts as the master for all partitions and
has a very high CPU load.
All the other servers are near idle.

What is the correct way to spread the load over all servers in the ES
cluster, and thus speed up the process of loading/indexing the data?

Niels Basjes

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Costin Leau) #2

Hi,

That's because that one node receives all the write traffic at the moment. There are plans in the current release cycle to:
a. add multiple nodes (mainly for discovery purposes)
b. do concurrent writes on the target nodes

Do you use certain shard/partitioning settings or just the defaults? Also, how big is your data set in Pig?

Cheers,

On 31/10/2013 11:02 AM, Niels Basjes wrote:


--
Costin



(Damien Hardy) #3

Hello,

Since the ES Pig UDF is based on the ES HTTP interface (port 9200), you can
put an HTTP load balancer (such as haproxy) in front of your 6-node ES
cluster as a backend, and use it as the es.host setting.
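A minimal haproxy configuration sketch along those lines (only node10 and node11 appear in this thread; the remaining hostnames, ports and timeouts are illustrative):

```
# /etc/haproxy/haproxy.cfg -- round-robin HTTP proxy in front of the ES cluster
defaults
    mode http
    timeout connect 5s
    timeout client  60s
    timeout server  60s

frontend es_in
    bind *:9200
    default_backend es_cluster

backend es_cluster
    balance roundrobin
    server node10 node10.kluster.basjes.lan:9200 check
    server node11 node11.kluster.basjes.lan:9200 check
    # ... one 'server' line per remaining ES node
```

With this in place, es.host points at the machine running haproxy instead of at a single ES node.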

Best regards,

On 2013/10/31, Niels Basjes niels@basj.es wrote:



(Niels Basjes) #4

Hi,

On Thursday, 31 October 2013 at 10:29:29 UTC+1, Costin Leau wrote:

That's because the node receives all the write traffic at the moment.

I thought as much.

There are plans in the current release cycle to:
a. add multiple nodes (mainly for discovery purposes)
b. do concurrent writes on the target nodes

Any indication when this is planned to become available?

Do you use certain shard/partitioning settings or just the defaults?

Nothing special, just getting started.

Also, how big is your data set in Pig?

129M very small documents (approx. 20 GB in total). The main goal is a
simple graph in Kibana.

Niels



(Costin Leau) #5

If everything goes according to plan, we'll have something in master by next week.
You can watch this issue for updates [1].

[1] https://github.com/elasticsearch/elasticsearch-hadoop/issues/74

On 31/10/2013 2:41 PM, Niels Basjes wrote:


--
Costin



(Niels Basjes) #6

Hi,

On Thursday, 31 October 2013 at 10:45:47 UTC+1, Damien Hardy wrote:

Since the ES Pig UDF is based on the ES HTTP interface (port 9200), you can
put an HTTP load balancer (such as haproxy) in front of your 6-node ES
cluster as a backend, and use it as the es.host setting.

I just installed haproxy and this seems to work great.
Thanks!

Niels Basjes

P.S. Now I'm overloading my experimental ES setup ...

java.lang.IllegalStateException: [POST] on [orderitems/orderitem/_bulk] failed;
server [http://node11.kluster.basjes.lan] returned [504 Gateway Time-out: The server didn't respond in time.]



(Costin Leau) #7

Hi,

The master branch has been enhanced so that all writes are concurrent, based on the number of shards (!) (not nodes) available
for the target index. You don't need to specify multiple hosts - discovery happens automatically at runtime, and it
works across all frameworks - M&R, Cascading, Pig and Hive.
The concurrent/parallel writes occur for both map-only and map-reduce jobs, as the map or reduce tasks are
'spread/partitioned' across the shards of the target index.
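Since write parallelism follows the shard count, it can help to create the target index with one shard per node before loading. A sketch using the standard index-creation request (the shard and replica counts here are illustrative, not a recommendation):

```
curl -XPUT 'http://node10.kluster.basjes.lan:9200/orderitems' -d '{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 0
  }
}'
```

With 6 shards spread over the 6 nodes, up to 6 tasks can write in parallel; replicas can be added back once the bulk load is done.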

A note on performance (until the docs are updated) - in most cases these jobs only have mappers (no reducers) - Pig
and Hive are smart enough to figure this out and eliminate the reduce phase, but M&R isn't, so consider disabling the reduce
phase for improved performance.
This means the write parallelism is driven by the number of input splits for your job - that is, if your job is not
splittable (e.g. the source file is compressed), you will still end up with only one task writing to ES. As you have a lot
of small files, this shouldn't be the case here.
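Disabling the reduce phase in a plain M&R job can be sketched as follows (a fragment, not a full job: the mapper class and job name are illustrative; setNumReduceTasks is the standard Hadoop Job API call):

```java
// Sketch of a map-only job: mappers write straight to ES, no reduce phase.
// MyMapper and the job name are placeholders.
Job job = Job.getInstance(new Configuration(), "es-bulk-index");
job.setMapperClass(MyMapper.class);
job.setNumReduceTasks(0); // map-only: skip the reduce phase entirely
```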

I've pushed the nightly build (elasticsearch-hadoop-1.3.0.BUILD-20131102.205934-207) to Maven central so the snapshot
should be downloaded automatically by any Maven tool.
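To pick it up, a dependency along these lines should work (the coordinates are inferred from the artifact name above; check the project README for the exact groupId and snapshot repository):

```xml
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>1.3.0.BUILD-SNAPSHOT</version>
</dependency>
```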

Let me know how it goes.

Cheers!

On 31/10/2013 2:41 PM, Niels Basjes wrote:


--
Costin



(system) #8