[Hadoop] - Difference between task creation for a write and read-update-write operation in ES


(Piyush Goyal) #1

Hi Costin,

I saw a different behavior of creating task for write to ES operation while
working on my project. The difference is as follows:

1.) Only write to ES - When I create an RDD of my own to insert data into
ES, the task are created based on property "es.batch.size.bytes" and
"es.batch.size.entries". Number of task created = Number of documents in
RDD/the value of either of these properties. The request hits the node and
node decides the shard to which document needs routed based on routing
value(if specified).

2.) Read-Update-write to ES - Consider this case when I have to read data
from ES, store it in RDD, do some updates in the documents in RDD and then
index these documents back to ES. While reading, the number of tasks are
created on basis of number of shards and I presume that each tasks fetch
data from each Shard(not sure of how it works? - Task delagting request to
node to serve data from a particular shard?). Now when I try to
update/re-index data using same RDD and function saveToESWithMetadata, this
time the number of task created is a number which is not based on point 1.
If the data in each partition is less than property
"es.batch.size.entries", it creates the same number of tasks as are the
number of shards, else greater than it.

What's the reason behind this? Also like read operation where request is
from particular shard, does write operation also write to a shard or all
the task delegate their request to the node?

Thanks in advance
Piyush Costin,

I saw a different behavior of creating task for write to ES operation while
working on my project. The difference is as follows:

1.) Only write to ES - When I create an RDD of my own to insert data into
ES, the task are created based on property "es.batch.size.bytes" and
"es.batch.size.entries". Number of task created = Number of documents in
RDD/the value of either of these properties. The request hits the node and
node decides the shard to which document needs routed based on routing
value(if specified).

2.) Read-Update-write to ES - Consider this case when I have to read data
from ES, store it in RDD, do some updates in the documents in RDD and then
index these documents back to ES. While reading, the number of tasks are
created on basis of number of shards and I presume that each tasks fetch
data from each Shard(not sure of how it works? - Task delagting request to
node to serve data from a particular shard?). Now when I try to
update/re-index data using same RDD and function saveToESWithMetadata, this
time the number of task created is a number which is not based on point 1.
If the data in each partition is less than property
"es.batch.size.entries", it creates the same number of tasks as are the
number of shards, else greater than it.

What's the reason behind this? Also like read operation where request is
from particular shard, does write operation also write to a shard or all
the task delegate their request to the node?

Thanks in advance
Piyush

--
Please update your bookmarks! We moved to https://discuss.elastic.co/

You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ec268e76-6220-430b-958a-884692283ca0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Costin Leau) #2

Hi,

First it would help to know what version of Elaticsearch, Elasticsearch Hadoop, JVM and Spark are you using.

On 5/6/15 1:41 PM, piyush goyal wrote:

Hi Costin,

I saw a different behavior of creating task for write to ES operation while working on my project. The difference is
as follows:

1.) Only write to ES - When I create an RDD of my own to insert data into ES, the task are created based on property
"es.batch.size.bytes" and "es.batch.size.entries". Number of task created = Number of documents in RDD/the value of
either of these properties. The request hits the node and node decides the shard to which document needs routed based
on routing value(if specified).

What makes you say that? In case of writing, the number of writers is determined by the number of shards of your target
index. The more shards, the more concurrent writers. The behavior of all writers can be further tweaked through the
properties mentioned however they do NOT affect the process parallelism.

2.) Read-Update-write to ES - Consider this case when I have to read data from ES, store it in RDD, do some updates in
the documents in RDD and then index these documents back to ES. While reading, the number of tasks are created on
basis of number of shards and I presume that each tasks fetch data from each Shard(not sure of how it works? - Task
delagting request to node to serve data from a particular shard?). Now when I try to update/re-index data using same
RDD and function saveToESWithMetadata, this time the number of task created is a number which is not based on point 1.
If the data in each partition is less than property "es.batch.size.entries", it creates the same number of tasks as
are the number of shards, else greater than it.

In case of reading, the number of tasks that can work in parallel is determine by the source parallelism - its number of
partitions. So if you have an RDD with 1 partition likely it will result into one task that will write to another RDD
down the line. Assuming that RDD is backed by Elastic - even if the index has 10 shards and thus can have a parallelism
of 10, if the source has only one partition and there's only one task, there's nothing the connector can do to increase
this number.

Again, remember that the connector is not an active component per se - rather it bridges two systems. It's spark and
more importantly the RDDs structures that control the number of tasks/threads that can work in parallel at a given time.

Going forward, let's take this to Discuss. It's a discussion that I'm sure will benefit other users and hopefully it
will be better archived there.

What's the reason behind this? Also like read operation where request is from particular shard, does write operation
also write to a shard or all the task delegate their request to the node?

Thanks in advance
Piyush Costin,

I saw a different behavior of creating task for write to ES operation while working on my project. The difference is
as follows:

|1.) Only write to ES - When I create an RDD of my own to insert data into ES, the task are created based on property
"es.batch.size.bytes" and "es.batch.size.entries". Number of task created = Number of documents in RDD/the value of
either of these properties. The request hits the node and node decides the shard to which document needs routed based
on routing value(if specified).

2.) Read-Update-write to ES - Consider this case when I have to read data from ES, store it in RDD, do some updates in
the documents in RDD and then index these documents back to ES. While reading, the number of tasks are created on
basis of number of shards and I presume that each tasks fetch data from each Shard(not sure of how it works? - Task
delagting request to node to serve data from a particular shard?). Now when I try to update/re-index data using same
RDD and function saveToESWithMetadata, this time the number of task created is a number which is not based on point 1.
If the data in each partition is less than property "es.batch.size.entries", it creates the same number of tasks as
are the number of shards, else greater than it.

What's the reason behind this? Also like read operation where request is from particular shard, does write operation
also write to a shard or all the task delegate their request to the node?

Thanks in advance

Piyush

Please update your bookmarks! We moved to https://discuss.elastic.co/

You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
elasticsearch+unsubscribe@googlegroups.com mailto:elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/ec268e76-6220-430b-958a-884692283ca0%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/ec268e76-6220-430b-958a-884692283ca0%40googlegroups.com?utm_medium=email&utm_source=footer.
For more options, visit https://groups.google.com/d/optout.

--
Costin

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/

You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/55544BCE.6060201%40gmail.com.
For more options, visit https://groups.google.com/d/optout.


(system) #3