[Hadoop] - How is a write operation divided into tasks?


A very basic question about the implementation, best understood through an
example.

Architecture: A 3-node cluster with a single index and 32 shards. A type
"data" contains months of data, with roughly 40K-50K documents per month.
A routing value built from the month and year is used to route this data
to a shard. So, in short, 1 month of data goes to 1 shard.
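
For illustration, the routing value can be pictured as a string built from
the year and month of each document. This is a minimal Python sketch, not
the actual job code: the routing_value helper, the "YYYY-MM" format and the
sample documents are all hypothetical.

```python
from collections import Counter
from datetime import date


def routing_value(d: date) -> str:
    """Build a routing value from year and month (hypothetical format)."""
    return f"{d.year:04d}-{d.month:02d}"


# Made-up sample documents; the real index holds 40K-50K per month.
docs = [
    {"id": 1, "date": date(2015, 1, 5)},
    {"id": 2, "date": date(2015, 1, 20)},
    {"id": 3, "date": date(2015, 2, 3)},
]

# Count how many documents share each routing value.
per_routing = Counter(routing_value(doc["date"]) for doc in docs)
print(per_routing)
```

All documents from the same month get the same routing value, which is what
keeps a month of data together.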

Requirement: A simple one: pass a query, get the data, update each
document, and insert it back into the same shard. Since the number of
shards is 32, the read creates 32 tasks; each task fetches 1 month of data,
updates it, and sends it back to ES for writing with the same routing value
so that it overwrites the previous document.

Flow: The retrieval seems easy: 32 tasks are created, one per shard, and
they bring the data into a single RDD. The next step updates each document.
The step after that is the write, which raises the following question:

How does the write operation divide itself into tasks?
Going by the documentation, I understood that it depends on
es.batch.size.bytes and es.batch.size.entries, and that the values of these
two properties define the number of tasks. What I presumed was that the RDD
is partitioned again into n tasks depending on the values of these
parameters, and that many tasks then run to index/update the data. However,
when I ran a write operation with only 5 documents and
es.batch.size.entries set to 10,000, I still saw as many as 32 tasks doing
a write operation on my es.resource. I am still confused about how task
allocation works here. Can you please explain?
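
The gap between the presumption and the observation can be put into a small
Python simulation. It is only a sketch under two assumptions that are not
confirmed in this post: that Spark runs one write task per RDD partition,
and that es.batch.size.entries only limits how many documents a single task
buffers before it sends a bulk request. The function names are hypothetical.

```python
import math


def presumed_task_count(num_docs: int, batch_entries: int) -> int:
    """The presumption: one task per batch of `batch_entries` documents."""
    return max(1, math.ceil(num_docs / batch_entries))


def write_partition(docs, batch_entries):
    """Simulate one write task: buffer documents, flush a bulk when full."""
    bulk_requests, buffer = [], []
    for doc in docs:
        buffer.append(doc)
        if len(buffer) >= batch_entries:
            bulk_requests.append(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the task finishes
        bulk_requests.append(buffer)
    return bulk_requests


def simulate_write(partitions, batch_entries):
    """Assumption: one task per partition, even if the partition is empty."""
    tasks = len(partitions)
    bulks = sum(len(write_partition(p, batch_entries)) for p in partitions)
    return tasks, bulks


# 5 documents spread over an RDD that still has 32 partitions from the read.
partitions = [[] for _ in range(32)]
for i in range(5):
    partitions[i % 32].append({"id": i})

tasks, bulks = simulate_write(partitions, batch_entries=10_000)
print("presumed tasks:", presumed_task_count(5, 10_000))
print("simulated tasks:", tasks, "bulk requests:", bulks)
```

Under these assumptions the batch settings change how many bulk requests
each task sends, not how many tasks there are, which would match the 32
tasks seen for 5 documents.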

Now comes another question: In a standalone write-to-ES operation, how does
the code identify which shard holds which routing value? My assumption was
that all the tasks send the data to an ES node, which then distributes the
data itself to the shards based on the routing value, just like a normal
bulk index operation.
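
That assumption can be sketched in Python as follows. The shard is derived
from a hash of the routing value modulo the number of primary shards.
zlib.crc32 is only a stand-in here, since Elasticsearch uses its own hash
function, and shard_for and distribute_bulk are hypothetical names, not
part of any client API.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 32  # number of primary shards in the index described above


def shard_for(routing: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from the routing value (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % num_shards


def distribute_bulk(docs):
    """What the receiving node is assumed to do with one bulk request."""
    by_shard = defaultdict(list)
    for doc in docs:
        by_shard[shard_for(doc["routing"])].append(doc["id"])
    return dict(by_shard)


# Made-up bulk payload; the writing task only attaches the routing value.
docs = [
    {"id": 1, "routing": "2015-01"},
    {"id": 2, "routing": "2015-01"},
    {"id": 3, "routing": "2015-02"},
]
print(distribute_bulk(docs))
```

In this picture the writing task never needs to know the shard layout. It
only has to send the same routing value it used before, so the updated
document lands on the same shard and overwrites the old one.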

Can you please explain the process of task creation for the two operations:
read-update-write and write only?

Thanks in advance
