[Hadoop] How to collect stats in elasticsearch MR job


(Abhijit Bose) #1

Hello,

I would like to collect some stats on the entries being written when
running a MapReduce job using the elasticsearch-hadoop library. I am using
the default Mapper.class with a batch of entries in JSON files as input to
MR, e.g.
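    // Read the JSON entries line by line and write them out through EsOutputFormat
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(EsOutputFormat.class);
    job.setMapOutputValueClass(Text.class);

    // Default (identity) mapper, map-only job
    job.setMapperClass(Mapper.class);
    job.setNumReduceTasks(0);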

What I would like to do is collect stats on the entries that were passed to
EsOutputFormat and written to the ES cluster, similar to how one would
collect stats from a BulkResponse, mostly how many milliseconds the batch
operation took.

For a new index populated with a bunch of docs via MR (e.g. daily logs),
there is an easier way to do this: in the main MR thread, once the map
tasks finish, I can call the Stats API on the index to get the stats up to
the time of the API call. However, when I am writing in batches from
Mapper --> EsOutputFormat --> RestRepository, I would like to collect the
stats at write time. Is that possible without extending the current
library (e.g. by introducing a set of MR counters in EsOutputFormat.java)?

Thanks!

Abhijit

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/7090aaee-8436-4bfa-b78a-ebeb7cd4aff8%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Costin Leau) #2

We can introduce such counters. What exactly are you interested in?
The default counters in Hadoop provide information on the amount of data read/written.
Do you want to extract the information directly in Hadoop as opposed to ES proper?

--
Costin


(Abhijit Bose) #3

This is to capture the time taken by ES to process the items in that batch
of records. Yes, the total size written in bytes will already be in an MR
counter.
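
To make the request concrete, here is a minimal sketch of the kind of
per-batch timing meant above. BatchTimer is a hypothetical helper, not part
of elasticsearch-hadoop, and it measures the client-side round trip of
whatever write call is passed in, not the time ES reports for a bulk
request. In a real job the totals could presumably be pushed into MR
counters instead of being kept in fields.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical helper (an assumption for illustration, not library code).
public class BatchTimer {
    private final LongSupplier nanoClock;
    private long batches;
    private long totalMillis;
    private long maxMillis;

    public BatchTimer() {
        this(System::nanoTime);
    }

    // The clock is injectable so the timer can be exercised deterministically.
    public BatchTimer(LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
    }

    // Runs one batch write and records its duration, even if the write throws.
    public void time(Runnable batchWrite) {
        long start = nanoClock.getAsLong();
        try {
            batchWrite.run();
        } finally {
            long elapsed = TimeUnit.NANOSECONDS.toMillis(nanoClock.getAsLong() - start);
            batches++;
            totalMillis += elapsed;
            maxMillis = Math.max(maxMillis, elapsed);
        }
    }

    public long batches() { return batches; }
    public long totalMillis() { return totalMillis; }
    public long maxMillis() { return maxMillis; }

    public double averageMillis() {
        return batches == 0 ? 0.0 : (double) totalMillis / batches;
    }

    public static void main(String[] args) {
        BatchTimer timer = new BatchTimer();
        for (int i = 0; i < 3; i++) {
            // Stand-in for a real batch write to the cluster.
            timer.time(() -> {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        System.out.println("batches=" + timer.batches()
                + " totalMillis=" + timer.totalMillis()
                + " maxMillis=" + timer.maxMillis());
    }
}
```

The three numbers kept here (batch count, total and maximum milliseconds)
are the ones I would want to see next to the existing byte counters.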


(Costin Leau) #4

Can you please raise an issue to make sure we don't lose sight of this one?

Thanks,

--
Costin


(Costin Leau) #5

As an update, this is now supported in master (the upcoming elasticsearch-hadoop 1.3.0.M3). From the console:

01:13:50,630 INFO main mapred.JobClient - Elasticsearch Hadoop Counters
01:13:50,630 INFO main mapred.JobClient - Bytes Written=173923
01:13:50,630 INFO main mapred.JobClient - Bytes Read=0
01:13:50,631 INFO main mapred.JobClient - Bulk Retries=0
01:13:50,631 INFO main mapred.JobClient - Network Retries=0
01:13:50,631 INFO main mapred.JobClient - Bulk Writes=22
01:13:50,631 INFO main mapred.JobClient - Documents Read=0
01:13:50,631 INFO main mapred.JobClient - Documents Written=993
01:13:50,631 INFO main mapred.JobClient - Node Retries=0
01:13:50,631 INFO main mapred.JobClient - Documents Retried=0
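
For anyone who wants to post-process these numbers, a minimal sketch that parses console lines of the shape shown above and derives a couple of averages. EsCounterSummary and its methods are hypothetical names made up for this example; only the counter names come from the output above, and the parsing assumes the "prefix - Name=value" layout of that log.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (an assumption for illustration, not library code).
public class EsCounterSummary {

    // Parses lines of the form "<log prefix> - <Counter Name>=<value>".
    public static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : lines) {
            int dash = line.indexOf(" - ");
            int eq = line.lastIndexOf('=');
            if (dash < 0 || eq < dash) {
                continue; // group header or unrelated line
            }
            String name = line.substring(dash + 3, eq).trim();
            try {
                counters.put(name, Long.parseLong(line.substring(eq + 1).trim()));
            } catch (NumberFormatException e) {
                // value is not a number, so this is not a counter line
            }
        }
        return counters;
    }

    // Returns numerator / denominator, or 0.0 when the denominator is missing or zero.
    public static double ratio(Map<String, Long> counters, String numerator, String denominator) {
        long d = counters.getOrDefault(denominator, 0L);
        return d == 0 ? 0.0 : (double) counters.getOrDefault(numerator, 0L) / d;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "01:13:50,630 INFO main mapred.JobClient - Elasticsearch Hadoop Counters",
                "01:13:50,630 INFO main mapred.JobClient - Bytes Written=173923",
                "01:13:50,631 INFO main mapred.JobClient - Bulk Writes=22",
                "01:13:50,631 INFO main mapred.JobClient - Documents Written=993");
        Map<String, Long> counters = parse(log);
        System.out.println(String.format(Locale.ROOT, "docs per bulk write: %.1f",
                ratio(counters, "Documents Written", "Bulk Writes")));
        System.out.println(String.format(Locale.ROOT, "bytes per document: %.1f",
                ratio(counters, "Bytes Written", "Documents Written")));
    }
}
```

With the figures above this works out to roughly 45 documents per bulk write and roughly 175 bytes per document.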

Cheers,

--
Costin

