Bulk Request Monitoring

According to the docs, monitoring bulk requests is important for tuning ES-Hadoop ingest. I think it would also be beneficial to set up monitoring around these requests for production operation.

I've tried a few different routes to gather these metrics:

  1. Parse the security audit log and filter successful authentications for bulk requests.
  2. Use Kibana's monitoring plugin: go to the index's advanced view and use the Request Rate and Request Time visualizations.
  3. Enable debug logging on org.elasticsearch.hadoop.rest.bulk.
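
For route 3, assuming a log4j 1.x properties file (adjust accordingly for log4j2 or logback), the debug logger can be enabled with something like:

```properties
# Assumed log4j 1.x syntax -- turn on debug output for ES-Hadoop's bulk layer
log4j.logger.org.elasticsearch.hadoop.rest.bulk=DEBUG
```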

It looks like option 3 has the richest information, but it's still not enough to answer questions like:

  • What's the average bulk request response time?
  • How many bulk requests is the es-hadoop job sending per second?
  • How many rejections are happening?

As a workaround, I implemented a simple log parser that correlates request-created events with request-response events.
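
A minimal sketch of that kind of parser is below. The line formats and the `bulk_latencies` helper are assumptions for illustration; the real ES-Hadoop debug output differs, so the regexes would need to be adapted to the actual log lines.

```python
import re
from datetime import datetime

# Hypothetical log-line shapes -- adapt these to the real debug output.
CREATED = re.compile(r"^(?P<ts>\S+ \S+) .*bulk request \[(?P<id>\d+)\] created")
RESPONSE = re.compile(r"^(?P<ts>\S+ \S+) .*bulk request \[(?P<id>\d+)\] response")

def _parse_ts(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")

def bulk_latencies(lines):
    """Correlate created/response events by request id; return latencies in seconds."""
    created = {}
    latencies = []
    for line in lines:
        m = CREATED.match(line)
        if m:
            created[m["id"]] = _parse_ts(m["ts"])
            continue
        m = RESPONSE.match(line)
        if m and m["id"] in created:
            latencies.append((_parse_ts(m["ts"]) - created.pop(m["id"])).total_seconds())
    return latencies

log = [
    "2024-01-01 00:00:00,000 DEBUG bulk request [1] created",
    "2024-01-01 00:00:00,250 DEBUG bulk request [1] response",
]
print(bulk_latencies(log))  # [0.25]
```

With the latencies in hand, averages and request rates fall out of simple aggregation over the parsed events.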

Is there an easier way to gather and consume these metrics?

  1. Through Kibana's monitoring plugin?
  2. Expose these metrics as custom Spark metrics?

@james.baiera any thoughts? Should I open a feature request issue on GitHub?

I'm not aware of any existing way to collect these metrics. The correlation id that gets logged with the bulk request isn't really used anywhere outside of ES-Hadoop - it's just meant as a way to correlate bulk requests and responses together.

In the Hadoop-based integrations (MR, Hive, Pig), we integrate with Hadoop's built-in counters on the job context to accumulate statistics from the job. The stats that we collect are found in org.elasticsearch.hadoop.rest.stats.Stats. I'm not aware of any specific mechanisms in Spark for collecting these sorts of statistics. In the past I've thought about potentially registering a handful of accumulators on the job that would collect these stats. That said, I haven't gotten around to testing that out or doing much research in this space.
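
To make the accumulator idea concrete, here is an illustrative stand-in in plain Python. The field names are assumptions, not the actual Stats API, and the merge step is what a set of Spark accumulators (or one custom AccumulatorV2) would perform across tasks:

```python
from dataclasses import dataclass

# Illustrative stand-in for counters like those in
# org.elasticsearch.hadoop.rest.stats.Stats (field names are assumed).
@dataclass
class TaskStats:
    bulk_total: int = 0      # number of bulk requests sent by this task
    bulk_time_ms: int = 0    # cumulative time spent in bulk calls
    bulk_retries: int = 0    # retried bulks (a rough proxy for rejections)

    def merge(self, other: "TaskStats") -> "TaskStats":
        # This per-field sum is exactly what an accumulator's merge does.
        return TaskStats(
            self.bulk_total + other.bulk_total,
            self.bulk_time_ms + other.bulk_time_ms,
            self.bulk_retries + other.bulk_retries,
        )

# Driver-side aggregation across tasks, as accumulators would report it:
per_task = [TaskStats(10, 1200, 1), TaskStats(8, 900, 0)]
job = TaskStats()
for s in per_task:
    job = job.merge(s)
print(job)  # TaskStats(bulk_total=18, bulk_time_ms=2100, bulk_retries=1)
```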

  • What's the average bulk request response time?
  • How many bulk requests is the es-hadoop job sending per second?
  • How many rejections are happening?

These should be calculable from the existing stats collected; they're just not surfaced. Average bulk time would be the total bulk time divided by the bulk total. Bulks per second would be the bulk total divided by the job's wall-clock time. Rejections aren't tracked exactly, but could be added, or could generally be inferred from the bulk retries statistic, since rejections are almost always the cause of a retry.
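
The arithmetic is straightforward; the numbers below are made-up examples, not real Stats output:

```python
# Deriving the asked-for metrics from already-collected totals.
bulk_total = 1800        # bulk requests sent over the whole job
bulk_time_ms = 90_000    # cumulative time spent in bulk requests
wall_clock_s = 600       # job wall-clock duration in seconds
bulk_retries = 12        # retried bulks, mostly caused by rejections

avg_bulk_ms = bulk_time_ms / bulk_total      # average bulk response time
bulks_per_sec = bulk_total / wall_clock_s    # bulk request rate
print(avg_bulk_ms, bulks_per_sec, bulk_retries)  # 50.0 3.0 12
```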

I would say it's probably worth opening an issue on GitHub for this - just something like "Find a way to surface Stats data in the Spark integration".