Error Hadoop/ElasticSearch Too Many Requests

(Guillermo Ortiz) #1

I'm running Spark against ElasticSearch using the ElasticSearch API.

I have 6 executors with one core each. There are no queued tasks. I only have two ElasticSearch nodes with 8 cores and 32 GB, but it seems they should be able to handle that traffic.

I have checked the ElasticSearch logs as well, but there are no entries.

For now, I have reduced the number of executors to 3 to see what happens.

Is it really too many producers? It doesn't seem so, since there are no queued tasks, and I also checked the CPU usage on the ElasticSearch nodes and it's about 30%.

User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 82742.0 failed 4 times, most recent failure: Lost task 4.3 in stage 82742.0 (TID 262382,xxxx): Found unrecoverable error [xxx:9200] returned Too Many Requests(429) - rejected execution of org.elasticsearch.transport.TransportService$4@2c70992a on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@294f7f8b[Running, pool size = 8, active threads = 8, queued tasks = 50
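For anyone reading the rejection text above, here is a minimal sketch of pulling the thread pool counters out of it. The function name `parse_rejection` and the `saturated` check are my own illustration, not part of ES or ES-Hadoop; the message string is copied from the trace as posted (truncated the same way).

```python
import re

# Rejection text copied from the stack trace above (truncated as posted).
MESSAGE = (
    "rejected execution of org.elasticsearch.transport.TransportService$4@2c70992a "
    "on EsThreadPoolExecutor[bulk, queue capacity = 50, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@294f7f8b"
    "[Running, pool size = 8, active threads = 8, queued tasks = 50"
)


def parse_rejection(message):
    """Extract the thread pool name and its counters from a rejection message.

    Hypothetical helper for illustration only.
    """
    pool = re.search(r"EsThreadPoolExecutor\[(\w+),", message)
    counters = {
        key: int(value)
        for key, value in re.findall(
            r"(queue capacity|pool size|active threads|queued tasks) = (\d+)",
            message,
        )
    }
    return {"pool": pool.group(1) if pool else None, **counters}


info = parse_rejection(MESSAGE)

# The pool is saturated when every thread is busy and the queue is full.
saturated = (
    info["active threads"] == info["pool size"]
    and info["queued tasks"] >= info["queue capacity"]
)
print(info, "saturated:", saturated)
```

Read this way, the message says that all 8 bulk threads were busy and all 50 queue slots were taken on that node when the request arrived, so the node rejected it with a 429.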

(Guillermo Ortiz) #2

I changed the number of executors to three, with one core each, and after 10 hours I got the same error.

Any idea?

(Costin Leau) #3

It looks like you are actually using ES-Hadoop.

Unfortunately ES seems to be falling behind - slowly but surely. In this case, it seems to take about 10h. You could, of course, use 2 executors; however, I strongly recommend monitoring the cluster to understand what the cause is:

a. does the cluster run out of memory, with the GC causing the nodes to slow down?
b. based on your initial error, it looks like the queue grows too big - this means indexing is starting to slow down; maybe the disks are too slow?
c. are any other processes stealing CPU and IO?
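One way to watch point (b) over those 10 hours is to poll the node stats API and track the bulk pool's queue and rejection counters. A rough sketch, assuming the cluster is reachable over HTTP and still names the pool `bulk` as in the error; the helper names and the sample numbers below are made up for illustration, not measurements from this cluster.

```python
import json
import urllib.request


def fetch_thread_pool_stats(base_url):
    """Call GET /_nodes/stats/thread_pool and return the decoded JSON body."""
    with urllib.request.urlopen(base_url + "/_nodes/stats/thread_pool") as resp:
        return json.load(resp)


def bulk_pressure(stats, pool="bulk"):
    """Return {node name: (active, queue, rejected)} for the given pool."""
    summary = {}
    for node in stats.get("nodes", {}).values():
        counters = node.get("thread_pool", {}).get(pool)
        if counters is not None:
            summary[node.get("name", "?")] = (
                counters["active"],
                counters["queue"],
                counters["rejected"],
            )
    return summary


# Made-up sample shaped like the API response (illustrative values only).
sample = {
    "nodes": {
        "id-1": {
            "name": "node-a",
            "thread_pool": {"bulk": {"active": 8, "queue": 50, "rejected": 12}},
        },
        "id-2": {
            "name": "node-b",
            "thread_pool": {"bulk": {"active": 3, "queue": 0, "rejected": 0}},
        },
    }
}

pressure = bulk_pressure(sample)
print(pressure)
```

Against a live cluster you would call `bulk_pressure(fetch_thread_pool_stats("http://xxx:9200"))` every minute or so and log the result; a `rejected` counter that starts climbing, or a queue that stays near capacity on one node only, tells you when and where things begin to slow down.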

Considering the long time - 10h - maybe there are some external processes (crontab, antivirus, etc.) that kick in while you are ingesting data and cause the OS to slow down, which in turn affects ES, which ends up rejecting requests and thus aborting the job.
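If the slowdown turns out to be temporary, the connector's batch and retry settings may help the job ride it out instead of aborting. A sketch of the relevant ES-Hadoop keys as a plain dict; the values are illustrative guesses rather than recommendations, `xxx:9200` is the placeholder host from the error, and the helper names are hypothetical.

```python
def es_write_settings(nodes, batch_entries=500, retry_count=6, retry_wait_s=30):
    """Build ES-Hadoop write settings as a dict of strings.

    Smaller batches put less work into each bulk request; more retries with
    a longer wait give a busy node time to drain its queue.
    """
    return {
        "es.nodes": nodes,
        "es.batch.size.entries": str(batch_entries),
        "es.batch.write.retry.count": str(retry_count),
        "es.batch.write.retry.wait": f"{retry_wait_s}s",
    }


def max_retry_window_s(settings):
    """Approximate time a task keeps retrying a rejected batch.

    Assumes the wait between retries is constant.
    """
    count = int(settings["es.batch.write.retry.count"])
    wait = int(settings["es.batch.write.retry.wait"].rstrip("s"))
    return count * wait


settings = es_write_settings("xxx:9200")
print(settings)
print("retry window (s):", max_retry_window_s(settings))
```

These keys go into whatever configuration you already pass to the connector for the write. This only buys time, though: if ES keeps falling behind, the retries will eventually run out too, so it is not a substitute for finding the cause.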
