Bulk API Rejected records but I can't see a reason

Hi folks, I am trying to index a fairly high-volume log source, and am using the ElasticNEST client in a C# app to do so. On most of my batches, I am getting an Invalid NEST response error.

As far as I can see, the error is always CausedBy: "", which is not particularly enlightening. There are null values (strings) in my objects, but I would have thought that was fine?

Invalid NEST response built from a successful low level call on POST: /powermta-general-2017.03.13/_bulk
# Invalid Bulk items:
  operation[0]: index returned 429 _index: powermta-general-2017.03.13 _type: object _id: AVrIPNQBs9B18ytIfFDC _version: 0 error: Type: es_rejected_execution_exception Reason: "rejected execution of org.elasticsearch.transport.TransportService$7@4582ee19 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@46b4acda[Running, pool size = 2, active threads = 2, queued tasks = 50, completed tasks = 762653]]" CausedBy: ""
# Audit trail of this API call:
 - [1] PingSuccess: Node: https://cloudclusternode.eu-west-1.aws.found.io:9243/ Took: 00:00:00.2298987
 - [2] HealthyResponse: Node: https://cloudclusternode.eu-west-1.aws.found.io:9243/ Took: 00:00:00.2854329
# Request:
{"index":{"_type":"object","_id":null}}
{"beat":{"hostname":"MARKDESKTOP","filename":"\\\\ca-mta01\\e$\\logs\\process\\acct-2017-03-13-0199.csv"},"dlv":{"destinationIp":"1.2.3.4","esmtpAvailable":"ENHANCEDSTATUSCODES,PIPELINING,8BITMIME,SIZE,DELIVERBY,STARTTLS","size":0,"sourceIp":"37.221.216.177","type":"smtp"},"dsn":{"action":"relayed","diag":"smtp;250 2.0.0 v2DEgCup011544 Message accepted for delivery","diagNormalized":"250 2.0.0 v2DEgCup011544 Message accepted for delivery","diagCode":250,"mta":{"name":"al-ip4-mx-vip2.mailserver.net","address":"144.160.235.144","provider":"other","domain":"mailserver.net"},"status":"2.0.0 (success)","statusCode":200},"header":{},"job":{"name":"emaildomainlz_46","id":0,"instance":"emaildomain","campaignId":46},"src":{"type":"api"},"recordType":{"code":"d","name":"delivered"},"vmta":{"name":"emaildomainlz"},"message":"d,2017-03-13 14:42:22+0000,2017-03-13 14:41:18+0000,emaildomainlz.F.6495.46.0@emaildomain.sgml2.com,jmraft@destinationdomain.net,,relayed,2.0.0 (success),smtp;250 2.0.0 v2DEgCup011544 Message accepted for delivery,al-ip4-mx-vip2.mailserver.net (144.160.235.144),,api,,smtp,37.221.216.177,144.160.235.144,\"ENHANCEDSTATUSCODES,PIPELINING,8BITMIME,SIZE,DELIVERBY,STARTTLS\",47489,emaildomainlz,emaildomainlz_46,,destinationdomain.net/emaildomainlz,,,,,,,,,,,,,,","orig":{"address":"emaildomainlz.f.6495.46.0@emaildomain.com","domain":"emaildomain.com"},"queue":"destinationdomain.net/emaildomainlz","rcpt":{"address":"user@destinationdomain.net","domain":"destinationdomain.net"},"timeLogged":"2017-03-13T14:42:22+00:00","timeQueued":"2017-03-13T14:41:18+00:00"}

# Response:
{"took":0,"errors":true,"items":[{"index":{"_index":"powermta-general-2017.03.13","_type":"object","_id":"AVrIPNQBs9B18ytIfFDC","status":429,"error":{"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TransportService$7@4582ee19 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@46b4acda[Running, pool size = 2, active threads = 2, queued tasks = 50, completed tasks = 762653]]"}}}]}

(Currently using NEST v5.2.0 and Elasticsearch.net v5.2.0)

Aw, this just got weird.

So I thought I'd run this through manually on my dev machine to see what's going on. Same thing happens, errors on every batch, even if batch size is 1.

I started Fiddler to see if I could maybe look at the requests and see if I'm doing anything weird.

Everything started working.

It runs fine if I have Fiddler open, and fails as soon as I close it. Starts working again as soon as I start up Fiddler.

I have tried it again now with a single object using Index(), and I get (possibly) a bit more information:

The interesting bit is:

# OriginalException: System.Net.WebException: The remote server returned an error: (500) Internal Server Error.
   at System.Net.HttpWebRequest.GetResponse()

and

"Failed to acknowledge mapping update within [30s]"}

The full message response is:

Invalid NEST response built from a unsuccessful low level call on POST: /powermta-2017.03.13/account
# Audit trail of this API call:
 - [1] PingSuccess: Node: https://cloudclusternode.eu-west-1.aws.found.io:9243/ Took: 00:00:02.5455519
 - [2] BadResponse: Node: https://cloudclusternode.eu-west-1.aws.found.io:9243/ Took: 00:00:30.5482281
# ServerError: ServerError: 500Type: timeout_exception Reason: "Failed to acknowledge mapping update within [30s]"
# OriginalException: System.Net.WebException: The remote server returned an error: (500) Internal Server Error.
   at System.Net.HttpWebRequest.GetResponse()
   at Elasticsearch.Net.HttpConnection.Request[TReturn](RequestData requestData) in C:\code\elasticsearch-net\src\Elasticsearch.Net\Connection\HttpConnection.cs:line 145
# Request:
{"beat":{"hostname":"MARKDESKTOP","filename":"\\\\servername\\logpath\\process\\acct-2017-03-13-0199.csv"},"dlv":{"size":0},"dsn":{"diagCode":0,"statusCode":0},"header":{},"job":{"name":"customer_1321","id":0,"instance":"customer","campaignId":1321},"src":{"type":"api"},"recordType":{"code":"r","name":"receipt"},"vmta":{"name":"customer"},"message":"r,2017-03-13 14:45:28+0000,,customerlz.F.63389.1321.0@mail.customer.co.uk,carl@customer.co.uk,,,,,,,api,,,,,,,customer,customer_1321,,,,,,,,,,,,,,,,","orig":{"address":"customerlz.f.63389.1321.0@mail.customer.co.uk","domain":"mail.customer.co.uk"},"rcpt":{"address":"carl@customer.co.uk","domain":"customer.co.uk"},"timeLogged":"2017-03-13T14:45:28+00:00","timeQueued":"0001-01-01T00:00:00"}
# Response:
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[instance-0000000013][10.83.21.157:19132][indices:data/write/index[p]]"}],"type":"timeout_exception","reason":"Failed to acknowledge mapping update within [30s]"},"status":500}

Hi, The first error you posted is due to the bulk queue filling up. If you look at this part of the response it tells you that the queue is full

Running, pool size = 2, active threads = 2, queued tasks = 50, completed tasks = 76265
Also, Http status code 429 means "Too many requests". This happens when elasticsearch is unable to process incoming requests. There is a bulk queue size of 50 by default. The easiest option is to upgrade your machine, adding more CPU will probably help.

If you want to check the status of the bulk queue, you can use the cat-thread-pool API

I have no idea what your other error message means, about the "acknowledge mapping update"

1 Like

This is an elastic cloud cluster, I have no control over the number of CPU.

Where did you get the 429 status from? I'm not seeing that anywere.

You can increase the size of the cluster in elastic cloud, which will increase the capacity and thus most likely make this error disappear. The other option is to continuously monitor bulk thread queue while indexing. I bet you can find 429 in the first log message if you look closely :slight_smile:

Aah, I thought you meant an HTTP response.

The answer is always "just increase the cluster" as if money isn't actually relevant to anyone. As it stands right now though, I can't actually push enough data into the cluster in a month to fill the quota, hence my suspicion that something odd is going on. (And we're not indexing any weird or huge documents here)

It's a little odd, I have now turned off all heavy ingress, and it still takes about 30 seconds for _cat/thread_pool to return, and the queue is always at 50

node_name           name                active queue rejected
instance-0000000013 bulk                     2    50   192074

I'm going to reboot the cluster I think, and see if that clears things up.

What does CPU usage look like during indexing? What is the current size of your cluster? What indexing rate are you seeing? What is the average size of your events once transformed into JSON?

So the cluster just came back from the restart, and seems to be running nicely. It is an 8Gb RAM/192Gb storage cluster, unspecified CPU. I have just started up my main data source, which has a fair bit of backup to get through, and it is indexing between 5,000 and 7,000 documents per second. CPU utilisation now that initial surge is over with seems to be sitting at about 80%. The document size as stored in ES (I'm not sure what the JSON size is) is about 1.2Kb. I got that from today's index being 572MB, and having 470k documents in it.

All I can think of is that the single non-beat ingest I have to our cluster (which I created) is not particularly smart yet, so no doubt has been ignoring the 429 'busy' status that you pointed out, resulting no doubt in continued spamming. I will have to watch for those, and enter some sort of back-off when they are seen.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.