Elasticsearch parallel bulk using Python - issue with json

Hi All -

I am a newbie with Elasticsearch and I am running into a strange issue.
Specifications: I have a JSON file of size 0.5 GB,
and I am using Python 3.6 and Elasticsearch 6.3.
I am using the parallel_bulk helper.

code:

from collections import deque
from elasticsearch import helpers, TransportError

try:
    deque(helpers.parallel_bulk(es, read_json(filename), request_timeout=60,
                                raise_on_error=True, raise_on_exception=True), maxlen=0)
except TransportError as e:
    print(next(read_json(filename)))  # re-reads the file and prints its first doc, not the one that failed

Issue #1: I am getting a message saying:
POST https://XXXXXXXXXXXXXXXXXX/_bulk [status:413 request:192.868s]

An exception is raised and the job fails, skipping/missing some of the data.

How can I handle this programmatically?
How can I print out / redirect the records that are getting dropped?

Issue #2:
When I use another, bigger JSON file of 2 GB (larger than the previous one, but in the exact same format), it does not throw any exceptions and inserts everything.

Am I missing something here? Not sure what the issue is.

Any thoughts? I really appreciate your time and help.

So, the format of the two JSON files may be the same, but the content obviously isn't. Elasticsearch is choking on document data from the smaller file because the data is "too large" (status code 413 is REQUEST_ENTITY_TOO_LARGE).

To find out which record it is, you might be able to just do a quick look through the file to find the longest line(s), perhaps? (I'm not sure how they're stored, but I assume a doc per line).
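If it really is one doc per line (NDJSON-style), a quick and dirty scan like this would surface the biggest ones. This is just a sketch under that assumption; largest_lines is a throwaway helper name, and filename is whatever path you already pass to read_json():

def largest_lines(filename, top_n=5):
    # Collect (byte size, line number) for every line, largest first.
    sizes = []
    with open(filename, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            sizes.append((len(line.encode("utf-8")), lineno))
    return sorted(sizes, reverse=True)[:top_n]

for size, lineno in largest_lines(filename):
    print(size, "bytes on line", lineno)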

Otherwise, you could just avoid the convenience of parallel_bulk(...) and code it up yourself, thereby finding which line blows it up.
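If you don't want to drop all the way down to raw es.bulk() calls, a halfway option is helpers.streaming_bulk, which yields an (ok, result) pair per action so you can collect the failures instead of losing them. A rough sketch, reusing es, read_json and filename from your snippet (failed_docs.json is just an example output path, and I believe raise_on_exception=False makes transport-level errors like the 413 come back as failed items rather than exceptions):

import json
from elasticsearch import helpers

failed = []
for ok, result in helpers.streaming_bulk(
        es,
        read_json(filename),
        raise_on_error=False,       # keep going past per-document failures
        raise_on_exception=False,   # report transport errors (e.g. the 413) as failed items
        request_timeout=60):
    if not ok:
        failed.append(result)

# Redirect the dropped records to a file for later inspection.
with open("failed_docs.json", "w") as out:
    for item in failed:
        out.write(json.dumps(item) + "\n")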

Or, you could set a debug breakpoint in the Python code where the exception is caught to see the doc.
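Something as small as this, adapted from your own try/except, would do; if I remember the client right, the TransportError carries status_code and info with the HTTP status and response body:

from collections import deque
from elasticsearch import helpers, TransportError

try:
    deque(helpers.parallel_bulk(es, read_json(filename), request_timeout=60), maxlen=0)
except TransportError as e:
    # Drop into the debugger to poke at e.status_code and e.info interactively.
    import pdb; pdb.set_trace()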

Or, you might be able to modify the Elasticsearch configuration to allow larger document payloads via REST.
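If it helps, the setting involved is http.max_content_length in elasticsearch.yml (it defaults to 100mb, if I recall correctly). The flip side is to leave the server alone and make the client send smaller _bulk requests: the bulk helpers accept chunk_size and max_chunk_bytes. Roughly like this (the values are guesses you would tune for your data):

from collections import deque
from elasticsearch import helpers

deque(
    helpers.parallel_bulk(
        es,
        read_json(filename),
        chunk_size=200,                    # docs per _bulk request (library default is 500)
        max_chunk_bytes=10 * 1024 * 1024,  # cap each request at ~10 MB (default is ~100 MB)
        request_timeout=60),
    maxlen=0)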

Hope this helps.
