A very interesting puzzle.
I presume you are just indexing it as a text field, using the standard analyzer? In that case the entire (large) field has to be tokenized at index time, which would be a little time-consuming.
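You can confirm that presumption by pulling the mapping back from the cluster; a minimal check, assuming the elasticsearch-py client, a local cluster, and a placeholder index name:

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch("http://localhost:9200")
>>> es.indices.get_mapping(index="my-index")  # a "text" field with no explicit analyzer uses the standard analyzer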
Are there other (similar) docs in the same index that have indexed fine?
I'd say there are three main possibilities here:
- the doc is actually broken in some subtle way that is not obvious
- Elasticsearch won't index it for reasons currently unknown, even though it is completely valid JSON - i.e. some hidden limit we don't know about is being exceeded
- something else along the way, e.g. some internal library, is barfing
I'd maybe also try to attack the problem iteratively, something like:
- index the doc without the large text field, and make sure that's OK
- index the doc with the first 1000 characters of the large text field, and make sure that's OK
- index the doc with the first 2000 characters of the large text field, and make sure that's OK
- index the doc with the first 3000 characters of the large text field, and make sure that's OK
- ...
and see when it starts failing and if that gives a clue as to why.
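Here is a rough sketch of that loop, assuming the elasticsearch-py 8.x client; the index name, file name, and field name ("my-index", "big_doc.json", "big_text") are placeholders for your actual setup:

import json
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Load the problem document and pull out the very large field.
with open("big_doc.json") as f:
    doc = json.load(f)
big = doc["big_text"]

# Try indexing ever-longer prefixes of the large field, 1000 chars at a time,
# and report the first length at which indexing fails.
for n in range(0, len(big) + 1, 1000):
    trial = dict(doc, big_text=big[:n])
    try:
        es.index(index="my-index", id="probe", document=trial)
        print("OK at", n, "characters")
    except Exception as e:
        print("failed at", n, "characters:", e)
        break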
Last idea: feed the very long string into some other little checker, to make sure there's nothing funny in it that might be breaking things.
e.g. from Python:
>>> from collections import Counter
>>> counter = Counter("Hello I am having trouble diagnosing and resolving a large document upload issue...")
>>> counter
Counter({' ': 12, 'a': 7, 'e': 6, 'l': 6, 'o': 6, 'n': 6, 'i': 5, 'g': 5, 'u': 4, 'd': 4, 's': 4, 'r': 3, '.': 3, 'm': 2, 'v': 2, 't': 2, 'H': 1, 'I': 1, 'h': 1, 'b': 1, 'c': 1, 'p': 1})
or
>>> from collections import Counter
>>> with open('/usr/share/dict/words', 'r') as file:
...     data = file.read().replace('\n', ' ')
...
>>> counter = Counter(data)
>>> counter
Counter({' ': 101924, 's': 93865, 'e': 90061, 'i': 68049, 'a': 65452, 'n': 58114, 'r': 58112, 't': 53122, 'o': 50054, 'l': 41173, 'c': 31031, "'": 29061, 'd': 28253, 'u': 26969, 'g': 22356, 'p': 21601, 'm': 21435, 'h': 19267, 'b': 14625, 'y': 12870, 'f': 10321, 'k': 8155, 'v': 7803, 'w': 7237, 'x': 2137, 'z': 2014, 'M': 1778, 'S': 1676, 'C': 1646, 'q': 1501, 'B': 1499, 'j': 1481, 'A': 1450, 'P': 1063, 'L': 948, 'H': 929, 'T': 912, 'D': 862, 'G': 852, 'R': 786, 'K': 682, 'E': 657, 'N': 591, 'J': 560, 'F': 544, 'W': 539, 'O': 413, 'I': 365, 'V': 356, '\xc3': 271, 'Z': 161, 'Y': 154, '\xa9': 148, 'U': 140, 'Q': 82, 'X': 46, '\xa8': 29, '\xb6': 17, '\xbc': 14, '\xa1': 12, '\xb3': 10, '\xb1': 8, '\xa2': 6, '\xaa': 6, '\xa7': 5, '\xa4': 4, '\xa5': 3, '\xbb': 3, '\x85': 2, '\xad': 2, '\xb4': 2})
(I didn't know there were accented characters in /usr/share/dict/words!)
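Incidentally, those '\xc3...' entries are UTF-8 multi-byte sequences counted one byte at a time, because the transcript above reads the file as raw bytes (Python 2 behaviour). Decoding a byte pair shows the actual character, e.g. in Python 3:

>>> b'\xc3\xa9'.decode('utf-8')  # the '\xc3' + '\xa9' pair from the counts above
'é'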