502 error when indexing a large (?) document

Hello, I am having trouble diagnosing and resolving a large document upload issue...

I'm using Elasticsearch SaaS, and there is a document which is approximately 42MB. When my software attempts to index it, I get this error:

"{Invalid NEST response built from a unsuccessful (502) low level call on PUT: /document/_doc/GyIAAERPFiO3CbRqgqB8ojQEGImqU4GmI_r4f5xNqLBLoAg3}"

I'm not sure where to go from here. I would have thought there would be more details in a log somewhere, but I'm not sure where to find those in Elasticsearch SaaS?

Here is some more detail from the software side of things...

# FailureReason: BadResponse while attempting PUT on https://da1166f5fabb431e83cab7c7aec3cb0a.australiaeast.azure.elastic-cloud.com/document/_doc/GyIAAERPFiO3CbRqgqB8ojQEGImqU4GmI_r4f5xNqLBLoAg3
# Audit trail of this API call:
 - [1] BadResponse: Node: https://da1166f5fabb431e83cab7c7aec3cb0a.australiaeast.azure.elastic-cloud.com/ Took: 00:01:24.8985417
# OriginalException: Elasticsearch.Net.ElasticsearchClientException: Request failed to execute. Call: Status code 502 from: PUT /document/_doc/GyIAAERPFiO3CbRqgqB8ojQEGImqU4GmI_r4f5xNqLBLoAg3
# Request:
{"uid":"GyIAAERPFiO3CbRqgqB8ojQEGImqU4GmI_r4f5xNqL...snip...

Hi @catmanjan ,

What version of Elasticsearch is running on your server, and what version of NEST/Elastic.Clients.Elasticsearch are you using?

If you check the complete request, is the document serialized correctly and valid JSON?
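A quick way to check is to dump the captured request body to a file and run it through a JSON parser - something like this (the file name is just a placeholder):

import json

# Placeholder file name - the request body captured from the client's debug output
with open("request_body.json", "r", encoding="utf-8") as f:
    raw = f.read()

try:
    doc = json.loads(raw)
    size_mb = len(raw.encode("utf-8")) / (1024 * 1024)
    print(f"Valid JSON, {len(doc)} top-level fields, {size_mb:.1f} MB")
except json.JSONDecodeError as e:
    print(f"Invalid JSON at line {e.lineno}, column {e.colno}: {e.msg}")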

Elastic Cloud 8.14.1, NEST 7.17.5

Yes, it is correct and valid JSON.

Any idea where I go from here?

Is there anything in the Elasticsearch log?

I can't see anything relevant, but I'm not actually sure if this is the right log... do you know if I should be looking somewhere else?

Well, that's 1-20 (and I can only see 1) of 500+ log entries in the last 24 hours. You need to check for something that might be relevant.

By the way, a 42MB JSON document? How many fields/values are there in such a document? That's a lot of stuff for a single document; I'd say unusually large.

Wild idea - spin up a local (say, on your laptop) instance of Elasticsearch and see if you can index the same document, and if not, what the elasticsearch.log says there.
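If you go that route, the local test could be as small as something like this - assuming the 8.x Python client, a local node with security disabled, and the document saved to a file (all names are placeholders):

import json
from elasticsearch import Elasticsearch

# Local single-node cluster, security disabled; file name is a placeholder
es = Elasticsearch("http://localhost:9200")

with open("large_document.json", "r", encoding="utf-8") as f:
    doc = json.load(f)

try:
    resp = es.index(index="document", id=doc["uid"], document=doc)
    print("indexed:", resp["result"])
except Exception as e:
    # whatever fails here should also leave a trace in the local elasticsearch.log
    print("failed:", e)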

Yes, I tried to find a relevant log, but they are all similar - no specific errors related to indexing.

The document is only a dozen fields, but one of them is the extracted text content of a large file...

I'll try it locally

A very interesting puzzle.

I presume you are just indexing it as a text field, using just the standard analyzer? So at index time the entire (large) field has to be tokenized, which would be a little time-consuming.

Are there other (similar) docs in the same index that have indexed fine?
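If you want to double-check the mapping, something like this would show what that field actually got (the index name is from the failing PUT above; the endpoint, API key, and the field name "content" are placeholders/guesses):

from elasticsearch import Elasticsearch

# Endpoint and API key are placeholders; "content" is a guess at the big field's name
es = Elasticsearch("https://<your-deployment>.australiaeast.azure.elastic-cloud.com",
                   api_key="<api-key>")

mapping = es.indices.get_mapping(index="document")
properties = mapping["document"]["mappings"]["properties"]
print(properties.get("content"))  # e.g. {'type': 'text'} means the standard analyzer by default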

I'd say there are 3 main possibilities here:

  1. the doc is actually broken in some subtle way that is not obvious
  2. elasticsearch won't index it for reasons currently unknown, even though it is completely valid JSON - i.e. some hidden limit is exceeded which we don't know about
  3. something else in the various bits, e.g. some internal library, is barfing

I'd maybe also try to attack the problem iteratively, something like:

  • index the doc without the large text field, make sure that's OK
  • index the doc with the first 1000 characters of the large text field, make sure that is OK
  • index the doc with the first 2000 characters of the large text field, make sure that is OK
  • index the doc with the first 3000 characters of the large text field, make sure that is OK
  • ...

and see when it starts failing and whether that gives a clue as to why - a rough sketch of that loop is below.
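A rough sketch of that loop, assuming the 8.x Python client against a local node and guessing the large field is called "content" (adjust the field name and step size to your real document):

import json
from elasticsearch import Elasticsearch

# Assumes a local node; "content" and the file name are placeholders
es = Elasticsearch("http://localhost:9200")

with open("large_document.json", "r", encoding="utf-8") as f:
    doc = json.load(f)

full_text = doc["content"]
step = 1_000_000  # grow by ~1M characters per attempt; shrink the step to narrow it down

for length in range(0, len(full_text) + step, step):
    trial = dict(doc)
    trial["content"] = full_text[:length]  # length 0 is roughly "without the large field"
    try:
        es.index(index="document-test", id=doc["uid"], document=trial)
        print(f"OK with {min(length, len(full_text))} characters")
    except Exception as e:
        print(f"failed at {length} characters: {e}")
        break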

Last idea - feed the very long string into some other little checker to make sure there's nothing funny within it that might be breaking things.

e.g. from python

>>> from collections import Counter
>>> counter = Counter("Hello I am having trouble diagnosing and resolving a large document upload issue...")
>>> counter
Counter({' ': 12, 'a': 7, 'e': 6, 'l': 6, 'o': 6, 'n': 6, 'i': 5, 'g': 5, 'u': 4, 'd': 4, 's': 4, 'r': 3, '.': 3, 'm': 2, 'v': 2, 't': 2, 'H': 1, 'I': 1, 'h': 1, 'b': 1, 'c': 1, 'p': 1})

or

>>> from collections import Counter
>>> with open('/usr/share/dict/words', 'r') as file:
...     data = file.read().replace('\n', ' ')
...
>>> counter = Counter(data)
>>> counter
Counter({' ': 101924, 's': 93865, 'e': 90061, 'i': 68049, 'a': 65452, 'n': 58114, 'r': 58112, 't': 53122, 'o': 50054, 'l': 41173, 'c': 31031, "'": 29061, 'd': 28253, 'u': 26969, 'g': 22356, 'p': 21601, 'm': 21435, 'h': 19267, 'b': 14625, 'y': 12870, 'f': 10321, 'k': 8155, 'v': 7803, 'w': 7237, 'x': 2137, 'z': 2014, 'M': 1778, 'S': 1676, 'C': 1646, 'q': 1501, 'B': 1499, 'j': 1481, 'A': 1450, 'P': 1063, 'L': 948, 'H': 929, 'T': 912, 'D': 862, 'G': 852, 'R': 786, 'K': 682, 'E': 657, 'N': 591, 'J': 560, 'F': 544, 'W': 539, 'O': 413, 'I': 365, 'V': 356, '\xc3': 271, 'Z': 161, 'Y': 154, '\xa9': 148, 'U': 140, 'Q': 82, 'X': 46, '\xa8': 29, '\xb6': 17, '\xbc': 14, '\xa1': 12, '\xb3': 10, '\xb1': 8, '\xa2': 6, '\xaa': 6, '\xa7': 5, '\xa4': 4, '\xa5': 3, '\xbb': 3, '\x85': 2, '\xad': 2, '\xb4': 2})

(I didn't know there were some umlauts in /usr/share/dict/words!)

Okay, I think I have some more useful information - it turns out the document is related to GPS coordinates. If I shrink it (as RainTown suggested) to below 32MB, it works fine.

If it is larger than 32MB, it seems to crash (?) Elasticsearch in the cloud, and I get a geoip-related error message shortly after it recovers.
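For reference, a quick way to see how big the serialized document and each string field really are (the file name is a placeholder, and ~32MB is just where it started failing for me, not a documented limit):

import json

# Placeholder file name - the document as it is sent to Elasticsearch
with open("large_document.json", "r", encoding="utf-8") as f:
    doc = json.load(f)

body = json.dumps(doc).encode("utf-8")
print(f"serialized size: {len(body) / (1024 * 1024):.1f} MB")
for field, value in doc.items():
    if isinstance(value, str):
        print(f"  {field}: {len(value.encode('utf-8')) / (1024 * 1024):.1f} MB")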

Could I ask someone at Elasticsearch to log a ticket about this?

In the meantime, is there a way to tell Elasticsearch not to attempt doing whatever it is doing with the GPS coordinates?

I've uploaded the file here: large document - original.TXT

"The document is only a dozen fields, but one of them is the extracted text content of a large file..."

"turns out the document is related to GPS coordinates"

Er, that's quite different. :wink:

Well, if I were you, I'd open a ticket, as a crash is not right. Someone in support can tell you if what you are trying to do is supported, and maybe some 32MB limit is documented somewhere.

But I'd personally need some convincing this is a good design, so in parallel you should have a little think about that too. IMHO.

Unfortunately I don't get to choose the contents of the documents - it's just all the documents in the organisation; the text is extracted with Tika and uploaded to Elastic...

(This is not a technical answer)

It depends on how you consider your role, to a large extent what leverage you have, and what level of responsibility you have over the overall solution.

If Elastic tell you "well, that's not supported" for whatever reason, or "OK, it's a bug, we will open a bug for it, and that might get fixed in N months", what are you going to do?

I think of myself as a bit of a chef - I could only do so much with terrible ingredients. And I really, really want my food to taste good, and I mean actually taste good, not "well, as good as it can be, given the ingredients I was given". :slight_smile: