A very interesting puzzle.
I presume you are just indexing it as a text field, using the standard analyzer? In that case the entire (large) field has to be tokenized at index time, which would be a little time-consuming.
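You can confirm that presumption by pulling the mapping back from the cluster; a minimal check, assuming the elasticsearch-py client, a local cluster, and a placeholder index name:

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch("http://localhost:9200")
>>> es.indices.get_mapping(index="my-index")  # a "text" field with no explicit analyzer uses the standard analyzer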
Are there other (similar) docs in the same index that have indexed fine?
I'd say there are three main possibilities here:
- the doc is actually broken in some subtle way that is not obvious
- Elasticsearch won't index it for reasons currently unknown, even though it is completely valid JSON - i.e. some hidden limit we don't know about is being exceeded
- something else along the way, e.g. some internal library, is barfing
I'd maybe also try to attack the problem iteratively, something like:
- index the doc without the large text field, and make sure that's OK
- index the doc with the first 1000 characters of the large text field, and make sure that's OK
- index the doc with the first 2000 characters of the large text field, and make sure that's OK
- index the doc with the first 3000 characters of the large text field, and make sure that's OK
- ...
and see when it starts failing and if that gives a clue as to why.
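Here is a rough sketch of that loop, assuming the elasticsearch-py 8.x client; the index name, file name, and field name ("my-index", "big_doc.json", "big_text") are placeholders for your actual setup:

import json
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Load the problem document and pull out the very large field.
with open("big_doc.json") as f:
    doc = json.load(f)
big = doc["big_text"]

# Try indexing ever-longer prefixes of the large field, 1000 chars at a time,
# and report the first length at which indexing fails.
for n in range(0, len(big) + 1, 1000):
    trial = dict(doc, big_text=big[:n])
    try:
        es.index(index="my-index", id="probe", document=trial)
        print("OK at", n, "characters")
    except Exception as e:
        print("failed at", n, "characters:", e)
        break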
Last idea: feed the very long string into some other little checker, to make sure there's nothing funny in it that might be breaking things.
e.g. from Python:
>>> from collections import Counter
>>> counter = Counter("Hello I am having trouble diagnosing and resolving a large document upload issue...")
>>> counter
Counter({' ': 12, 'a': 7, 'e': 6, 'l': 6, 'o': 6, 'n': 6, 'i': 5, 'g': 5, 'u': 4, 'd': 4, 's': 4, 'r': 3, '.': 3, 'm': 2, 'v': 2, 't': 2, 'H': 1, 'I': 1, 'h': 1, 'b': 1, 'c': 1, 'p': 1})
or
>>> from collections import Counter
>>> with open('/usr/share/dict/words', 'r') as file:
...     data = file.read().replace('\n', ' ')
...
>>> counter = Counter(data)
>>> counter
Counter({' ': 101924, 's': 93865, 'e': 90061, 'i': 68049, 'a': 65452, 'n': 58114, 'r': 58112, 't': 53122, 'o': 50054, 'l': 41173, 'c': 31031, "'": 29061, 'd': 28253, 'u': 26969, 'g': 22356, 'p': 21601, 'm': 21435, 'h': 19267, 'b': 14625, 'y': 12870, 'f': 10321, 'k': 8155, 'v': 7803, 'w': 7237, 'x': 2137, 'z': 2014, 'M': 1778, 'S': 1676, 'C': 1646, 'q': 1501, 'B': 1499, 'j': 1481, 'A': 1450, 'P': 1063, 'L': 948, 'H': 929, 'T': 912, 'D': 862, 'G': 852, 'R': 786, 'K': 682, 'E': 657, 'N': 591, 'J': 560, 'F': 544, 'W': 539, 'O': 413, 'I': 365, 'V': 356, '\xc3': 271, 'Z': 161, 'Y': 154, '\xa9': 148, 'U': 140, 'Q': 82, 'X': 46, '\xa8': 29, '\xb6': 17, '\xbc': 14, '\xa1': 12, '\xb3': 10, '\xb1': 8, '\xa2': 6, '\xaa': 6, '\xa7': 5, '\xa4': 4, '\xa5': 3, '\xbb': 3, '\x85': 2, '\xad': 2, '\xb4': 2})
(I didn't know there were accented characters in /usr/share/dict/words!)
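Incidentally, those '\xc3...' entries are UTF-8 multi-byte sequences counted one byte at a time, because the transcript above reads the file as raw bytes (Python 2 behaviour). Decoding a byte pair shows the actual character, e.g. in Python 3:

>>> b'\xc3\xa9'.decode('utf-8')  # the '\xc3' + '\xa9' pair from the counts above
'é'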