Append to an existing field (String)

bobsmith · July 26, 2019, 4:34pm

I have to be able to index very large files, and part of that includes indexing their content. For the sake of this post, let's assume all of my files are just .txt files. If my file is 1GB, I will have trouble indexing it in its entirety into Elasticsearch. I have explored the bulk API and have also seen documentation on appending to arrays. However, in my case, my field is a string and not an array. The way I envision this (and I am trying to determine if this is even possible) would be to break up my file's content into smaller chunks and use the bulk API to do some sort of insert on the same document each time, appending to the content field of that document. The only other way I see to do this is to have multiple items that all contain the ID of the particular file, each containing a different section of its actual content. If I understand correctly, Elasticsearch should be able to handle 1GB+ of content for a particular indexed document, but how I can get all of this data uploaded into Elasticsearch for one field remains a mystery to me.

Mark_Harwood · July 26, 2019, 4:41pm

I wouldn't recommend having a 1GB JSON document.
Using multiple documents, each representing a different section and having a common file ID, would seem to be the preferable option here.
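
A minimal sketch of that multi-document approach, in Python. The index name `file_chunks`, the field names `file_id`, `chunk_seq` and `content`, and the chunk size are all illustrative assumptions, not anything prescribed by Elasticsearch. The function reads the file in fixed-size pieces and yields one bulk action per piece, so the whole file never has to sit in memory at once.

```python
import io
from typing import Dict, Iterator, TextIO


def chunk_actions(stream: TextIO, file_id: str, index: str = "file_chunks",
                  chunk_chars: int = 1_000_000) -> Iterator[Dict]:
    """Yield one index action per fixed-size chunk of text read from stream.

    All names here (index, file_id, chunk_seq, content) are hypothetical.
    """
    seq = 0
    while True:
        chunk = stream.read(chunk_chars)
        if not chunk:  # empty string means end of stream
            break
        yield {
            "_index": index,
            # Deterministic ID so re-running the upload overwrites, not duplicates.
            "_id": f"{file_id}-{seq}",
            "_source": {"file_id": file_id, "chunk_seq": seq, "content": chunk},
        }
        seq += 1


if __name__ == "__main__":
    demo = io.StringIO("abcdefghij")
    for action in chunk_actions(demo, file_id="demo.txt", chunk_chars=4):
        print(action["_id"], action["_source"]["content"])
```

With the official Python client, a generator like this could be handed to `elasticsearch.helpers.bulk(client, chunk_actions(...))`. Note that fixed-size chunks can split a word or phrase across two documents; splitting on line or paragraph boundaries instead would avoid that, at the cost of uneven chunk sizes.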

bobsmith · July 26, 2019, 4:42pm

Noted, thank you. That said, it is possible for ES to handle content fields of that size, no? Is there a cap on size?

Mark_Harwood · July 26, 2019, 4:49pm

It might be possible but certainly not recommended.
The trouble with updating gigabyte-sized docs is they typically get to that size because they attract a lot of updates (e.g., adding to the payment history of a very active account).
That compounds the problem because not only is it expensive to update, it likely happens frequently. And updates aren't cheap - they essentially delete the entire previous document and recreate a whole new one, reindexing all the contents again.
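
For reference, appending to a string field on an existing document is possible with a scripted update. The sketch below only builds the request body for the Update API; the field name `content` is an assumption, and the script assumes the document and that field already exist. Each such call still triggers the delete-and-reindex of the whole document described above, so appending a 1GB file in N chunks would reindex a growing document N times.

```python
from typing import Dict


def append_update_body(chunk: str) -> Dict:
    """Build an Update API body that appends chunk to a string field.

    Assumes the target document already has a string field named "content"
    (a hypothetical name chosen for this example).
    """
    return {
        "script": {
            "lang": "painless",
            # The chunk is passed as a parameter rather than spliced into the
            # script source, so the same script text is reused on every call.
            "source": "ctx._source.content += params.chunk",
            "params": {"chunk": chunk},
        }
    }


if __name__ == "__main__":
    print(append_update_body("next piece of the file"))
```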

bobsmith · July 26, 2019, 4:53pm

I happen to be certain that in my case, once all the content for the text file is indexed, it will never be modified. It just needs to be able to handle this 1GB+ content field. Additionally the speed of this upload into ES is less important than just being able to have all the content indexed.

Mark_Harwood · July 26, 2019, 4:56pm

It's definitely pushing things to the limits.
One of the benefits of breaking into smaller docs is that it should be much faster to highlight matching sections of the original document in your search results.
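
As a rough illustration of what searching over chunked documents might look like, the sketch below builds a search request body. It assumes each chunk document has a `content` text field and a `file_id` keyword field (both hypothetical names). Highlighting then only has to process the matching chunk rather than a gigabyte of text, and `collapse` returns one top hit per file instead of one per chunk.

```python
from typing import Dict


def chunk_search_body(text: str) -> Dict:
    """Build a search body over chunk documents (field names are assumptions)."""
    return {
        # Full-text match against the chunk's text.
        "query": {"match": {"content": text}},
        # Ask for highlighted snippets from the matching chunk only.
        "highlight": {"fields": {"content": {}}},
        # Keep only the best-scoring chunk per source file; the collapse
        # field needs to be a keyword or numeric field.
        "collapse": {"field": "file_id"},
    }


if __name__ == "__main__":
    print(chunk_search_body("payment history"))
```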

bobsmith · July 26, 2019, 4:59pm

Great. Given that, I will probably break up my large files into multiple entries, each containing a chunk of the content, and give them all the same identifying data (so that they all appear to be the same file, with each entry holding a particular section of a big file).

system · August 23, 2019, 4:59pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
