I have to be able to index very large files, and part of that includes indexing their content. For the sake of this post, let's assume all of my files are just .txt files. If my file is 1GB, I will have trouble indexing it in its entirety into Elasticsearch.

I have explored the bulk API and have also seen documentation on appending to arrays; however, in my case, my field is a string, not an array. The way I envision this (and I am trying to determine whether it is even possible) would be to break up indexing my file's content into smaller chunks, and use the bulk API to do some sort of insert on the same document each time, appending to the content field of that same document on each pass.

The only other way I see to do this is to have multiple documents that all contain the ID of the particular file, but each containing a different section of its actual content. If I understand correctly, Elasticsearch should be able to handle 1GB+ of content for a particular indexed document, but how I can get all of this data uploaded into Elasticsearch for one field remains a mystery and a challenge for me.
I wouldn't recommend having a 1GB JSON document.
Using multiple documents, each representing a different section and having a common file ID, would seem to be the preferable option here.
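A minimal sketch of what the mapping for those per-section documents could look like, using the Python client. The index name `files` and the field names `file_id`, `chunk`, and `content` are all assumptions for illustration, not anything from the thread: `file_id` as a `keyword` lets you filter or group all sections of one file exactly, `chunk` preserves the original order, and `content` holds the searchable text.

```python
# Hypothetical mapping for the per-section documents.
# Index and field names are illustrative assumptions.
mapping = {
    "mappings": {
        "properties": {
            "file_id": {"type": "keyword"},   # exact-match id shared by all sections
            "chunk": {"type": "integer"},     # position of this section in the file
            "content": {"type": "text"},      # full-text-searchable section content
        }
    }
}

# Creating the index would look something like this
# (requires a running cluster and `pip install elasticsearch`):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.indices.create(index="files", body=mapping)
```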
Noted, thank you. That said, it is possible for ES to handle content fields of that size, no? Is there a cap on size?
It might be possible but certainly not recommended.
The trouble with updating gigabyte-sized docs is that they typically get to that size because they attract a lot of updates (e.g. adding to the payment history of a very active account).
That compounds the problem: not only is each update expensive, it likely happens frequently. And updates aren't cheap - under the hood, they essentially delete the entire previous document and recreate a whole new one, reindexing all the contents again.
I happen to be certain that in my case, once all the content for the text file is indexed, it will never be modified. It just needs to be able to handle this 1GB+ content field. Additionally, the speed of this upload into ES is less important than just being able to have all the content indexed.
It's definitely pushing things to the limits.
One of the benefits of breaking into smaller docs is that it should be much faster to highlight matching sections of the original document in your search results.
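To illustrate that benefit, here is a sketch of a search request with highlighting over the per-section documents. The index name `files`, the field names, and the query text are all assumptions for illustration:

```python
# Sketch: full-text search with highlighted snippets over section documents.
# Field and index names are illustrative assumptions.
query = {
    "query": {"match": {"content": "search terms"}},
    "highlight": {"fields": {"content": {}}},  # ask ES to highlight matches in `content`
}

# With a live cluster this would run as:
# resp = es.search(index="files", body=query)
# Each hit's `highlight.content` contains the matching snippets, and the
# hit's `_source.file_id` / `_source.chunk` tell you which file and which
# section of it matched.
```

Because each document is only one section, the highlighter works over a small body of text instead of scanning a gigabyte-sized field for every hit.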
Great. Given that, I will probably break my large files up into multiple documents, each containing a chunk of the content, and give them all the same identifying data, so that they all appear to belong to the same file, with each document holding a particular section of it.
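That plan can be sketched roughly as follows: read the file in fixed-size chunks and emit one bulk-API action per chunk, with a shared `file_id` and a `chunk` counter to preserve order. The chunk size, index name, and field names are assumptions for illustration; this is a sketch, not a tuned implementation.

```python
# Sketch: split a large text file into chunks and index each chunk as its
# own document sharing a file_id. Index/field names and CHUNK_SIZE are
# illustrative assumptions.
from typing import Iterator

CHUNK_SIZE = 1_000_000  # ~1M characters of text per document; tune to taste


def chunk_actions(path: str, file_id: str, index: str = "files") -> Iterator[dict]:
    """Yield one bulk-API action per chunk of the file's content."""
    with open(path, encoding="utf-8") as f:
        seq = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:  # end of file
                break
            yield {
                "_index": index,
                "_id": f"{file_id}-{seq}",  # deterministic id per section
                "_source": {
                    "file_id": file_id,  # ties all sections back to one file
                    "chunk": seq,        # preserves the original order
                    "content": chunk,
                },
            }
            seq += 1


# Usage (requires a running cluster and `pip install elasticsearch`):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# helpers.bulk(es, chunk_actions("big.txt", file_id="big.txt"))
```

Since the generator streams the file chunk by chunk, the whole 1GB never has to sit in memory at once, and `helpers.bulk` batches the actions into multiple bulk requests for you.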