Recommendation for indexing large documents (< 1 GB)

Hi ES Team,

This question may have been asked earlier. I am looking for ways to index really large documents such as PDF and Word files using ES. Our application crawls enterprise Active Directory systems, and we have hit some really large documents that we are unable to index as of now; we are seeing an OutOfMemoryError in the ES logs. I should also mention that for large files we pass a URL to a custom plugin running within the ES server. This plugin reads the entire source file into memory from the provided URL before submitting an indexing request to ES. Of course, this is not a great solution. The other option would be to chunk the file and then send it to ES, but each chunk would get indexed as a separate document, which won't work at search time. What is the suggestion from the community? Is there an API on ES which accepts the document source as a stream? I am including the cluster details and the steps we currently use to index what we refer to as loose files (PDF, DOC, etc.).
Any recommendation would be much appreciated. Much thanks,

-Priyanka

Cluster Details:
2-node cluster. Each node has 16 CPUs, 12 GB RAM, and a 250 GB HDD. 6 GB has been assigned to the ES server.

// Call made to index the document, running within the ES server via a custom plugin.
BulkRequest bulkRequest = new BulkRequest();
IndexRequest request = new IndexRequest(indexName, "some_type", id);
// "document" is the byte array of the document to index.
// This byte array tends to be really large for the large documents.
request.source(document);
bulkRequest.add(request);

m_bulkAction.execute(bulkRequest, listener); // instance of TransportBulkAction

What kind of documents are these? If it's really 1 GB of text, then it's not particularly useful as a single search result. If it's a book, why not break it up by chapter or some other meaningful subdivision of the text? You can then roll up results using a top-hits aggregation. For a book, such an aggregation would give you results that look like this:

  • Book: Relevant Search
      • Chapter 1 (most relevant chapter)
      • Chapter 5 (next most relevant chapter)

etc
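
For reference, a rough sketch of such a rollup with the Java TransportClient. The index name (library), the fields (book_id, body), and the query are made up for illustration, and the exact builder method names shift a bit between ES versions:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;

// client is an existing Client (e.g. TransportClient) instance.
// Each chapter document carries the id of the book it belongs to in "book_id".
SearchResponse response = client.prepareSearch("library")
        .setQuery(QueryBuilders.matchQuery("body", "relevant search"))
        .addAggregation(
                AggregationBuilders.terms("by_book").field("book_id")
                        .subAggregation(
                                AggregationBuilders.topHits("best_chapters").setSize(3)))
        .get();

Each bucket of by_book then corresponds to one book, with its most relevant chapters returned inside the best_chapters top hits.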

Consider a couple of other factors too: it's often not advisable to give a single JVM more than maybe 16 GB of RAM, and you're currently giving it 6 GB. Indexing docs of this size and passing them through all of Elasticsearch's and Lucene's data structures (commit logs, stored fields, inverted indices, etc.) isn't going to be easy. This isn't a use case I imagine search engines are optimized for.

All that's not to say you're definitely wrong; I just want to learn more about the nature of the documents so that you're certain this really is the right path :slight_smile:

Much thanks for your reply. We are expecting to see mostly PDF or Word documents, possibly HTML as well, though I don't recall seeing the latter. Most definitely not books. Since our customers are legal firms, enterprises (crawling their archived user folders), government agencies, etc., we cannot expect these documents to come in any particular format. Ideally we don't want to set a hard limit within our application on the size of the documents we are able to index.

There is another use case: we could also be indexing smaller files in parallel, for example files of around 50 MB with 20-30 of them being indexed at the same time. That adds up to a large volume of data, just not as a single document. We are seeing OutOfMemory errors in that scenario as well.
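
One way to cap how much document data sits on the heap at once in that parallel scenario is the Java client's BulkProcessor; a minimal sketch, assuming a 2.x-era Client instance and placeholder thresholds:

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

// Flush after a fixed number of documents or a fixed number of buffered bytes,
// and allow only one bulk request in flight at a time, so the heap never has
// to hold 20-30 large files' worth of source simultaneously.
BulkProcessor bulkProcessor = BulkProcessor.builder(client,
        new BulkProcessor.Listener() {
            @Override public void beforeBulk(long executionId, BulkRequest request) { }
            @Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) { }
            @Override public void afterBulk(long executionId, BulkRequest request, Throwable failure) { }
        })
        .setBulkActions(10)                                   // flush after 10 documents
        .setBulkSize(new ByteSizeValue(50, ByteSizeUnit.MB))  // or after ~50 MB of source
        .setConcurrentRequests(1)                             // at most one bulk in flight
        .build();

bulkProcessor.add(new IndexRequest(indexName, "some_type", id).source(document));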

I know that Lucene's field API lets a document read its characters from a java.io.Reader (and thus from an InputStream) when the document is added via IndexWriter, so the content can come from files, databases, web service calls, etc., and Lucene handles closing the stream on behalf of the caller. Would you happen to know if any similar API is exposed in Elasticsearch as a wrapper over Lucene?

// writer is an existing IndexWriter; the "..." parts are elided here.
IndexWriter writer = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(... fileInputStream ...));

Document document = new Document();
document.add(new StringField("title", fileName, Store.YES));
// TextField can take a Reader, so the body is streamed during indexing rather than held as a String.
document.add(new TextField("body", reader));

writer.addDocument(document);

If the documents were chunked into separate documents in ES, could there not be some way to search all of the chunks as one doc? In other words, flag them in some way so that the API sees them as a single doc.
I'm also looking to index very large docs; some of these would be source code searched by experts, plus huge, and I mean huge, contracts, etc.
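
One possible pattern, sketched below with made-up field names (file_id, chunk_seq) and a hypothetical splitIntoChunks helper, is to index each chunk as its own document tagged with the id of the file it came from; at query time the chunks can then be grouped back together per file, for example with the terms + top_hits rollup shown earlier:

import java.util.List;

import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;

// Split one large extracted text into chunks and index each chunk as its own
// document, carrying the id of the source file and its position within it.
// splitIntoChunks is a hypothetical helper; fileId and the field names are placeholders.
BulkRequest bulkRequest = new BulkRequest();
List<String> chunks = splitIntoChunks(extractedText, 1_000_000); // e.g. ~1 MB of text per chunk
for (int i = 0; i < chunks.size(); i++) {
    bulkRequest.add(new IndexRequest(indexName, "chunk", fileId + "_" + i)
            .source("file_id", fileId,   // groups all chunks of one file
                    "chunk_seq", i,      // preserves ordering within the file
                    "body", chunks.get(i)));
}
m_bulkAction.execute(bulkRequest, listener);

Whether the grouping is done with an aggregation or on the application side, each chunk stays small enough to index without holding the whole file in memory.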