Hi ES Team,
This question may have been asked before. I am looking for ways to index really large documents (PDF, Word, etc.) with ES. Our application crawls an enterprise Active Directory system, and we have hit some really large documents that we are currently unable to index; we are seeing an OutOfMemoryError in the ES logs. I should also mention that for large files we pass a URL to a custom plugin running within the ES server. This plugin reads the entire source file into memory from the provided URL before submitting an indexing request to ES. Of course this is not a great solution. The other option would be to chunk the file and send the pieces to ES, but then each chunk gets indexed as a separate document, which won't work at search time. What does the community suggest? Is there an API on ES that accepts the document source as a stream? Below are the cluster details and the steps by which we currently index what we refer to as loose files (PDF, DOC, etc.).
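For reference, a minimal sketch of the chunking option we considered (names are hypothetical, not our actual plugin code): split the document's bytes into fixed-size pieces, where each piece would be indexed as its own document carrying a shared parent id so the pieces could be correlated at query time — which is exactly the part that gets awkward for search.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the chunking option: split a large document's
// bytes into fixed-size pieces. Each piece would then become its own
// IndexRequest, tagged with the same parent id (ES wiring omitted).
public class DocumentChunker {
    public static List<byte[]> chunk(byte[] document, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int offset = 0; offset < document.length; offset += chunkSize) {
            int end = Math.min(offset + chunkSize, document.length);
            byte[] piece = new byte[end - offset];
            System.arraycopy(document, offset, piece, 0, piece.length);
            chunks.add(piece);
        }
        return chunks;
    }
}
```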
Any recommendation would be much appreciated. Many thanks,
-Priyanka
Cluster Details:
2-node cluster. Each node has 16 CPUs, 12 GB RAM, and a 250 GB HDD; 6 GB of heap has been assigned to the ES server.
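For completeness, on the older ES releases our code targets, the 6 GB heap would typically be set via the ES_HEAP_SIZE environment variable before starting each node (assumed setup; adjust to however your service scripts launch ES):

```shell
# ES_HEAP_SIZE sets both -Xms and -Xmx on older Elasticsearch releases.
export ES_HEAP_SIZE=6g
./bin/elasticsearch
```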
// Call made to index the document, running within the ES server via a custom plugin.
BulkRequest bulkRequest = new BulkRequest();
IndexRequest request = new IndexRequest(indexName, "some_type", id);
// 'document' is the byte stream of the document to index; it tends to be
// really large for the large files.
request.source(document);
bulkRequest.add(request);
m_bulkAction.execute(bulkRequest, listener); // instance of TransportBulkAction
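In contrast to the code above, which needs the whole byte[] in memory, a bounded read would look something like the sketch below (hypothetical helper, not our plugin): pull the source from an InputStream through a fixed-size buffer, so only one buffer's worth of data is resident at a time, and each filled buffer would become one bulk item.

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch: read the source in bounded chunks from an InputStream instead of
// materializing the whole file as one byte[]. ES request wiring is omitted;
// each buf[0..n) would be turned into one item of a bulk request.
public class BoundedReader {
    public static int readChunks(InputStream in, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        int chunks = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            if (n > 0) {
                chunks++; // here buf[0..n) would be submitted as one bulk item
            }
        }
        return chunks;
    }
}
```

Whether ES itself can accept such a stream end-to-end is exactly my question; this only bounds the memory on the reading side.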