Indexing a file of > 1GB

Hi,

To index the content of files, we have written a custom index action plugin. This plugin fetches (or streams) the content from a disk/network/repository and indexes the whole file in one go from within the TransportAction implementation.

At ~300-500 MB, we start observing OOM errors in the plugin, probably because the whole file content is held in memory.

I am guessing this should be a common problem that many may have already solved.

Thanks,
Sandeep


Are you trying to index a single document > 1 GB? If so, that's not trivial, and you are treading into a somewhat challenging area that won't have clear best practices. Or are you breaking this document up into smaller documents and having problems with that?

Here's my standard answer to mammoth documents. You should structure your documents so that they can be meaningfully evaluated by your users as relevant/irrelevant. Even for books, for example, we break up the content into semantically meaningful sections. You can still use field collapsing/grouping (like a top_hits aggregation) to meaningfully display grouped results by a larger unit.
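To make that concrete, a grouped query could look roughly like the sketch below (assuming section-level documents that carry a file_id keyword field, and the Python elasticsearch client with 7.x-style body= calls; your index and field names will differ):

```python
# Sketch only: group section-level hits back to the file they came from.
# Assumes each section document has a "file_id" keyword field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="sections",
    body={
        "query": {"match": {"content": "transport action"}},
        # Option 1: field collapsing -- one best section per file.
        "collapse": {"field": "file_id"},
        # Option 2: terms + top_hits aggregation for grouped display.
        "aggs": {
            "by_file": {
                "terms": {"field": "file_id", "size": 10},
                "aggs": {"best_sections": {"top_hits": {"size": 3}}},
            }
        },
    },
)
```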

Then again you might not be solving a standard search problem, so I'd be curious to learn more.

+1. In real life, when you look at a book and its index, you can see that it is actually the pages that have been indexed, not the full book.

Hi Doug, David,

Thanks for your replies.

I understand and agree with the concept of organizing large content, and we are already doing something along those lines. However, every once in a while you get a single monolithic file which is really large, and our use case is to index that file, no matter what.

Not just at 1 GB; even at ~500 MB I run into memory issues in the custom action plugin we have written, which fetches all of this content into memory and then indexes it as a single document. The index action runs from within a plugin on the ES node.

I could, theoretically, create multiple records, splitting this file into chunks, but that would complicate my search use cases.

I would like a solution that allows fetching the content as a stream and indexing it into the ES index one chunk at a time.
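Roughly, I am picturing something along the lines of the sketch below (this is not our plugin code, which is Java; it assumes the Python elasticsearch client and made-up index and field names, just to illustrate the chunked, streaming indexing I mean):

```python
# Sketch only: stream a large file and index each chunk as its own document.
# CHUNK_CHARS, the index name and the field names are made up for illustration.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

CHUNK_CHARS = 1_000_000  # characters per chunk document (~1 MB for ASCII text)

def read_in_chunks(path, size=CHUNK_CHARS):
    """Yield the file content piece by piece instead of loading it whole."""
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        while True:
            chunk = f.read(size)
            if not chunk:
                return
            yield chunk

def chunk_actions(path, file_id):
    """Turn each chunk into a bulk index action tagged with its sequence number."""
    for seq, chunk in enumerate(read_in_chunks(path)):
        yield {
            "_index": "file-chunks",
            "_id": f"{file_id}-{seq}",
            "_source": {"file_id": file_id, "seq": seq, "content": chunk},
        }

es = Elasticsearch("http://localhost:9200")
for ok, result in streaming_bulk(es, chunk_actions("/data/huge.log", "huge-1")):
    if not ok:
        print("failed:", result)
```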

Any suggestions are appreciated.

Thanks,
Sandeep

The main obstacle here is how documents are examined for indexing.

The Lucene API accepts only whole documents, for several reasons. That is, it has not only to index word by word, which would be the naive way, but also to build statistics of terms and frequencies per document and per segment (per index).

To build these statistics, the document has to be held in memory as a whole, which is why a ~500 MB document does not fit into the default 1 GB heap. Maybe this process can be chunked in the future, but it is very hard: it would require stateful logic in the Lucene API so that Lucene could suspend and resume indexing at certain points.

With this in mind, the workaround is to provision really large, expensive machines with several hundred gigabytes of RAM and crank up the JVM configuration to address that heap. It will take minutes if not hours to process the gigabyte-sized documents you are talking about, and ES will appear stalled before completing the task.

But the story continues. After indexing (whether in memory at once or chunked), it is close to impossible to retrieve the document from the index. The stored source will be enormous, and the memory used to deliver it will drain resources from other search and index tasks, which then tend to hang. There is absolutely no sense in that. If you imagine that "then I just skip the source and highlight the field" could be the solution, it will not work, since the highlighter operates on the complete field in the index, which saves no memory at all. It will even slow down the delivery of hits.
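Just to be concrete, the workaround I mean would look something like the sketch below, and it does not help (made-up index and field names, Python client only for illustration):

```python
# Sketch of the "no source, just highlighting" idea discussed above.
# The highlighter still has to process the complete field, so memory use
# is not reduced; this only changes what is returned in the response.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="big-files",
    body={
        "query": {"match": {"content": "error"}},
        "_source": False,                          # do not return the huge field
        "highlight": {"fields": {"content": {}}},  # snippets only
    },
)
```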

Maybe you have already found answers for ES search on large documents; if so, I'd be happy to hear them.

So if you ask me, the use case of "index files no matter what" is not rational with regard to ES indexing and document delivery, because it has not been thought through to the end. Therefore it does not call for a solution on the ES side.

So why not, besides indexing chunks, just store the document on a filesystem, index the file metadata, and point to it with a URL? Or preprocess the document and extract only the most significant words?
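For example, something like the sketch below (field names and values are made up; Python client just for illustration):

```python
# Sketch only: keep the bytes on a filesystem, index metadata plus a pointer,
# optionally with significant words extracted in a preprocessing step.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(
    index="file-metadata",
    id="huge-1",
    body={
        "filename": "huge.log",
        "size_bytes": 1_500_000_000,
        "url": "file:///data/archive/huge.log",
        "significant_terms": ["transport", "timeout", "heap"],  # extracted offline
    },
)
```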

Hi Jörg,

Thanks!

I understand; the use case is what it is! I believe splitting the document may be the only viable option at this time, since even if I increase the available memory, there will always be another document that exceeds the new limit.

Anyway, it would be nice to have some kind of chunking in Lucene. I'm wondering how others solve this problem (splitting seems to be the obvious favorite, I guess).

Thanks again,
Sandeep