I’d appreciate your input on the following, please:
The business has a requirement to extract a lengthy string (text) from an external object, for example a MOBI file, and store it in an Elasticsearch index as a searchable field. The challenge is that the string could be quite large, potentially in the region of 40–50 MB (which I think significantly exceeds Lucene limits), and the business expects to be able to search within this field.
I imagine this isn’t the first time such a scenario has arisen, so I’m keen to understand what best practices are in place for handling this type of requirement.
I'm a little confused. Are you referring to a single token that is 40-50 MB, or a text field containing multiple strings/tokens (which seems to be what you are implying)? If it's the latter, storing 40-50 MB in a single text field in Elastic is fine.
Perhaps you are thinking of the single-token limit in Lucene, which is about 32 KB per token.
Lucene does still have a document limit (per stored field) of about 2 GB, so 40-50 MB is well within that.
That said, there are other areas of concern, like actually returning the data due to HTTP limits, etc.
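In case a concrete example helps, here is a minimal sketch of storing extracted text in a plain analyzed text field. The host, index and field names, the `load_mobi_text()` helper, and the use of the 8.x Python client are all illustrative assumptions, not something from your setup:

```python
from elasticsearch import Elasticsearch

# Illustrative connection details -- adjust to your cluster.
es = Elasticsearch("http://localhost:9200")

# A plain analyzed "text" field: the analyzer breaks the 40-50 MB string into
# many small tokens, so the ~32 KB per-token limit is not normally a concern.
es.indices.create(
    index="books",
    mappings={
        "properties": {
            "title": {"type": "keyword"},
            "content": {"type": "text"},
        }
    },
)

# Hypothetical extraction step -- however you pull the text out of the MOBI file.
extracted_text = load_mobi_text("example.mobi")

es.index(
    index="books",
    id="book-1",
    document={"title": "example.mobi", "content": extracted_text},
)
```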
Perhaps take a look at this thread
Minor side note: most of the text in a .mobi file is binary, so not sure if that is just an example... The header etc. is text, but the actual text of the document is binary AFAIK.
Thank you, @stephenb
Yes, I am referring to the extracted text, like:
"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
which should be searchable, i.e. if the customer searches for the phrase "Excepteur sint occaecat cupidatat non proident", it should return all the documents containing this phrase, as one would expect.
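A phrase query of that kind would look roughly like this (illustrative index/field names, sketched with the 8.x Python client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative connection details

# Find every document whose "content" field contains the exact phrase.
resp = es.search(
    index="books",
    query={
        "match_phrase": {
            "content": "Excepteur sint occaecat cupidatat non proident"
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```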
We face significant performance issues when storing very long text and searching across millions of documents, and I wonder if there is a recommended approach to optimise either the query or the storage (I guess nothing can be done about the latter) when querying this type of text field?
First: perhaps the search itself is quite fast, finding the document(s) that match the query, but then pulling those gigantic documents back out of Lucene, marshalling them up and sending them back to the client... yeah... that could be quite non-performant.
What we are doing on the semantic search side is chunking up these big docs into digestible parts...
Does your user really want to pull back the entire 50 MB just to find the 3 sentences or the paragraph that matches, along with the document name (could be a section etc.)?
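If the answer is no, one option is to not return the huge stored field at all and let highlighting hand back only the matching fragments. A rough sketch, again with illustrative names and the 8.x Python client (not something I have run against data of your size):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative connection details

resp = es.search(
    index="books",
    source=False,  # don't ship the 40-50 MB _source back to the client
    query={
        "match_phrase": {
            "content": "Excepteur sint occaecat cupidatat non proident"
        }
    },
    highlight={
        "fields": {
            "content": {"fragment_size": 200, "number_of_fragments": 3}
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit.get("highlight", {}).get("content", []))
```

One caveat: highlighting very large text fields can run into the index.highlight.max_analyzed_offset limit (on the order of a million characters by default, if I remember right), so for 40-50 MB fields you may need to index offsets on the field or raise that setting. Worth testing at realistic sizes.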
So perhaps chunking: maybe you do not want to do semantic search, but you could do something similar for normal search.
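And a very rough sketch of the chunking idea, with illustrative names and a naive paragraph-based splitter (chunk size, overlap, and the real splitting logic are things you would tune):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative connection details


def chunk_text(text, max_chars=5000):
    """Naively split text on blank lines into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks


extracted_text = load_mobi_text("example.mobi")  # hypothetical extraction helper

# Index each chunk as its own small document that points back to the source book.
for i, chunk in enumerate(chunk_text(extracted_text)):
    es.index(
        index="book-chunks",
        id=f"book-1-{i}",
        document={"book_id": "book-1", "chunk_no": i, "content": chunk},
    )
```

Each hit is then a small chunk rather than the whole 50 MB, and book_id tells you which source document it came from. The trade-off is that a phrase straddling a chunk boundary would be missed; overlapping chunks are one way to mitigate that.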