I've run into an issue where searching a relatively small data set is hitting some pretty slow performance. We're running queries for text matches like "Person Name" against a dataset of around 8 GB. The catch is that the data is ingested documents such as Word, PDF, and Excel files, and some of the content fields contain 16+ million characters. Does anybody have advice on how to handle fields with such a large character count?
Any help or advice is much appreciated!
Thanks,
Jason
Hi Jason, thanks for posting your question! Text fields are tokenized by default, which optimizes search performance to the point where 16M characters in a text field shouldn't be a problem on its own.
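If you want to sanity-check how your content is being tokenized, the `_analyze` API is a quick way to see it. A minimal sketch, assuming the `standard` analyzer since I haven't seen your mapping yet:

```
// Placeholder text; swap in the analyzer from your actual mapping if it's custom
POST _analyze
{
  "analyzer": "standard",
  "text": "Person Name appears somewhere in this document"
}
```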
Would you mind sharing one of the queries that's slow so we can get a better idea of what's happening? Could you please also share your mapping for the index you're searching?
Thanks Jason! I believe the highlighter is the culprit here. Highlighting large fields always incurs a performance cost, because the highlighter has to load the entire text, analyze it, and search it at query time.
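For reference, the slow pattern typically looks something like this. This is a sketch with a placeholder index and field name (`my-index`, `content`), not your actual query:

```
GET my-index/_search
{
  "query": {
    "match_phrase": { "content": "Person Name" }
  },
  "highlight": {
    "fields": {
      // Without term vectors or indexed offsets, the highlighter must
      // re-analyze the entire multi-million-character field per hit
      "content": {}
    }
  }
}
```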
Set term_vector to with_positions_offsets in the mapping. This requires reindexing, and it will also increase the size of the index significantly. See the links above for more info on this option.
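In the mapping it would look roughly like this (index and field names are placeholders, and you'd reindex your data into the new index afterwards):

```
PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        // Stores term vectors with positions and offsets so the
        // highlighter doesn't have to re-analyze the field at query time
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
```

With term vectors stored, highlighting no longer needs to re-analyze the whole field per hit, which is where the speedup comes from; the trade-off is the larger index mentioned above.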