@dadoonet Thank you for your reply. An XML file containing the text is available, but to get the text out we need to apply transforms to that XML. That is doable, just an extra step in the process. I also have the same text available in PDF format.
So which would be more efficient? Is the pain of extracting the text from the XML worth it, compared with indexing the already available PDF using the plugin?
@dadoonet Sure, then we will go with text. But for a deeper understanding, what do you mean by less memory consumed? When creating the mapping, we can exclude the PDF content field from being stored in _source, right (as mentioned here: https://qbox.io/blog/index-attachments-files-elasticsearch-mapper)? Would memory consumption still be lower with plain text in that case?
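To make the question concrete, here is a minimal sketch of the kind of mapping I have in mind. It only builds the request body and sends nothing to a cluster. The type name `document` and the field name `file` are made up for illustration, and the `attachment` field type assumes the mapper-attachments plugin from the linked post is installed.

```python
import json

# Hypothetical names, not taken from any real index:
# "document" is the mapping type and "file" is the attachment field.
ATTACHMENT_FIELD = "file"


def build_mapping(field=ATTACHMENT_FIELD, doc_type="document"):
    """Build an index-creation body that keeps `field` out of _source."""
    return {
        "mappings": {
            doc_type: {
                # Leave the base64-encoded PDF out of the stored _source.
                "_source": {"excludes": [field]},
                "properties": {
                    # Field type assumed to come from the mapper-attachments plugin.
                    field: {"type": "attachment"},
                },
            }
        }
    }


mapping = build_mapping()
print(json.dumps(mapping, indent=2))
```

The printed JSON is what I would send as the body when creating the index.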
@dadoonet Sure. Will there be any significant performance impact when searching or querying back the indexed PDF documents? And what if the PDF documents are as large as 300 or 500 MB? Will it still work?