Is it possible to change the field names in Elasticsearch while ingesting documents with FSCrawler?
Example: the content of a file is ingested under the "_source.content" field in Elasticsearch. Can we change "_source.content" to "_source.passage"?
Q2)
I believe FSCrawler internally uses Apache Tika to extract metadata and content from a file. Is there any way to split the content that we save in Elasticsearch? Instead of saving everything under the single field "_source.content", is there an existing way to split the document (by page or by paragraph) and save the parts in the same index or in different indices?
No, it's not possible today. There has been a similar request here:
IIRC it could be done, but for sure this is not going to be implemented anytime soon.
The best option for now would be to preprocess the PDF document and generate one file per page before starting FSCrawler or calling its REST API.
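For what it's worth, the preprocessing step could look roughly like the sketch below. It is not part of FSCrawler; it assumes the third-party pypdf package is installed, and the function names and the "_page_NNN" naming scheme are just made up for illustration.

```python
from pathlib import Path


def page_filename(stem: str, page_number: int, total_pages: int) -> str:
    """Build a zero-padded per-page file name, e.g. 'report_page_003.pdf'."""
    # Pad to at least 3 digits so the files sort in page order.
    width = max(3, len(str(total_pages)))
    return f"{stem}_page_{page_number:0{width}d}.pdf"


def split_pdf(source: Path, target_dir: Path) -> list[Path]:
    """Write every page of `source` as its own PDF inside `target_dir`."""
    # Assumption: pypdf is available (pip install pypdf). It is imported
    # here so the naming helper above can be used without it.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(str(source))
    total = len(reader.pages)
    target_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for number, page in enumerate(reader.pages, start=1):
        # One writer per page, so each output file holds exactly one page.
        writer = PdfWriter()
        writer.add_page(page)
        out_path = target_dir / page_filename(source.stem, number, total)
        with open(out_path, "wb") as handle:
            writer.write(handle)
        written.append(out_path)
    return written
```

You would call something like split_pdf(Path("report.pdf"), Path("pages")) and then point FSCrawler at the "pages" directory, so that each page ends up as its own document with its own "content" field.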
Also, when FSCrawler crawls through all the documents to index them, is the content of each document saved on my Elasticsearch cluster, or does it work some other way internally?
I ask because even after I stop the crawler, I am still able to read the content when searching.