FSCrawler - Elastic Mapping Changes

Hi,

Can anyone please guide me on the questions below?

Q1)

Is it possible to change the field names in Elasticsearch while ingesting documents using FSCrawler?

Example: the content of a file is ingested under the "_source.content" field in Elasticsearch. Can we change this "_source.content" to "_source.passage"?

Q2)

I believe FSCrawler internally uses Apache Tika for extracting metadata and content from a file. Is there any possibility to split the content that we are saving in Elasticsearch? Instead of saving it under a single field, "_source.content", is there an existing way to split the document (page- or paragraph-wise) and save the parts under the same or a different index?

Regards,
Sarath

You can define an ingest pipeline that uses a rename processor,
then set this pipeline in FSCrawler.
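As a sketch (the pipeline id `fscrawler-rename` and the target field `passage` are just illustrative names), the pipeline could look like this in Kibana Dev Tools:

```json
PUT _ingest/pipeline/fscrawler-rename
{
  "description": "Rename FSCrawler's content field to passage",
  "processors": [
    {
      "rename": {
        "field": "content",
        "target_field": "passage"
      }
    }
  ]
}
```

Then point FSCrawler at it via the `elasticsearch.pipeline` setting in the job's `_settings.yaml`:

```yaml
elasticsearch:
  pipeline: "fscrawler-rename"
```

Every document FSCrawler sends will then pass through the pipeline before being indexed, so it lands with `_source.passage` instead of `_source.content`.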

No, it's not possible today. There has been a similar ask here:

IIRC it could be done, but it is certainly not going to be implemented anytime soon.
The best option for now would be to preprocess the PDF document and generate one file per page before starting FSCrawler or calling its REST API.
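If paragraph-level granularity is enough and the sources are plain text, the pre-splitting step can be a small script like the sketch below (all names here are hypothetical; for PDFs you would instead need a PDF library such as pypdf to write one file per page). Each output file then gets crawled and indexed as its own document:

```python
from pathlib import Path


def split_into_paragraphs(src: Path, out_dir: Path) -> list[Path]:
    """Split a plain-text file on blank lines and write one file per paragraph.

    Each paragraph becomes its own file, so FSCrawler indexes it as a
    separate document.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    text = src.read_text(encoding="utf-8")
    # Paragraphs are runs of non-empty lines separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    written = []
    for i, para in enumerate(paragraphs, start=1):
        out = out_dir / f"{src.stem}_para{i:04d}.txt"
        out.write_text(para, encoding="utf-8")
        written.append(out)
    return written
```

Point FSCrawler's watched directory at `out_dir` once the split is done.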

Thank you David.

Also, when FSCrawler crawls through all the documents to index them, is the content of each document saved on my Elasticsearch cluster, or does it work some other way internally?
I ask because even after I stop my crawler, I can still read the content while searching.

Everything is sent to Elasticsearch, where it is indexed and stored. FSCrawler itself keeps nothing searchable; once a document has been indexed, stopping the crawler does not remove it.
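You can verify this yourself with a plain search against the index (the index name `my_fscrawler_job` is just a placeholder for your job's index):

```json
GET my_fscrawler_job/_search
{
  "query": { "match_all": {} },
  "_source": ["file.filename", "content"]
}
```

The extracted text comes back in `_source.content` of each hit, straight from Elasticsearch, whether or not FSCrawler is running.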

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.