FSCrawler - Elastic Mapping Changes

Sarath_Pullabhotla · April 7, 2020, 10:01am

Hi,

Can anyone please guide me on the questions below?

Q1)

Is it possible to change the tag names in Elasticsearch while ingesting documents using FSCrawler?

Example: The content of a file gets ingested under the "_source.content" tag in Elasticsearch. Can we change this "_source.content" to "_source.passage"?

Q2)

I believe FSCrawler internally uses Apache Tika to extract metadata and content from a file. Is there any way to split the content that we save in Elasticsearch? Instead of saving everything under a single tag "_source.content", is there an existing way to split the document (page-wise or paragraph-wise) and save the pieces under the same or a different index?

Regards,
Sarath

dadoonet · April 7, 2020, 11:18am

For Q1: you can define an ingest pipeline which uses a rename processor.
Then set this pipeline in FSCrawler.
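
A minimal sketch of what such a pipeline could look like, written as a Python dict so it can be printed as the JSON body for `PUT _ingest/pipeline/rename_content`. The pipeline id `rename_content` is made up for illustration, and `apply_rename` is only a local stand-in that imitates the effect of the rename processor on one document; it is not Elasticsearch code. Once the pipeline exists, FSCrawler would be pointed at it from the job settings (as far as I know, the `pipeline` key under `elasticsearch`; check the documentation for your FSCrawler version).

```python
import json

# Hypothetical pipeline id; any name works.
PIPELINE_ID = "rename_content"

# Body for: PUT _ingest/pipeline/rename_content
pipeline_body = {
    "description": "Rename FSCrawler's content field to passage",
    "processors": [
        {
            "rename": {
                "field": "content",
                "target_field": "passage",
                # Skip documents without a content field instead of failing.
                "ignore_missing": True,
            }
        }
    ],
}


def apply_rename(doc, pipeline):
    """Local stand-in: imitate the rename processors on a single document."""
    out = dict(doc)
    for processor in pipeline["processors"]:
        rename = processor.get("rename")
        if rename is None:
            continue
        field, target = rename["field"], rename["target_field"]
        if field in out:
            out[target] = out.pop(field)
        elif not rename.get("ignore_missing", False):
            raise KeyError(field)
    return out


if __name__ == "__main__":
    print(f"PUT _ingest/pipeline/{PIPELINE_ID}")
    print(json.dumps(pipeline_body, indent=2))
    print(apply_rename({"content": "Hello", "file": {"filename": "a.pdf"}}, pipeline_body))
```

With this in place, a document indexed through the pipeline would carry its extracted text under "_source.passage" instead of "_source.content".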

For Q2: no, it's not possible today. There has been a similar request here:

github.com/dadoonet/fscrawler

is it possible to create an elasticsearch doc per paragraph (words for example) ?

opened 02:50PM - 22 Aug 19 UTC

OlivierTLS

Hello, After some tests with Word documents, I see that all the content is …saved in the _source": { "content": field in Elasticsearch. Is there a way to create an Elasticsearch doc per Word paragraph/title? For example:

**1 - titre 1** content 1
**2 - titre 2** content 2
...

would generate 2 Elasticsearch docs:

Doc 1: "titre" : "titre 1", "content" : "content 1", and some global metadata such as docpath, creationdate ...
Doc 2: "titre" : "titre 2", "content" : "content 2", and some global metadata such as docpath, creationdate ...

I know that I can do it by parsing the XML file of the Word docs and generating a bulk request with that data. But is it possible to do that with some FSCrawler configuration? Thanks, Olivier

IIRC it could be done, but it is certainly not going to be implemented anytime soon.
The best option for now would be to preprocess the PDF document and generate one file per page before starting FSCrawler or calling its REST API.
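
A rough sketch of that preprocessing idea, using only the Python standard library. Note the swap: instead of splitting the PDF itself, it assumes the text has already been extracted by a tool that separates pages with a form feed character (`pdftotext` does this, for example) and writes one `.txt` file per page. The function names, the file naming scheme and the output directory are all hypothetical; a PDF library could split the original file directly instead.

```python
import tempfile
from pathlib import Path


def split_pages(text):
    """Return (page_number, page_text) pairs, skipping blank pages.

    Assumes pages are separated by form feed characters.
    Page numbers follow the original document, so gaps mark blank pages.
    """
    pages = []
    for number, page in enumerate(text.split("\f"), start=1):
        page = page.strip()
        if page:
            pages.append((number, page))
    return pages


def write_pages(text, source_name, out_dir):
    """Write one text file per page and return the created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, page in split_pages(text):
        path = out / f"{source_name}_page_{number:04d}.txt"
        path.write_text(page, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    sample = "First page text\fSecond page text\f"
    with tempfile.TemporaryDirectory() as tmp:
        for created in write_pages(sample, "report", tmp):
            print(created.name)
```

The output directory would then be the one FSCrawler is configured to crawl, so that each page ends up as its own Elasticsearch document.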

Sarath_Pullabhotla · April 7, 2020, 12:17pm

Thank you David.

Also, when FSCrawler crawls through all the documents to index them, is the content of each document saved on my Elasticsearch cluster, or does it work some other way internally?
I ask because even after I stop my crawler, I am still able to read the content while searching.

dadoonet · April 7, 2020, 2:19pm

Everything is sent to Elasticsearch, where it is indexed and stored.
