Hey all! Hope you are doing great. I've recently started working on a solution using Elasticsearch, and we need to parse and upload different kinds of documents, such as emails, PPTs, PDFs, etc.
The client asked for the ability to query images contained in those documents, so we decided to use FSCrawler thanks to its integrated OCR processing.
Here's my question: is there a way to chunk documents after performing the OCR and before uploading them to Elasticsearch? We would really like to have chunks indexed instead of the whole document so that we can perform more fine-grained searches and use vector embeddings (if it comes to that).
Is there a way to implement chunking or embeddings within FSCrawler?
Hey @Santiago_Rubio ! I'm tagging @dadoonet , as he's the author/maintainer of FSCrawler, and may have some thoughts on this.
My first thought, though, is that if you're able to use the most recent version of Elasticsearch (8.15), the semantic_text field type will do the chunking for you. You can read about this field type here: Semantic text field type | Elasticsearch Guide [8.15] | Elastic.
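Roughly, the setup looks like the sketch below (a minimal example rather than a drop-in config: the endpoint ID, index name and field name are placeholders, and on 8.15 you create an inference endpoint first and point the semantic_text field at it):

```
// Create an ELSER inference endpoint (placeholder ID)
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

// Map the text field as semantic_text; chunking and embedding then happen at ingest time
PUT my-documents
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": "my-elser-endpoint"
      }
    }
  }
}
```

Anything indexed into that field gets chunked and embedded automatically, so you wouldn't need to pre-chunk the documents yourself.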
Hey Sean! Thanks for the really quick response. I'll take a look at the semantic text field, wasn't aware of that feature.
Regarding the ingest pipeline, would it be possible to use both FSCrawler and an ingest pipeline together? AFAIK, FSCrawler uses its own pipeline and already leaves all the data in an index.
Hey David! Can't believe I'm talking to you hehe. I'll take a look and see what I can work out. If I have any more doubts I'll keep posting on this thread.
Hey David! Maybe I should create a new topic for this, but is there a way to improve FSCrawler's throughput? We are trying to index roughly 2k documents and found that FSCrawler started out pretty strong but slowed down over time. The documents are nothing out of the ordinary: a lot of average PDFs, some TXTs, emails and Microsoft Office files.
In terms of memory, it was only using 60% of the available amount so we don't really know what's going on.
Would modifying bulk_size, flush_interval and byte_size affect its speed?
For context, we have OCR enabled and are sending each document to a custom ingest pipeline that chunks the content in a for loop and then keeps only some of the fields from FSCrawler's output.
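For reference, the settings in question live in FSCrawler's job file (_settings.yaml in recent versions); here's a rough sketch of what we're running, with illustrative rather than exact values:

```yaml
name: "documents_job"
fs:
  url: "/path/to/documents"
  ocr:
    enabled: true                   # OCR on images and scanned PDFs via Tesseract
elasticsearch:
  pipeline: "chunking_pipeline"     # custom ingest pipeline applied at index time
  bulk_size: 100                    # documents per bulk request
  byte_size: "10mb"                 # flush when the pending bulk reaches this size
  flush_interval: "5s"              # or after this much time has passed
```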
The problem with the current implementation is that it's single-threaded. I need to create a version 3 which is fully asynchronous and able to scale across as many threads as possible. But that's nothing for the short term.
If you have, let's say, 5 dirs in your root directory, you could start 5 FSCrawler instances instead.
That might help.
Just to resolve this issue: we connected FSCrawler to an ingest pipeline with two processors. The first is a script processor that manually chunks the text resulting from FSCrawler's OCR in a for loop, and the second is a remove processor that keeps only the fields we were interested in.
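For anyone landing here with the same need, the pipeline looked roughly like this (a simplified sketch: the chunk size, target field and removed fields are placeholders rather than our exact values):

```
PUT _ingest/pipeline/chunking_pipeline
{
  "processors": [
    {
      "script": {
        "description": "Split the OCR/extracted text into fixed-size chunks",
        "source": """
          if (ctx.content != null) {
            String text = ctx.content;
            int size = 1000;                // chunk length in characters
            List chunks = new ArrayList();
            for (int i = 0; i < text.length(); i += size) {
              chunks.add(text.substring(i, Math.min(i + size, text.length())));
            }
            ctx.chunks = chunks;
          }
        """
      }
    },
    {
      "remove": {
        "description": "Drop the FSCrawler fields we do not need",
        "field": ["meta", "attachment"],
        "ignore_missing": true
      }
    }
  ]
}
```

One caveat: this keeps all chunks as an array on the same Elasticsearch document. An ingest pipeline can't split one incoming document into several, so if you need one indexed document per chunk, that has to happen outside the pipeline.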