Does FSCrawler support chunking?

Hey all! Hope you are doing great. I've recently started working on a solution using Elasticsearch, and we need to parse and upload different kinds of documents, such as emails, PPTs, PDFs, etc.

The client asked to be able to query the images contained in those documents, so we decided to use FSCrawler thanks to its integrated OCR processing.

Here's my question: is there a way to chunk documents after performing the OCR and before uploading them to Elasticsearch? We would really like to have chunks indexed instead of the whole document, so that we can perform more fine-grained search and add vector embeddings (if it comes to that).

Is there a way to implement chunking or embeddings within FSCrawler?

Thanks!

Hey @Santiago_Rubio! I'm tagging @dadoonet, as he's the author/maintainer of FSCrawler and may have some thoughts on this.

My first thought, though, is that if you're able to use the most recent version of Elasticsearch (8.15), the semantic_text field type will do chunking for you. You can read about this field type here: Semantic text field type | Elasticsearch Guide [8.15] | Elastic.
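To give you an idea, here's a minimal sketch of that mapping using the Python client. The index name and inference endpoint are placeholders; you'd create the inference endpoint (e.g. an ELSER one) via the Inference API first.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust host/auth for your cluster

# Text indexed into "content" is chunked and embedded automatically at ingest
# time; "my-elser-endpoint" is a placeholder for an inference endpoint you
# would create beforehand via the Inference API.
es.indices.create(
    index="docs-semantic",
    mappings={
        "properties": {
            "content": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint",
            }
        }
    },
)
```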

If not, you may be looking at using Ingest Pipelines to do chunking. See this blog for some ideas: Chunking large documents via ingest pipelines and nested vectors — Search Labs.
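As a very rough illustration of that route (not the exact approach from the blog), here's a sketch that assumes FSCrawler's extracted text ends up in the content field; the chunk size and field names are placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust host/auth for your cluster

# Painless script that splits the extracted text into fixed-size character
# chunks under a "chunks" field. A real setup would more likely split on
# sentence or token boundaries instead.
chunking_script = """
if (ctx.content != null) {
  String text = ctx.content;
  int size = 1000;
  def chunks = [];
  for (int i = 0; i < text.length(); i += size) {
    int end = (int) Math.min(i + size, text.length());
    chunks.add(['text': text.substring(i, end)]);
  }
  ctx.chunks = chunks;
}
"""

es.ingest.put_pipeline(
    id="chunking-pipeline",
    processors=[{"script": {"source": chunking_script}}],
)
```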


Hey Sean! Thanks for the really quick response. I'll take a look at the semantic_text field; I wasn't aware of that feature.

Regarding the ingest pipeline, would it be possible to use both FSCrawler and an ingest pipeline together? AFAIK, FSCrawler uses its own pipeline and leaves all the data already indexed.

Thanks!

Sure. Look at Elasticsearch settings — FSCrawler 2.10-SNAPSHOT documentation
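Roughly speaking, you point the job at your pipeline in its _settings.yaml, something like this (job name, node URL and pipeline name are just examples):

```yaml
name: "my_job"
fs:
  url: "/path/to/your/documents"
elasticsearch:
  nodes:
    - url: "http://localhost:9200"
  pipeline: "chunking-pipeline"
```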


Hey David! Can't believe I'm talking to you, hehe. I'll take a look and see what I can work out. If I have any more doubts, I'll keep posting in this thread.

Thanks for the reply. Enjoy your weekend!


Hey David! Maybe I should create a new topic for this, but is there a way to improve FSCrawler's throughput? We are trying to index roughly 2k documents and found that FSCrawler started out pretty strong but slowed down over time. These documents are nothing out of the ordinary: a lot of average PDFs, some TXTs, emails and Microsoft Office files.

In terms of memory, it was only using 60% of the available amount, so we don't really know what's going on.

Would modifying bulk_size, flush_interval and byte_size affect its speed?

For context, we have OCR enabled and are sending each document to a custom ingest pipeline that chunks the content in a for loop and then keeps only some of the fields from FSCrawler's output.
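For reference, these are the settings I mean, under the elasticsearch section of the job's _settings.yaml (the values below are just illustrative):

```yaml
name: "my_job"
elasticsearch:
  bulk_size: 100        # number of documents buffered per bulk request
  byte_size: "10mb"     # flush the bulk buffer once it reaches this size
  flush_interval: "5s"  # flush at least this often
```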

Thanks in advance!

The problem with the current implementation is that it's single-threaded. I need to create a version 3 which is fully asynchronous and able to scale using as many threads as possible. Nothing in the short term, though.

If you have, let's say, 5 dirs in your root directory, you could start 5 fscrawler instances instead. That might help.

The contention point is sadly the OCR part, IMO.

Perfect, thanks a ton David! Enjoy your weekend!

Just to resolve this issue: we connected FSCrawler to an ingest pipeline with two processors. The first is a script processor that manually chunks the text resulting from FSCrawler's OCR within a for loop, and the second is a remove processor that keeps only the fields we were interested in.
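For anyone finding this later, here's a trimmed-down sketch of that kind of pipeline using the Python client (field names and the script body are placeholders, not our exact ones):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust host/auth for your cluster

# A script processor that builds a "chunks" field from the OCR'd text,
# followed by a remove processor that drops the FSCrawler fields we don't
# need. The script body is a stand-in; see the chunking sketch earlier in
# the thread for a fuller version.
es.ingest.put_pipeline(
    id="fscrawler-chunking",
    processors=[
        {
            "script": {
                "source": "if (ctx.content != null) { ctx.chunks = [['text': ctx.content]] }"
            }
        },
        {
            "remove": {
                "field": ["meta", "file", "path"],  # placeholder field names
                "ignore_missing": True,
            }
        },
    ],
)
```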

