Hey all! Hope you are doing great. I've recently started working on a solution using Elasticsearch, and we need to parse and upload different kinds of documents, such as emails, PPTs, PDFs, etc.
The client asked for the ability to query images contained in those documents, so we decided to use FSCrawler thanks to its integrated OCR processing.
Here's my question: is there a way to chunk documents after performing the OCR and before uploading them to Elasticsearch? We would really like to have chunks indexed instead of the whole document so that we can perform more fine-grained searches and use vector embeddings (if it comes to that).
Is there a way to implement chunking or embeddings within FSCrawler?
Hey @Santiago_Rubio ! I'm tagging @dadoonet , as he's the author/maintainer of FSCrawler, and may have some thoughts on this.
My first thought, though, is that if you're able to use the most recent version of Elasticsearch (8.15), the semantic_text field type will do the chunking for you. You can read about this field type here: Semantic text field type | Elasticsearch Guide [8.15] | Elastic.
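Roughly, the setup looks like the sketch below (a minimal example rather than a drop-in config: the endpoint ID, index name and field name are placeholders, and on 8.15 you create an inference endpoint first and point the semantic_text field at it):

```
// Create an ELSER inference endpoint (placeholder ID)
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}

// Map the text field as semantic_text; chunking and embedding then happen at ingest time
PUT my-documents
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": "my-elser-endpoint"
      }
    }
  }
}
```

Anything indexed into that field gets chunked and embedded automatically, so you wouldn't need to pre-chunk the documents yourself.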
Hey Sean! Thanks for the really quick response. I'll take a look at the semantic text field, wasn't aware of that feature.
Regarding the ingest pipeline, would it be possible to use both FSCrawler and an ingest pipeline together? AFAIK, FSCrawler uses its own pipeline and already leaves all the data in an index.
Hey David! Can't believe I'm talking to you hehe. I'll take a look and see what I can work out. If I have any more doubts I'll keep posting on this thread.
Hey David! Maybe I should create a new topic for this, but is there a way to improve FSCrawler's throughput? We are trying to index roughly 2k documents and found that FSCrawler started out pretty strong but slowed down over time. The documents are nothing out of the ordinary: a lot of average PDFs, some TXTs, emails and Microsoft Office files.
In terms of memory, it was only using 60% of the available amount so we don't really know what's going on.
Would modifying bulk_size, flush_interval and byte_size affect its speed?
For context, we have OCR enabled and are sending each document to a custom ingest pipeline that chunks the content in a for loop and then keeps only some of the fields from FSCrawler's output.
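For reference, the settings in question live in FSCrawler's job file (_settings.yaml in recent versions); here's a rough sketch of what we're running, with illustrative rather than exact values:

```yaml
name: "documents_job"
fs:
  url: "/path/to/documents"
  ocr:
    enabled: true                   # OCR on images and scanned PDFs via Tesseract
elasticsearch:
  pipeline: "chunking_pipeline"     # custom ingest pipeline applied at index time
  bulk_size: 100                    # documents per bulk request
  byte_size: "10mb"                 # flush when the pending bulk reaches this size
  flush_interval: "5s"              # or after this much time has passed
```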
The problem with the current implementation is that it's single-threaded. I need to create a version 3 which is fully asynchronous and able to scale across as many threads as possible. But that's nothing for the short term.
If you have, let's say, 5 dirs in your root directory, you could start 5 FSCrawler instances instead.
That might help.
Just to resolve this issue: we connected FSCrawler to an ingest pipeline with two processors. The first is a script processor that manually chunks the text resulting from FSCrawler's OCR in a for loop, and the second is a remove processor that keeps only the fields we were interested in.
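For anyone landing here with the same need, the pipeline looked roughly like this (a simplified sketch: the chunk size, target field and removed fields are placeholders rather than our exact values):

```
PUT _ingest/pipeline/chunking_pipeline
{
  "processors": [
    {
      "script": {
        "description": "Split the OCR/extracted text into fixed-size chunks",
        "source": """
          if (ctx.content != null) {
            String text = ctx.content;
            int size = 1000;                // chunk length in characters
            List chunks = new ArrayList();
            for (int i = 0; i < text.length(); i += size) {
              chunks.add(text.substring(i, Math.min(i + size, text.length())));
            }
            ctx.chunks = chunks;
          }
        """
      }
    },
    {
      "remove": {
        "description": "Drop the FSCrawler fields we do not need",
        "field": ["meta", "attachment"],
        "ignore_missing": true
      }
    }
  ]
}
```

One caveat: this keeps all chunks as an array on the same Elasticsearch document. An ingest pipeline can't split one incoming document into several, so if you need one indexed document per chunk, that has to happen outside the pipeline.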