FSCrawler: large documents and indexing based on content

Hi,

Complete novice here. This is my first post, so hi everyone, and go easy on me :slight_smile:

I set up FSCrawler today and it is awesome. Thanks for creating it for us.

I do have an issue with content being truncated from larger files. An example is an Excel file that is 12 MB. Not all of the content makes it into Elasticsearch.
Where do I increase the size limit? Is it in the FSCrawler config, the Elasticsearch config, or both?

Also, is it possible to index only certain content from a document into Elasticsearch, such as a zip code matched by a regex?

Thanks,

TC

Not all of the content makes it into Elasticsearch.

I think it's because of https://github.com/dadoonet/fscrawler#extracted-characters

By default, we only extract the first 100,000 characters.
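For illustration, here is a minimal Python sketch of raising that limit in a job's settings. It assumes the job settings are a JSON document and that the option is named `indexed_chars` under `fs`, with `"-1"` meaning no limit, as I understand the README section linked above; please check that section for the exact name and accepted values. The job name and path below are hypothetical.

```python
import json


def with_indexed_chars(settings: dict, value: str) -> dict:
    """Return a copy of the job settings with fs.indexed_chars set.

    Assumption: `indexed_chars` lives under the `fs` key of the job
    settings. The input dict is not modified.
    """
    updated = dict(settings)
    fs = dict(updated.get("fs", {}))
    fs["indexed_chars"] = value
    updated["fs"] = fs
    return updated


# Hypothetical job definition, for illustration only.
job = {"name": "my_job", "fs": {"url": "/path/to/docs"}}

# "-1" is assumed to disable the character limit entirely.
print(json.dumps(with_indexed_chars(job, "-1"), indent=2))
```

Keep in mind that extracting more text from large files will likely need more memory on the FSCrawler side.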

Also, is it possible to index only certain content from a document into Elasticsearch, such as a zip code matched by a regex?

Could you explain with a more detailed example? For instance, what raw text is in your document, and what do you want to have indexed?

Hi @dadoonet

Thanks for taking the time to create FSCrawler and the time to reply.

I can't believe I missed that option after reading the excellent documentation so much yesterday :blush:

On the other point: can I look for certain keywords in the content of Excel spreadsheets and index only those documents? For example, if the word "Elephant" is found, push the document to Elasticsearch, but ignore files that do not contain "Elephant".

Thank you.

Thanks for taking the time to create FSCrawler and the time to reply.

Thanks a lot!

I can't believe I missed that option after reading the excellent documentation so much yesterday

Yeah. I'm thinking of restructuring the documentation a bit, splitting it into multiple pages, and maybe doing something like https://esrally.readthedocs.io/en/latest/install.html, which I like a lot.

(I need time, which is hard to find these days :slight_smile: )

Can I look for certain keywords in the content of Excel spreadsheets and index only those documents? For example, if the word "Elephant" is found, push the document to Elasticsearch, but ignore files that do not contain "Elephant".

No. But maybe this is something that could be implemented, so could you open an issue in the FSCrawler project, something like "Do not index doc if extracted text matches a regex"?
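To make the request concrete for such an issue, here is a rough Python sketch of the two behaviours discussed in this thread. It is not FSCrawler functionality, only an illustration of the logic applied to text that has already been extracted; the function names and the US zip code pattern are hypothetical.

```python
import re


def should_index(extracted_text: str, pattern: str) -> bool:
    """Return True only if the extracted text contains a regex match."""
    return re.search(pattern, extracted_text, flags=re.IGNORECASE) is not None


def extract_matches(extracted_text: str, pattern: str) -> list:
    """Return every regex match, e.g. to index only those values."""
    return re.findall(pattern, extracted_text)


# Hypothetical pattern: 5-digit US zip code with an optional +4 suffix.
US_ZIP = r"\b\d{5}(?:-\d{4})?\b"

# Keyword filter: keep only documents that mention "Elephant".
print(should_index("An Elephant walked by", r"\belephant\b"))
print(should_index("A giraffe walked by", r"\belephant\b"))

# Content extraction: keep only the zip codes found in the text.
print(extract_matches("Ship to 90210 or 12345-6789", US_ZIP))
```

The first function covers the "skip files that do not mention Elephant" case, the second covers the "index only the zip code" case from the original question.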

Thanks!
