FSCrawler large document and indexing based on content

Tom_Corbett · November 29, 2017, 2:22pm

Hi,

Complete novice here. This is my first post, so hi everyone, and go easy on me.

I set up FSCrawler today and it is awesome. Thanks for creating it for us.

I do have an issue with content being truncated for larger files. An example is an Excel file that is 12 MB. Not all of the content makes it into Elasticsearch.
Where do I increase the size limit? Is it in the fs config, the Elasticsearch config, or both?

Also, is it possible to index only certain content from a document into Elasticsearch? Such as a zip code, using a regex?

Thanks,

TC

dadoonet · November 29, 2017, 10:42pm

> Not all of the content makes it into elasticsearch.

I think it's because of the indexed_chars setting, which is described in the FSCrawler README (GitHub - dadoonet/fscrawler: Elasticsearch File System Crawler (FS Crawler)).

By default, we only extract the first 100,000 characters.
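
To make the limit above concrete, here is a minimal sketch of raising it in the job settings. It assumes the FSCrawler 2.x JSON layout, where indexed_chars sits under the fs section of the job's _settings.json; the job name "my_docs" and the helper build_job_settings are made up for illustration, so check the README for the format your version expects.

```python
import json


def build_job_settings(job_name, indexed_chars):
    """Build a minimal job settings dict (FSCrawler 2.x JSON layout assumed)."""
    return {
        "name": job_name,
        "fs": {
            # Assumed accepted forms: a fixed character count ("500000"),
            # a percentage of the file size ("150%"), or "-1" for no limit.
            "indexed_chars": indexed_chars,
        },
    }


# "my_docs" is a hypothetical job name; "-1" asks for the full content.
settings_json = json.dumps(build_job_settings("my_docs", "-1"), indent=2)
print(settings_json)
```

The printed JSON is what would go into the job's settings file. Removing the limit means text extraction has to hold more of each document in memory, so a 12 MB spreadsheet may need a larger JVM heap for FSCrawler.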

> Also is it possible to only index certain content from a document into elasticsearch? Such as a zip code using a regex?

Could you explain with a more detailed example? For instance, what do you have as raw text in your document, and what do you want to have indexed?

Tom_Corbett · November 30, 2017, 8:09am

Hi @dadoonet

Thanks for taking the time to create FSCrawler and the time to reply.

I can't believe I missed that option, having read the excellent documentation a lot yesterday.

For the other point: can I look for certain keywords in the content of Excel spreadsheets and index only those files? For example, if the word "Elephant" is found, push the document to Elasticsearch, but ignore files that do not contain "Elephant".

Thank you.

dadoonet · November 30, 2017, 10:06am

> Thanks for taking the time to create FSCrawler and the time to reply.

Thanks a lot!

> I can't believe I missed that option, having read the excellent documentation a lot yesterday.

Yeah. I'm thinking of changing the documentation a bit: splitting it into multiple pages and maybe doing something like the Rally documentation (Installation - Rally 2.10.0.dev0 documentation), which I like a lot.

(I need time, which is hard to find these days.)

> Can I look for certain keywords in the content of Excel spreadsheets and index only those files? For example, if the word "Elephant" is found, push the document to Elasticsearch, but ignore files that do not contain "Elephant".

No. But maybe this is something that could be implemented, so could you open an issue in the FSCrawler project, with a title like "Do not index doc if extracted text matches a regex"?
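
Until something like that exists, the filtering has to happen outside FSCrawler. Below is a rough sketch of the idea in plain Python, assuming you already have the extracted text for each file. None of it is FSCrawler functionality: should_index, extract_zip_codes, the sample file names, and the US-style ZIP pattern are all hypothetical.

```python
import re

# Assumed format for illustration: US ZIP code, optionally ZIP+4.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def should_index(text, pattern):
    """Return True only if the extracted text matches the given regex."""
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def extract_zip_codes(text):
    """Return the distinct ZIP codes found in the text, sorted."""
    return sorted(set(ZIP_RE.findall(text)))


# Hypothetical extracted text, keyed by file name.
docs = {
    "zoo.xlsx": "Elephant enclosure, Springfield 12345",
    "budget.xlsx": "Q3 totals, office in 90210-1234",
}

# Keep only documents that mention "elephant", and store just the
# fields we care about rather than the whole extracted content.
to_index = {
    name: {"zip_codes": extract_zip_codes(text)}
    for name, text in docs.items()
    if should_index(text, r"\belephant\b")
}
print(to_index)
```

Each surviving entry could then be sent to Elasticsearch with whatever client you prefer. The same check could be inverted to skip documents that match, which is how the suggested issue title is phrased.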

Thanks!

system · December 28, 2017, 10:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
