Complete novice here. First post, so hi everyone, and go easy on me!
I set up FSCrawler today and it is awesome. Thanks for creating it for us.
I do have an issue with content being truncated from larger files. An example is a 12 MB Excel file: not all of its content makes it into Elasticsearch.
Where do I increase the file size limit? Is it in the fs config, the elasticsearch config, or both?
Also, is it possible to index only certain content from a document into Elasticsearch, such as a zip code extracted with a regex?
Thanks for taking the time to create FSCrawler, and for taking the time to reply.
I can't believe I missed that option, having read the excellent documentation a lot yesterday.
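For anyone finding this thread later: the option in question is presumably FSCrawler's indexed_chars setting, which by default extracts only around the first 100,000 characters of each file. A minimal sketch of a job's _settings.yaml, assuming a hypothetical job named "test" and the documented "100%" value to extract everything:

```yaml
name: "test"
fs:
  url: "/path/to/data"
  # Extract all of each file's text rather than the default first ~100,000 characters
  indexed_chars: "100%"
```

With that in place the 12 MB Excel file should be extracted in full, subject to Elasticsearch's own request size limit (http.max_content_length).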
On the other point: can I look for certain keywords in the content of an Excel spreadsheet and index only those documents? For example, if the word "Elephant" is found, push the document to Elasticsearch, but ignore files that do not contain "Elephant"?
No. But maybe this is something that could be implemented, so could you open an issue in the FSCrawler project, like "Do not index doc if extracted text matches a regex"?
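In the meantime, one possible workaround for the earlier zip-code question, outside FSCrawler itself, is an Elasticsearch ingest pipeline with a grok processor; FSCrawler can send documents through a pipeline via its elasticsearch.pipeline setting. A rough sketch, assuming the extracted text lands in FSCrawler's content field and using a hypothetical pipeline name zip_extract:

```
PUT _ingest/pipeline/zip_extract
{
  "description": "Hypothetical example: copy a US zip code from the extracted text into a zip field",
  "processors": [
    {
      "grok": {
        "field": "content",
        "patterns": ["(?<zip>\\d{5}(-\\d{4})?)"],
        "ignore_failure": true
      }
    }
  ]
}
```

Note this adds a zip field when one is found but still indexes the rest of the document; it does not skip files whose content fails to match.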