Thanks!
I actually found both of these using Google, but my concerns are the following:
mapper-attachments is a plugin that basically converts documents to a text version and lets you index them, if I got it right, but I don't see how it automates the process.
FSCrawler now states it's standalone and not supported by ES, so do the results integrate with ES later?
My question was regarding the big warning "Elasticsearch 2.0.0 doesn't support anymore rivers." Since rivers were a way for ES to get data from external sources, I was asking how FSCrawler handles this in 2.0 now that they are no longer supported. I mean, are the results still compatible with and accepted by the ES database?
I didn't notice you are the owner of FSCrawler.
I guess it should still work; I will test it out and report back.
Right now I will test it on Windows; later maybe I'll have a Linux machine.
Is there any way to not include the \r\n whitespace in the _source? In my .doc and .docx files the different parts are separated by blank lines; I would prefer the newlines to stay in but not be printed...
Is there a way to make it recycle memory, or will Java just keep eating memory until there is none left or the scan finishes?
I've tried hooking the results into Kibana and I didn't get any results in the Discover tab. Here is the template I used:
I used test2 and test2* as the index name and selected "date modified" (which exists in ES) as the time field, and still got nothing in Discover; plus it only shows _source as the field it searches.
No. It's indexed as it is extracted by Tika. But TBH I did not understand what the problem is. Maybe illustrate with an example: what do you have now and what would you like to see?
Maybe some enhancements need to be done in the FSCrawler project. For sure I should support easily adding memory settings to the FSCrawler job. For now, you have to hack the script or set $JAVA_OPTS.
I opened Add FS_JAVA_OPTS JVM option · Issue #134 · dadoonet/fscrawler · GitHub for this. Feel free to contribute!
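Until that's in, a rough workaround sketch of launching the crawler with $JAVA_OPTS set (assuming the launch script honors it, as mentioned above; the script path, job name, and heap sizes are placeholders):

```python
import os
import subprocess

# Run fscrawler with an explicit heap setting passed through JAVA_OPTS.
# "bin/fscrawler" and the job name "myjob" are placeholders, and
# -Xms256m/-Xmx1g are arbitrary example values, not recommendations.
env = dict(os.environ, JAVA_OPTS="-Xms256m -Xmx1g")
subprocess.run(["bin/fscrawler", "myjob"], env=env, check=True)
```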
I haven't tested it with Kibana yet. I'd advise that you first check with simple curl commands that everything has been indexed as expected. Is that the case?
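For instance, a minimal Python equivalent of such a curl check (a sketch only; it assumes ES runs on localhost:9200, that the index is named test2 as in the earlier post, and that the extracted text lives in a content field):

```python
import json
from urllib.request import urlopen

# Fetch a few documents matching "figure" from the test2 index
# (index name taken from the earlier post; host/port and the
# "content" field name are assumptions, adjust as needed).
with urlopen("http://localhost:9200/test2/_search?q=figure&size=3") as resp:
    result = json.load(resp)

print("total hits:", result["hits"]["total"])
for hit in result["hits"]["hits"]:
    print(hit["_id"], "->", hit["_source"].get("content", "")[:80])
```

If this returns zero hits, the problem is on the indexing side rather than in Kibana.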
So the result is header\nbody\nfooter\n, and when I view it I want to see it as the original (separated) and not just one long string.
Ok nice, I will definitely try to help.
The thing is that at the end I need a dashboard, and Kibana is an easy choice here, so I must have it working. I appreciate the concern and you are correct, first try it simple, but I also need it to work, and since you haven't tested it yet I will gladly volunteer here.
First of all, it works fine with Kibana; I just had the wrong time settings, so FYI on that.
Secondly, sure, here is an example:
Defect subject: XXXX
Product: XXXX vX.X
Severity: XXX
Description:
First paragraph: A short explanation of the issue.
Technical Details:
Technical details about how the product was tested.
1. Example:
Figure
2. Example:
Figure
Recommended Remediation:
Recommendation
So I extracted your file with fscrawler and got: Defect subject: XXXX\nProduct: XXXX vX.X\nSeverity: XXX\nDescription\nFirst paragraph: A short explanation of the issue.\nTechnical Details\nTechnical details about how the product was tested.\nExample:\nFigure\n1. Example:\nFigure\n\nRecommended Remediation\n1. Recommendation\n1. Recommendation\n1. Recommendation\n\n
I was then able to search for "figure", for example, without any issue.
The problem is with the format: it's not human-readable. You can search for words and find them in the whole mess of text, as you showed, but if you are only interested in reading a particular section it's hard to quickly find where one begins and another ends...
It would be much simpler if the \n characters weren't displayed as literal strings.
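For what it's worth, those \n sequences are just how JSON escapes newlines in the raw _source; once a client decodes the document, they become real line breaks. A quick sketch (the fragment reuses the extraction above):

```python
import json

# A fragment of the raw _source as it comes over the wire: JSON escapes
# each newline as the two characters backslash + n.
raw = '{"content": "Defect subject: XXXX\\nProduct: XXXX vX.X\\nSeverity: XXX"}'

doc = json.loads(raw)
print(doc["content"])
# Output:
# Defect subject: XXXX
# Product: XXXX vX.X
# Severity: XXX
```

So the separation is preserved in the data; it's the raw JSON view that shows the escapes.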