Okay. If possible, can you please tell me the advantages of using the ingest attachment processor over converting the files to text myself and posting the JSON into Elasticsearch?
Well, ingest attachment is already a working product that you don't have to write and maintain.
On the other hand, it is limited to a subset of formats and it consumes memory on the node.
That's one of the reasons I wrote the FSCrawler project, which runs outside Elasticsearch. It also uses Tika, but the full Tika distribution, which means more file types are supported, including OCR.
But if you are happy with the extraction you did on your side, then just use it. That's perfectly fine IMO.
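For reference, here is a minimal sketch of the ingest attachment option, using 7.x-style calls of the Python Elasticsearch client; the index name "docs", the pipeline id "attachment", and the file path are only examples, not anything from this thread:

```python
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Define an ingest pipeline that runs the attachment processor (Tika under the hood)
# on the base64-encoded "data" field of incoming documents.
es.ingest.put_pipeline(
    id="attachment",
    body={
        "description": "Extract text from binary documents",
        "processors": [{"attachment": {"field": "data"}}],
    },
)

# Index a PDF through that pipeline: the processor fills attachment.content,
# attachment.content_type, etc. on the stored document.
with open("report.pdf", "rb") as f:
    es.index(
        index="docs",
        pipeline="attachment",
        body={"data": base64.b64encode(f.read()).decode("ascii")},
    )
```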
Can you give me some insight into how FSCrawler indexes the docs into Elasticsearch? Page by page, or in some other way? And how are the indices created?
It would also be helpful if you could share which libraries you used in FSCrawler (I saw Tesseract and Tika).
I understand you have very little time, so please respond whenever you can. I appreciate your contributions. Thank you so much.
How should I apply analyzers/tokenizers once I've indexed the files?
I guess it's too late. This must be done at index time.
So you define your mapping, with whatever analyzer you want to use on the content field for example, before indexing the first document.
The analyzer will be used at index time and search time.
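As a rough sketch with the Python client (the index name "docs", the "content" field, and the "english" analyzer are only examples), the mapping would be created before the first document is indexed:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the index with the analyzer attached to the "content" field
# before indexing the first document; it is then applied both at
# index time and at search time.
es.indices.create(
    index="docs",
    body={
        "mappings": {
            "properties": {
                "content": {"type": "text", "analyzer": "english"}
            }
        }
    },
)
```

To switch analyzers later you would have to create a new index with the new mapping and reindex your documents into it.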
In the meantime, we have indexed PDF files, but that is not satisfying our requirement. We need to index PDFs page by page, and neither Tika nor any other library we have found supports page-by-page extraction.
Requirement: whenever a user searches for something, the relevant page/data (not the whole PDF) should be displayed.
1. Is there any other way to index PDF files page by page?
2. Can we achieve this using the ingest attachment processor?
Please let me know; any suggestions would be helpful.
If you get traditional xhtml from Tika, it shouldn't be too hard to scrape out the "<div class="page">...</div>" elements. @dadoonet is right, though, that per page extraction doesn't currently exist in "off-the-shelf" Tika.
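As a rough illustration of that scraping idea (not off-the-shelf Tika): assuming you already have Tika's XHTML output for the PDF as raw bytes, e.g. from tika-app's --xml option, something like this pulls out one text string per page. The function name and variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Tika's XHTML output wraps each PDF page in a <div class="page"> element,
# inside the http://www.w3.org/1999/xhtml namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"

def extract_pages(xhtml: bytes) -> list[str]:
    """Return the plain text of each <div class="page"> in Tika's XHTML output."""
    root = ET.fromstring(xhtml)  # raw bytes, so the XML encoding declaration is handled
    pages = []
    for div in root.iter(XHTML + "div"):
        if div.get("class") == "page":
            # Flatten nested <p>/<span> elements into one text blob per page.
            pages.append("".join(div.itertext()).strip())
    return pages
```

Each page could then be indexed as its own Elasticsearch document (with the file name and a page number field), so a search hit points at a single page rather than the whole PDF.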