How can I index PDF and image documents into Elasticsearch? I would like to extract entities to enable keyword search. Does Workplace Search provide this functionality? Is Apache Tika used within Elasticsearch, or can the NLP modules accomplish this?
Primarily, I would like to index a few thousand PDF/image documents from
I have looked at both of these options. FSCrawler looks like the best one. It can also feed Workplace Search, which gives us a nice search UI with facets.
If I want to use Workplace Search with OneDrive or SharePoint Online as the source, can the same functionality be achieved?
If we want to use the ingest attachment plugin, how do we feed the documents (PDF/image) in bulk?
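One possible way to approach the bulk question above, as a minimal sketch rather than a definitive answer: the ingest attachment processor expects the file content as a base64-encoded field, so a script can encode each file and build an NDJSON body for the `_bulk` API. The pipeline name `attachment`, the index name, the `filename` field, and the helper `build_bulk_body` are all assumptions made for illustration.

```python
import base64
import json
from pathlib import Path

# Pipeline definition for the ingest attachment processor. "data" is the
# (assumed) field that carries the base64-encoded file content.
ATTACHMENT_PIPELINE = {
    "description": "Extract text from base64-encoded attachments",
    "processors": [{"attachment": {"field": "data"}}],
}


def build_bulk_body(paths, index_name):
    """Build an NDJSON body for the _bulk API.

    Each file contributes two lines: an action line and a source line
    holding the file name and its base64-encoded content.
    """
    lines = []
    for path in paths:
        path = Path(path)
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps({"filename": path.name, "data": encoded}))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

Under these assumptions, the pipeline definition would be stored once with `PUT _ingest/pipeline/attachment`, and each generated body would be sent to `POST _bulk?pipeline=attachment` with the `Content-Type: application/x-ndjson` header. For a few thousand documents it is probably safer to send the files in small batches, since base64 encoding inflates each file by roughly a third.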
I am also looking into whether the NLP ML models available in Elasticsearch can help tag these documents with relevant tags. That way, searches could match on the exact values of the identified tags.
Sure. I have checked that documentation, but I could not find a list of the file formats supported for those data sources. Are PDF and image file formats supported by default?
Could you please provide your input on the following question?
I am also looking into whether the NLP ML models available in Elasticsearch can help tag these documents with relevant tags. That way, searches could match on the exact values of the identified tags.