I need to parse some PDF files to find their most relevant words and work out what each PDF is about.
These PDFs are vendor catalogs. For example, I need to extract the products so I can offer them in a web search engine (currently powered by Elasticsearch, with entities parsed from a database).
Some catalogs may also be Excel files, and those files are not standardized.
I last used Elasticsearch a long time ago, and I am not a data scientist.
My project uses Elasticsearch standalone, with a PHP client:
Thank you, I will check the ingest attachment plugin documentation.
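For anyone following along, here is a rough sketch of what the ingest attachment approach expects. It only builds the two JSON bodies and sends nothing to a cluster. The field name `data` and the helper names are my own assumptions for illustration; the `attachment` processor and its `field` option come from the plugin, which expects the file content as a base64 string and writes the extracted text to `attachment.content`.

```python
import base64
import json


def build_attachment_pipeline(source_field="data"):
    """Body for PUT _ingest/pipeline/<name>.

    `source_field` is an assumed name for the field holding the base64 file.
    """
    return {
        "description": "Extract text from base64-encoded attachments",
        "processors": [{"attachment": {"field": source_field}}],
    }


def build_document(raw_bytes, source_field="data"):
    """Body for indexing one file through the pipeline.

    The attachment processor reads the file as a base64-encoded string.
    """
    return {source_field: base64.b64encode(raw_bytes).decode("ascii")}


if __name__ == "__main__":
    # Placeholder bytes, not a real PDF; in practice read the file in binary mode.
    print(json.dumps(build_attachment_pipeline(), indent=2))
    print(json.dumps(build_document(b"placeholder bytes"), indent=2))
```

The same two bodies can be sent from the PHP client; only the JSON shape matters here.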
Any advice on how to make my search in PDFs more relevant? I thought I could count words to find out which ones are the most relevant, but I don't think that would be the best solution.
Maybe there is some kind of AI for this? Or should I parse my PDFs first with another application?
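To illustrate why raw word counts tend to fall short, here is a minimal sketch of TF-IDF weighting, a common alternative that scores a word lower when it appears in many documents. This is not what Elasticsearch runs internally, just a toy version of the idea; the stopword list, the tokenizer, and the function names are assumptions made up for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "for", "in", "to", "with", "is"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def top_terms(docs, k=3):
    """Return the k highest TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty documents
        scores = {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


if __name__ == "__main__":
    catalogs = [
        "steel bolts and steel nuts catalog",
        "garden hose and garden sprinkler catalog",
    ]
    print(top_terms(catalogs, k=3))
```

In this toy input, "catalog" appears in both texts, so its score is zero even though a plain count would rank it the same as "bolts" or "hose".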
What do you mean? Do you have an example for this question?
Ideally, open a new discussion about it: this one is marked as solved, and IMO your new question is not directly related to the original one.