Hello, I am developing a backend and I have large PDF files (~20 MB) that I want to search for words in.
I could load the text from them into SQL or some other store and search it there, but that would be slow, and the PDFs contain lots of images, which makes me think of using OCR (and Elasticsearch offers this capability).
But these are not log files or anything like that; they contain only words. I tried to upload one via Kibana (the upload file page from the tutorial), but it gives me the error "File structure cannot be determined".
1 - Is it possible to search inside files that contain arbitrary text?
2 - If so, how do I upload a file and index it?
You need to extract the text from the PDFs somehow and push it to Elasticsearch to be able to search it. There are multiple ways to get the text out of PDFs. However, getting just the text is often not enough, and you want to do more with it: structure it according to the content and understand its meaning. For example, you often want to extract dates and document classifications, as well as meaningful entities such as companies and persons, as metadata. It all comes down to the use case you are really solving.
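As a rough illustration of the "extract, then push" flow, here is a minimal Python sketch. It assumes the page text has already been pulled out of the PDF by whatever extraction or OCR tool you choose, and it only builds the request bodies (an NDJSON payload for the Elasticsearch `_bulk` endpoint and a `match` query) without sending them. The index name `pdf-pages`, the field names, the sample text, and the helper functions are all hypothetical, and the ISO-date regex is just one simple way to pull out metadata.

```python
import json
import re

INDEX_NAME = "pdf-pages"  # hypothetical index name, pick your own

# Very simple metadata extraction: ISO-style dates such as 2021-03-04.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def build_bulk_body(filename, pages):
    """Build an NDJSON body for the _bulk API, one document per page.

    `pages` is a list of strings, assumed to be the text already
    extracted from each page of the PDF.
    """
    lines = []
    for page_number, text in enumerate(pages, start=1):
        # Action line: tells Elasticsearch where to index the next line.
        action = {"index": {"_index": INDEX_NAME, "_id": f"{filename}-{page_number}"}}
        # Source line: the searchable text plus some extracted metadata.
        doc = {
            "filename": filename,
            "page": page_number,
            "content": text,
            "dates": ISO_DATE.findall(text),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def build_search_body(words):
    """Build a full-text query against the hypothetical `content` field."""
    return {"query": {"match": {"content": words}}}


# Sample text standing in for the output of a PDF text extractor.
pages = ["Invoice issued on 2021-03-04 by Example Corp.", "Terms and conditions."]
bulk_body = build_bulk_body("report.pdf", pages)
search_body = build_search_body("invoice")
```

You would then POST `bulk_body` to `/_bulk` with the `Content-Type: application/x-ndjson` header and send `search_body` to `/pdf-pages/_search`. Indexing one document per page is a design choice made here so that a hit can point to a page; indexing one document per file works just as well if you only need to know which file matched.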