Is it possible to search in random files with random text in them with Elasticsearch?

Hello, I am developing a backend and I have big PDF files (~20 MB) that I want to search for words.

I could load their text into SQL or somewhere similar and search it there, but that would be slow, and the PDFs contain lots of images, which makes me think of using OCR (and Elasticsearch offers a way to do this).

But these are not log files or anything like that; they contain only words. I tried to upload one via the Kibana tutorial's "Upload file" page, but it gives me the error "File structure cannot be determined".

1 - Is it possible to search inside files that contain random text?
2 - If so, how do I upload and index a file?

You need to extract the text from the PDFs somehow and push it to Elasticsearch to be able to search it. There are multiple ways to get the text out of PDFs. Often, however, getting just the text is not enough and you want to do more with it: structure it according to the content and understand its meaning. For example, you may want to extract dates and document classifications, or meaningful entities such as companies and persons, as metadata. It all comes down to which use case you are really solving.

You can use the ingest attachment plugin.

There is an example here: Using the Attachment Processor in a Pipeline | Elasticsearch Plugins and Integrations [7.15] | Elastic

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is the Base64 representation of your binary file.
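
For a real PDF, the value of data has to be produced from the file's bytes. Here is a minimal Python sketch using only the standard library. The helper names build_attachment_doc and build_attachment_doc_from_file are made up for illustration, and the resulting JSON is meant to be sent as the body of the PUT request above with whatever HTTP client you prefer:

```python
import base64
import json
from pathlib import Path


def build_attachment_doc(raw: bytes) -> dict:
    """Return a document body whose "data" field is the Base64 text of raw."""
    # The attachment processor reads Base64 text from the field named in the
    # pipeline definition ("data" in the example above).
    return {"data": base64.b64encode(raw).decode("ascii")}


def build_attachment_doc_from_file(path: str) -> str:
    """Read a file from disk and return the JSON request body as a string."""
    # Note: a ~20 MB PDF becomes roughly 27 MB of Base64 text, so the whole
    # request has to fit in memory and within the HTTP size limits of the cluster.
    return json.dumps(build_attachment_doc(Path(path).read_bytes()))
```

Calling build_attachment_doc_from_file with the path of a local PDF gives the text to send to my_index/_doc/my_id?pipeline=attachment.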

You can use FSCrawler. There's a tutorial to help you get started.

Here is an example PDF I want to search: https://www.resmigazete.gov.tr/eskiler/2021/11/20211123.pdf .
I want to:

  • Check whether the word exists in the PDF
  • Get the text around the word we search for (for example, 100 characters before and after it).

FSCrawler will do that, as it supports OCR.
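
Once the extracted text is indexed, both requirements map onto an ordinary search with highlighting: a hit means the word exists in that document, and the highlight fragments carry the surrounding text. Below is a rough sketch of the request body, built in Python. The helper name build_context_query is hypothetical, the field name content is assumed to be where FSCrawler stores the extracted text (with the ingest attachment pipeline it would be attachment.content), and fragment_size is only an approximate target, so fragments will not be exactly 100 characters on each side:

```python
import json


def build_context_query(word: str, field: str = "content", context_chars: int = 100) -> dict:
    """Build a search body that matches `word` and returns text fragments around it."""
    return {
        # Skip the stored document itself; for large PDFs only the fragments are needed.
        "_source": False,
        "query": {"match": {field: word}},
        "highlight": {
            "fields": {
                field: {
                    # Roughly context_chars before and after the matched word.
                    "fragment_size": 2 * context_chars + len(word),
                    # Upper bound on how many fragments come back per document.
                    "number_of_fragments": 5,
                }
            }
        },
    }


# Example: print the JSON body for a search request.
print(json.dumps(build_context_query("kanun"), indent=2))
```

The printed JSON can be pasted into the Kibana console as the body of a GET request to the _search endpoint of the index that holds the documents.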
