Is it possible to search in random files with random texts in it with Elastic Search?

Sahin_Kasap · November 22, 2021, 6:41pm

Hello, I am developing a backend and I have big PDF files (~20 MB) that I want to search for words in.

I could upload the text inside them to SQL or somewhere similar and search it there, but this would be slow, and the PDFs contain lots of images, which makes me think of using OCR (and Elasticsearch makes this possible).

But these are not log files or anything like that; they contain only words. I tried to upload one via the Kibana tutorial's file upload page, but it gives me the error "File structure cannot be determined".

1 - Is it possible to search inside files with random text in them?
2 - If so, how do I upload a file and index it?

ivar.ekman · November 22, 2021, 6:48pm

You need to extract the text from the PDFs somehow and push it to Elasticsearch to be able to search it. There are multiple ways to get the text out of PDFs. However, getting just the text is often not enough; you usually want to do more with it: structure it according to the content and understand its meaning. For example, you may want to extract dates and document classifications, and often meaningful entities such as companies and persons, as metadata. It all comes down to the use case you are really solving.
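As a rough illustration of the "extract dates as metadata" idea, here is a minimal Python sketch. It assumes the text has already been extracted from the PDF, and that dates appear in a numeric day/month/year form such as 23.11.2021. The function names and the `content` and `dates` field names are hypothetical and chosen only for this example; real documents will likely need more patterns or a proper entity-extraction tool.

```python
import re
from datetime import date

# Matches numeric dates such as 23.11.2021 or 5/12/2021 (day, month, year).
# This is an assumed format; adjust the pattern to your own documents.
DATE_PATTERN = re.compile(r"\b(\d{1,2})[./](\d{1,2})[./](\d{4})\b")


def extract_dates(text):
    """Return the valid dates found in text as ISO 8601 strings."""
    found = []
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)).isoformat())
        except ValueError:
            # Skip matches that look like dates but are not (e.g. 45.13.2021).
            continue
    return found


def build_document(text):
    """Build a document body with the raw text plus extracted metadata.

    The field names "content" and "dates" are illustrative only.
    """
    return {"content": text, "dates": extract_dates(text)}
```

The resulting dictionary could then be indexed as a normal JSON document, which would let you filter or sort on `dates` in addition to full-text search on `content`.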

dadoonet · November 22, 2021, 9:34pm

You can use the ingest attachment plugin.

There is an example here: Using the Attachment Processor in a Pipeline | Elasticsearch Plugins and Integrations [7.15] | Elastic

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is basically the BASE64 representation of your binary file.
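To produce that BASE64 value from a file on disk, something like the following Python sketch should work. It uses only the standard library; the function names are hypothetical, and actually sending the body to Elasticsearch (with `?pipeline=attachment`) is left to whatever HTTP or Elasticsearch client you already use.

```python
import base64


def build_attachment_doc(file_bytes):
    """Return a document body whose "data" field holds the BASE64 text.

    "data" matches the field name configured in the pipeline above.
    """
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {"data": encoded}


def load_file_as_doc(path):
    """Read a binary file (for example a PDF) and wrap it for indexing."""
    with open(path, "rb") as handle:
        return build_attachment_doc(handle.read())
```

Keep in mind that BASE64 makes the payload roughly a third larger than the original file, so a ~20 MB PDF becomes a fairly large request body.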

You can also use FSCrawler. There's a tutorial to help you get started.

Sahin_Kasap · November 23, 2021, 6:57am

Here is an example PDF I want to search in: https://www.resmigazete.gov.tr/eskiler/2021/11/20211123.pdf
I want to:

1 - Find out whether the word exists in the PDF.
2 - Get the text around the word we search for (for example, 100 characters before and after it).

dadoonet · November 23, 2021, 8:25am

FSCrawler will do that, as it supports OCR.
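Once the text is indexed, both requirements can probably be covered by a `match` query combined with highlighting. Below is a hedged Python sketch that only builds the search request body and reads a response; the function names are hypothetical. It assumes the extracted text lives in a field named `content` (as FSCrawler does by default) or `attachment.content` (as the attachment processor does by default), and that the response has the Elasticsearch 7.x shape. Note that `fragment_size` is an approximate fragment length in characters, not an exact "100 before, 100 after" window.

```python
def build_highlight_query(word, field="content", context_chars=100):
    """Build a search body that matches word and returns text around it."""
    return {
        "query": {"match": {field: word}},
        "highlight": {
            "fields": {
                field: {
                    # Roughly context_chars on each side of the match.
                    "fragment_size": 2 * context_chars,
                    "number_of_fragments": 3,
                }
            }
        },
    }


def word_found(response):
    """Return True if the search response reports at least one hit.

    Assumes the 7.x response format, where hits.total is an object.
    """
    return response["hits"]["total"]["value"] > 0


def snippets(response, field="content"):
    """Collect the highlighted fragments from every hit."""
    result = []
    for hit in response["hits"]["hits"]:
        result.extend(hit.get("highlight", {}).get(field, []))
    return result
```

The body returned by `build_highlight_query` would be sent to `GET my_index/_search`; pass `field="attachment.content"` if you indexed through the attachment pipeline instead of FSCrawler.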

system · December 21, 2021, 8:25am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
