How attachments and file storage/searching are handled in Elasticsearch

Hello Team,

Could you please help me understand how attachments or files (PDF, text files, DOC) are stored in Elasticsearch, and how the content and the inverted index are managed for searching?
I tried searching the documentation but could not find anything specific to this.

Thank you very much for your help and support.

Thank you,
Aditya

You can use the ingest attachment plugin.

There is an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is basically the BASE64 representation of your binary file.
The content is then extracted as text and made available in the attachment.content field in this example. I recommend removing (and not storing) the BASE64 binary content, as it can consume a lot of space.
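For example, you could add a remove processor right after the attachment processor, so the binary field is dropped once the text has been extracted. A minimal sketch, using a hypothetical pipeline name attachment_no_binary:

PUT _ingest/pipeline/attachment_no_binary
{
  "description" : "Extract attachment information and drop the binary source",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    },
    {
      "remove" : {
        "field" : "data"
      }
    }
  ]
}

The extracted text still ends up in attachment.content; only the original Base64 payload is removed from the stored document.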

Thank you very much Dadoonet for your quick reply,

Yes, right. Only one query: I was thinking about the UI, i.e. how we can display the searched/matched word or phrase in the UI.
Let me try to explain: say I have a text file that contains the word "elasticsearch", I crawled the file and removed the binary field, and from Kibana I searched for "elasticsearch". This document should come up, but since we no longer have the contents, how can we show that the word was found in this document?

Thank you for your help and support.

Thank You,
Aditya

You do have the content, the text content. You don't have the binary source, that's all.
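If you kept the extracted text (attachment.content by default), you can use highlighting so the matching snippets come back with each hit. A sketch, reusing the index and field names from the earlier example:

GET my_index/_search
{
  "query": {
    "match": {
      "attachment.content": "elasticsearch"
    }
  },
  "highlight": {
    "fields": {
      "attachment.content": {}
    }
  }
}

Each hit then carries a highlight section with the matched terms wrapped in <em> tags by default, which Kibana or your own UI can render next to the document.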

Thank you very much Dadoonet. I tried crawling some sample files and I am getting the contents in the attachment.content field.
One query though, based on the comments in the documentation:

  1. Use a dedicated ingest node, as extracting content from binary data can be a resource-intensive operation.
  2. There is the overhead of converting the file to Base64 encoding.

Any thoughts on which is recommended for ingesting files, FSCrawler or the plugin?

Thank you very much for your help and support.

Thank you,
Aditya

Yes, sending big JSON documents (because of the binary content) over the network, storing them in memory, and doing the text extraction is memory consuming. I think that using dedicated ingest nodes is better for this use case, as it minimizes the risk of the data nodes running out of memory.
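If you want to check which nodes currently have the ingest role before routing the pipeline to them, something like this should show it (column names can vary a bit between versions):

GET _cat/nodes?v&h=name,node.role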

FSCrawler does this extraction in a separate java process and only sends the extracted data over the network.

Imagine a PDF document of 100 MB which contains only "Hello world" as text, and the rest is images. With the attachment plugin, you have to send the 100 MB of data over the network. With FSCrawler, only a few KB will be sent.

Thank you very much Dadoonet, it really helps.
:slight_smile:
