Storing binary files in Elastic


(Steve Wall) #1

For my project, I have two requirements regarding documents (i.e. the actual binaries):

  1. be able to search the text within the document binary; and
  2. allow the user to download the document after I display the search results.

For #1, I was going to use the mapper-attachments plugin. I am taking the actual file and base64 encoding it to get a string which I put into a field in Elastic with type "attachment". This seems to be working.

For #2, I could take that encoded string from Elastic and decode it to a temporary file and have the user download that file. Is this a bad approach?
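
A minimal sketch of that round trip, in Python with only the standard library. The helper names `encode_file` and `decode_to_temp_file` are made up for illustration and are not part of any Elastic API; how the string gets into and out of Elastic is left out.

```python
import base64
import os
import tempfile


def encode_file(path):
    """Read a file and return its contents as a base64 (ASCII) string."""
    with open(path, "rb") as fh:
        return base64.b64encode(fh.read()).decode("ascii")


def decode_to_temp_file(encoded, suffix=""):
    """Decode a base64 string into a new temporary file and return its path.

    The caller is responsible for deleting the file once it has been served.
    """
    raw = base64.b64decode(encoded)
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as fh:
        fh.write(raw)
    return path


if __name__ == "__main__":
    # Round trip with a throwaway file standing in for a real document.
    fd, original = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "wb") as fh:
        fh.write(b"some document bytes")

    encoded = encode_file(original)
    restored = decode_to_temp_file(encoded, suffix=".txt")
    with open(restored, "rb") as fh:
        print(fh.read())

    os.remove(original)
    os.remove(restored)
```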

My concern is that I have read in a couple of different places that using Elastic as a document store is not recommended, or that Elastic cannot store the binaries. Or is this a "terminology" issue: if I base64 encode the document, does that mean I am not using Elastic as a document store?

Thanks!


(David Pilato) #2

Note that the mapper-attachments plugin is deprecated and has been replaced by ingest-attachment.
The latter plugin "only" extracts text from your binary doc without storing it in elasticsearch.

Storing binary documents is not ideal. Imagine storing an MP4 movie (say, 4 GB to 10 GB) in a Lucene segment: it does not really make sense. Elasticsearch has not been designed for that purpose.
In such a case I prefer to use a separate BLOB store:

  • HDFS
  • CouchDB
  • S3
    ...

And just index the content in elasticsearch with a URL to the source blob.
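
As a rough sketch of what such an indexed document could look like, here is a small Python helper that pairs the extracted text with a pointer to the blob. The function name, the field names (`content`, `url`, `content_type`) and the example URL are all assumptions made for illustration; nothing here talks to Elasticsearch or to a blob store.

```python
import json


def build_index_doc(extracted_text, blob_url, content_type=None):
    """Build the JSON body to index: searchable text plus a pointer to the blob."""
    doc = {"content": extracted_text, "url": blob_url}
    if content_type is not None:
        doc["content_type"] = content_type
    return doc


doc = build_index_doc(
    extracted_text="God Save the Queen",
    blob_url="https://blobs.example.com/docs/my-doc.pdf",  # hypothetical location
    content_type="application/pdf",
)
print(json.dumps(doc, indent=2))
```

The search hit then carries everything needed to build a download link, and the binary itself never enters the index.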

I hope it makes sense.


(Steve Wall) #3

When I look at this:
https://www.elastic.co/guide/en/elasticsearch/plugins/master/using-ingest-attachment.html

It seems like Elastic has the full base64 encoded version of the document, since it has the mime type and other metadata of the document.

So I'm still confused as to why we would want to store the binary document twice (once in Elastic and once in a document store). It seems like we could still get the content and serve the raw file back to the user using only Elastic. Can you elaborate further on why this is a bad idea?


(David Pilato) #4

Because it is stored in Lucene, and Lucene is not really designed to be a datastore for big blob documents.
Think about segment merging: you will have a massive amount of IO any time segments are merged.
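
To put a rough number on what a merge has to move: the base64 string kept in the source is about a third larger than the binary it encodes, since every 3 bytes become 4 characters. A small sketch of that arithmetic in Python; the 4 and 10 GiB sizes are only illustrative, and this says nothing about how often merges happen.

```python
def base64_length(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


GIB = 1024 ** 3

for size_gib in (4, 10):
    encoded_chars = base64_length(size_gib * GIB)
    # One base64 character is one byte once stored as ASCII/UTF-8.
    print(size_gib, "GiB binary ->", round(encoded_chars / GIB, 2), "GiB as base64")
```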

I'd love to have a better answer and be able to say "yeah: put whatever blob in elasticsearch" but that's unfortunately not the case.

Yes. By default it is still there. I would recommend removing the BASE64 source field from the doc in the ingest processors.


(Steve Wall) #5

Okay, so I'm coming around on how this might be a bad idea. :grinning:

But your last statement raised some more questions:

  1. if it's such a bad idea, why is it still there with the new Ingest processor in ES5?
  2. How do we get rid of that source field in the mapping?
  3. Finally, we still need to give ES the base64 encoded string of the content so it can be indexed and made searchable, correct? You are only suggesting that we do not then also keep that base64 encoded string stored within ES. Am I understanding you correctly?

Thanks for your time!


(David Pilato) #6

  1. Because you still need to extract text from binary documents. Having this text indexed is a good idea. Storing its source document is not.
  2. Use the remove processor in your pipeline.
  3. Yes correct. That's what I'm suggesting here.
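
A sketch of what such a pipeline definition could look like, written as a Python dict and printed as the JSON body you would send with `PUT _ingest/pipeline/<name>`. The field name `data` is an assumption: use whatever field carries your base64 string. It also assumes the ingest-attachment plugin is installed, and it has not been run against a cluster.

```python
import json

SOURCE_FIELD = "data"  # assumption: the field that holds the base64 string

pipeline = {
    "description": "Extract text from the attachment, then drop the base64 source",
    "processors": [
        # 1. Extract text and metadata from the base64 field.
        {"attachment": {"field": SOURCE_FIELD}},
        # 2. Remove the base64 field so it is not kept in the indexed document.
        {"remove": {"field": SOURCE_FIELD}},
    ],
}

print(json.dumps(pipeline, indent=2))
```

The order matters: the remove processor has to come after the attachment processor, otherwise there is nothing left to extract from. Documents are then indexed with the `pipeline` parameter naming this pipeline.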

I'd add that if you store your document somewhere such as an HTTP server, you should probably add something like this to your indexed document:

{
  "url": "http://path.to/my-doc.pdf"
}

Sure. Happy to help!


(Steve Wall) #7

David,

Unfortunately I am on ES 2.1.1 and have to use the mapper-attachments plugin. If I have a mapping like this:

PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment"
      }
    }
  }
}

Add a single document:
(Plain Text file - Content is: "God Save the Queen" (alternatively "God Save the King")

PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
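
For anyone following along, the `file` value can be checked locally before sending it. A short Python sketch that decodes the string above and confirms it re-encodes to the same value; it assumes the file was saved as UTF-8 (plain ASCII here).

```python
import base64

payload = "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="

# Decode the base64 string back to the text of the file.
text = base64.b64decode(payload).decode("utf-8")
print(text)

# Re-encoding the text gives back exactly the string that was indexed.
roundtrip = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(roundtrip == payload)
```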

And then run this query:

GET /test/person/_search
{
  "query": {
    "match": {
      "_all": "save"
    }
  },
  "fields": [ "_source","file.content_type", "file.content" ]
}

This is my result:

  "hits": {
    "total": 1,
    "max_score": 0.10848885,
    "hits": [
      {
        "_index": "test",
        "_type": "person",
        "_id": "1",
        "_score": 0.10848885,
        "_source": {
          "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
        },
        "fields": {
          "file.content_type": [
            "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
          ],
          "file.content": [
            "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
          ]
        }
      }
    ]
  }

Three questions/comments:

  1. I see the base64 encoded value is stored in both the _source and the content fields. Is there a way in 2.1.1 to prevent it from being stored twice? Above you linked to a pipeline remove processor, but that is only ES5.
  2. How come content_type returns the base64 encoded value as well? I noticed that if I add content_type to my mapping with "store: yes", the correct content type is auto-discovered. The same applies to other fields such as content_length.
  3. In my mapping for the content field, if I say "store: yes", then that field only has the content of the document (in English). Is this likely what I want to do? Doing so, at least makes the content field much shorter (especially for larger, more complex Word docs).

Thanks again!
Steve


(David Pilato) #8

Your mapping does not ask to store file.content, so it won't work.

Look here: https://github.com/elastic/elasticsearch-mapper-attachments/tree/v3.1.2/#version-312-for-elasticsearch-21

Do something like:

DELETE /test
PUT /test
PUT /test/person/_mapping
{
  "person": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": {
            "type": "string",
            "store": true
          }
        }
      }
    }
  }
}
PUT /test/person/1?refresh=true
{
  "file": "IkdvZCBTYXZlIHRoZSBRdWVlbiIgKGFsdGVybmF0aXZlbHkgIkdvZCBTYXZlIHRoZSBLaW5nIg=="
}
GET /test/person/_search
{
  "fields": [ "file.content" ]
}

This should work. (Not tested; written off the top of my head.)
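
On the client side, the requested fields come back under `fields` in each hit, as a list of values per field. A small Python sketch for pulling them out; the helper name and the sample response are made up for illustration and only mirror the response shape, not real output from a cluster.

```python
def stored_field_values(response, field):
    """Collect the values of one stored field across all hits of a search response."""
    values = []
    for hit in response.get("hits", {}).get("hits", []):
        # Hits without the field (or without any "fields" key) are skipped.
        values.extend(hit.get("fields", {}).get(field, []))
    return values


# Hypothetical response, trimmed to the keys the helper reads.
sample_response = {
    "hits": {
        "total": 1,
        "hits": [
            {"_id": "1", "fields": {"file.content": ["example extracted text"]}},
        ],
    }
}

print(stored_field_values(sample_response, "file.content"))
```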

  1. Use the source exclude feature in the mapping if you wish. But I'm not sure it is actually stored multiple times here; that may just be a side effect of not explicitly storing the file.content field.
  2. The same goes for the other metadata fields.
  3. Yes. Store the fields you want to retrieve later. If you just want to search on those fields, then keep the default values.
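
Putting points 1 and 3 together, an index body along these lines might do it. This is a sketch built as a Python dict and printed as JSON for `PUT /test`; it reuses the index, type and field names from this thread and is untested against 2.1.1. Keep in mind that once `file` is excluded from `_source`, the original base64 string can no longer be fetched back from Elasticsearch, so the binary has to live somewhere else.

```python
import json

index_body = {
    "mappings": {
        "person": {
            # Keep the base64 string out of the stored _source.
            "_source": {"excludes": ["file"]},
            "properties": {
                "file": {
                    "type": "attachment",
                    "fields": {
                        # Store the extracted text so it can be returned in hits.
                        "content": {"type": "string", "store": True},
                    },
                }
            },
        }
    }
}

print(json.dumps(index_body, indent=2))
```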
