How to ingest/push multiple BIG attachments into an indexed document's array attribute in Elasticsearch?


(Rafal Lyczkowski) #1

Hey,

using: ES 5.1

I'm trying to figure out how to index multiple attachments for a single document into an Elasticsearch index.

My service (running on AWS) limits each HTTP request to 100 MB per single POST.

Basically, I have profiles in an ES index, and for each profile I want to store multiple searchable attachments, say up to 50 PDFs of 10 MB each.
That requirement constrains my approach, because I simply cannot send a total of 500 MB of data to ES in one request.
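
To put rough numbers on that constraint, here is a small sketch of the arithmetic. It assumes the files travel as base64 inside the JSON body (the `data` field in the mapping below suggests this, but that is an assumption) and that 1 MB means 1024 * 1024 bytes. The helper name `base64_size` is made up for illustration.

```python
MB = 1024 * 1024

def base64_size(raw_bytes):
    """Length in bytes of the padded base64 encoding of raw_bytes bytes."""
    # Base64 maps every 3 input bytes (rounded up) to 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)

request_limit = 100 * MB                 # per-POST limit from the post
one_attachment = base64_size(10 * MB)    # one 10 MB PDF, base64-encoded
all_attachments = 50 * one_attachment    # the whole array in one request

print(one_attachment < request_limit)    # a single attachment fits
print(all_attachments > request_limit)   # the full array does not
```

So any approach that sends one attachment per request stays well under the limit, while sending the whole array at once cannot.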

One approach would be some kind of partial update, but how do I make a partial update that pushes a NEW attachment onto the existing attachments array?
Or maybe use a flat attachments index that references my main index, so that I can find the profiles from there?
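
One way the "push a new attachment" idea could look is a scripted update, sketched below as a request builder. It assumes ES 5.1 syntax (the script text goes under `inline`; later versions renamed it to `source`) and a Painless script. The function name and the shape of the `attachment` object are illustrative, not something confirmed in this thread. As far as I know the update API does not run ingest pipelines, so the appended object is assumed to already contain extracted text.

```python
import json

def append_attachment_request(index, doc_type, profile_id, attachment):
    """Build (path, body) for POST /{index}/{type}/{id}/_update.

    Only the new attachment travels over the wire; the script appends it
    to the stored array on the Elasticsearch side.
    """
    path = "/{}/{}/{}/_update".format(index, doc_type, profile_id)
    body = {
        "script": {
            "lang": "painless",
            # "inline" is the ES 5.1 key for the script text (assumption
            # based on the version named in the post).
            "inline": (
                "if (ctx._source.attachments == null) "
                "{ ctx._source.attachments = []; } "
                "ctx._source.attachments.add(params.attachment);"
            ),
            "params": {"attachment": attachment},
        }
    }
    return path, body

path, body = append_attachment_request(
    "directory.index.v7", "profile.event", "42",
    {"filename": "cv.pdf", "attachment": {"content": "extracted text"}},
)
print(path)
print(json.dumps(body, indent=2))
```

The request itself stays small, but Elasticsearch still rewrites the whole stored document on every update, so a profile with 50 large attachments is reindexed in full each time one is appended.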

I also have to support highlighting in the results, so the best approach for me is a mapping like this:

{
  "directory.index.v7": {
    "mappings": {
      "profile.event": {
        "properties": {
          "attachments": {
            "properties": {
              "attachment": {
                "properties": {
                  "content": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "content_length": {
                    "type": "long"
                  },
                  "content_type": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "date": {
                    "type": "date"
                  },
                  "language": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              },
              "data": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "filename": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "email": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
  }
}

But as I mentioned before:

  • I cannot ingest all attachments at once.
  • I don't know how, or whether it is even possible, to PUSH an attachment onto the attachments attribute without resending the older ones (so as not to hit the POST limit).

Please advise!


(David Pilato) #2

Instead of indexing one array of attachments, why not index the attachments individually, one by one?
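
A minimal sketch of what "one by one" could mean in practice: each attachment becomes its own small document that carries a reference back to its profile. The `profile_id` field, the id scheme and the function name are assumptions for illustration; `email`, `filename` and `data` are taken from the mapping in #1.

```python
def attachment_docs(profile_id, email, attachments):
    """Yield (doc_id, body) pairs, one document per attachment.

    Each body is sent in its own index request, so a single request
    never carries more than one attachment.
    """
    for position, att in enumerate(attachments):
        # Hypothetical id scheme: profile id plus position in the list.
        doc_id = "{}-{}".format(profile_id, position)
        yield doc_id, {
            "profile_id": profile_id,   # link back to the profile
            "email": email,             # profile info copied onto the attachment
            "filename": att["filename"],
            "data": att["data"],        # base64 payload, as in the original mapping
        }

docs = list(attachment_docs(
    "42", "jane@example.com",
    [{"filename": "cv.pdf", "data": "JVBERi0="},
     {"filename": "cover.pdf", "data": "JVBERi0="}],
))
for doc_id, body in docs:
    print(doc_id, body["filename"])
```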


(Rafal Lyczkowski) #3

This is not a collection of attachments, though, but of "profiles having attachments". I want to find PROFILES by searching through the attachments array.


(David Pilato) #4

If you store the profile information with each attachment, then maybe you can retrieve that information...
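
One possible way to get from matching attachments back to profiles is to group the hits by profile at search time. The sketch below builds such a search body, assuming each attachment document has a `profile_id` keyword field and its extracted text in `attachment.content` (both assumptions), and using a `terms` aggregation with a `top_hits` sub-aggregation that also returns highlights.

```python
def profiles_by_attachment_text(text, max_profiles=10, hits_per_profile=3):
    """Search body: match attachments on text, bucket the matches per profile."""
    return {
        "size": 0,  # only the per-profile buckets are needed, not raw hits
        "query": {"match": {"attachment.content": text}},
        "aggs": {
            "profiles": {
                # One bucket per profile that has at least one matching attachment.
                "terms": {"field": "profile_id", "size": max_profiles},
                "aggs": {
                    "matches": {
                        "top_hits": {
                            "size": hits_per_profile,
                            "_source": ["profile_id", "filename"],
                            "highlight": {"fields": {"attachment.content": {}}},
                        }
                    }
                },
            }
        },
    }

body = profiles_by_attachment_text("kubernetes")
print(sorted(body["aggs"]["profiles"]["aggs"]))
```

Note that `terms` buckets are ordered by document count by default, not by relevance, so this ranks profiles by how many of their attachments match.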

Otherwise, I'd recommend doing the extraction on your side before sending the data to Elasticsearch. That way the JSON documents will not contain huge BASE64 content, just the extracted text.

Similar to what FSCrawler is doing. BTW you can use it and its REST layer to simulate an upload and get back the extracted text... Might help.


(Rafal Lyczkowski) #5

Currently I'm trying to PARSE the docs on my end to get the text only; docs of 40-100 MB actually produce up to 500 kB of text. But still, having an array pushes me towards a parent-child approach instead of a nested field.
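
For reference, a rough sketch of what the parent-child variant might look like on ES 5.x, where a child type declares `_parent` in its mapping and each child is indexed with a `parent` parameter. The type names `profile` and `attachment`, the field names and the helper functions are all illustrative assumptions. The `has_child` query returns profiles, and `inner_hits` carries the highlighted attachment text.

```python
def parent_child_mappings():
    """Index body declaring 'attachment' as a child type of 'profile' (ES 5.x style)."""
    return {
        "mappings": {
            "profile": {"properties": {"email": {"type": "text"}}},
            "attachment": {
                "_parent": {"type": "profile"},
                "properties": {
                    "filename": {"type": "text"},
                    "content": {"type": "text"},  # text extracted on the client
                },
            },
        }
    }

def child_index_path(index, profile_id, attachment_id):
    """Path for indexing one attachment under its parent profile."""
    return "/{}/attachment/{}?parent={}".format(index, attachment_id, profile_id)

def profiles_with_matching_attachment(text):
    """Search body returning profiles whose attachments match, with highlights."""
    return {
        "query": {
            "has_child": {
                "type": "attachment",
                "query": {"match": {"content": text}},
                "inner_hits": {"highlight": {"fields": {"content": {}}}},
            }
        }
    }

print(child_index_path("directory.index.v8", "42", "42-0"))
```

Each attachment is its own small request here, so the per-POST limit applies to one attachment at a time rather than to the whole profile.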


(David Pilato) #6

Maybe you can change the Elasticsearch settings to allow more than 100 MB per request over the wire?

You wrote: "I have some limitations around my service (using AWS) that limits my HTTP request up to 100MB per single POST."

Are you using the Elasticsearch service provided by AWS?
Or just EC2 instances where you deployed Elasticsearch yourself?

