How to ingest/push multiple BIG attachments into indexed document's array attribute in ElasticSearch?

mbm-rafal · July 19, 2018, 5:30pm

Hey,

using: ES 5.1

I'm trying to figure out how to index multiple attachments for a single document into an Elasticsearch index.

My service (running on AWS) limits each HTTP request to 100 MB per single POST.

Basically, I have profiles in an ES index, and for each profile I want to store multiple searchable attachments, say up to 50 x 10 MB PDFs.
That requirement limits my approach, because I simply cannot send a total of 500 MB of data to ES in one request.

One approach would be some kind of partial update, but how do I make a partial update that pushes a NEW attachment onto the existing attachments array?
Or maybe use a separate, flat attachments index that references my main index, so I can find the profiles from it?

I also have to support highlighting in the results, so the best approach for me is to have a mapping like this:

{
  "directory.index.v7": {
    "mappings": {
      "profile.event": {
        "properties": {
          "attachments": {
            "properties": {
              "attachment": {
                "properties": {
                  "content": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "content_length": {
                    "type": "long"
                  },
                  "content_type": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  },
                  "date": {
                    "type": "date"
                  },
                  "language": {
                    "type": "text",
                    "fields": {
                      "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                      }
                    }
                  }
                }
              },
              "data": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "filename": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "email": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
  }
}

But as I mentioned before:

I cannot ingest all attachments at once.
I don't know how, or whether it is even possible, to PUSH an attachment onto the attachments attribute without re-sending the older docs (so as not to hit the POST limit).
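
For reference, a rough sketch of what such an append-only partial update could look like, written in Python as a plain request builder. It assumes the ES 5.1 scripted update syntax (Painless, with the script text under `inline`) and that the attachment text has already been extracted on the client, since as far as I know the update API does not run ingest pipelines. The helper name `build_append_request` and the example field values are hypothetical. Elasticsearch still rewrites the whole document internally, but each HTTP request only carries the one new attachment.

```python
import json


def build_append_request(index, doc_type, profile_id, attachment):
    """Return (path, body) for a scripted update that appends one attachment.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    path = "/{}/{}/{}/_update".format(index, doc_type, profile_id)
    body = {
        "script": {
            "lang": "painless",
            # ES 5.1 calls the script text "inline"; later versions call it "source".
            "inline": (
                "if (ctx._source.attachments == null) "
                "{ ctx._source.attachments = new ArrayList(); } "
                "ctx._source.attachments.add(params.attachment)"
            ),
            "params": {"attachment": attachment},
        },
        # If the profile does not exist yet, create it with this first attachment.
        "upsert": {"attachments": [attachment]},
    }
    return path, body


path, body = build_append_request(
    "directory.index.v7",
    "profile.event",
    "42",  # assumed profile id
    {"filename": "cv.pdf", "attachment": {"content": "extracted text ..."}},
)
payload = json.dumps(body)  # POST this payload to `path`
```

Each call produces a small request, so the 100 MB POST limit applies per attachment rather than to the whole array.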

Please advise!

dadoonet · July 19, 2018, 8:28pm

Instead of indexing one array of attachments, why not index the individual attachments one by one?

mbm-rafal · July 20, 2018, 9:20am

This is not a collection of attachments, though, but of "profiles having attachments". I want to find PROFILES by searching through the attachments array.

dadoonet · July 20, 2018, 5:28pm

If you store the profile information with each attachment, then you may be able to retrieve that information...
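
A minimal sketch of that idea in Python, assuming one document per attachment that carries a copy of the profile fields. The field names `profile_id`, `email`, `filename` and `content` and all three helper names are made up for illustration. The hit shape (`_source`, `highlight`) follows a normal search response, and the last function collapses attachment hits back into profiles.

```python
from collections import OrderedDict


def attachment_doc(profile_id, email, filename, content):
    """One small document per attachment, denormalized with profile info."""
    return {
        "profile_id": profile_id,
        "email": email,
        "filename": filename,
        "content": content,
    }


def content_query(text):
    """Full-text search on the extracted content, with highlighting."""
    return {
        "query": {"match": {"content": text}},
        "highlight": {"fields": {"content": {}}},
    }


def profiles_from_hits(hits):
    """Group attachment hits by profile, keeping the order of first match."""
    profiles = OrderedDict()
    for hit in hits:
        source = hit["_source"]
        entry = profiles.setdefault(
            source["profile_id"], {"email": source["email"], "matches": []}
        )
        entry["matches"].append({
            "filename": source["filename"],
            "highlight": hit.get("highlight", {}).get("content", []),
        })
    return profiles
```

Here `hits` would be the `response["hits"]["hits"]` list of a search against the attachments index. The trade-off is that profile fields are duplicated on every attachment and have to be updated in all of them when a profile changes.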

Otherwise, I'd recommend doing the extraction on your side before sending the data to Elasticsearch. That way the JSON documents will not carry huge BASE64 content, just the extracted text.
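
To put a number on the BASE64 overhead: base64 turns every 3 raw bytes into 4 output characters, so the payload grows by about a third before any text is extracted. A small Python sketch of the arithmetic, where the 10 MB file size is taken from the question and the helper name `base64_size` is hypothetical:

```python
import base64


def base64_size(raw_size):
    """Length in bytes of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


MB = 1024 * 1024
single_pdf = 10 * MB                     # one attachment, as in the question
encoded_pdf = base64_size(single_pdf)    # size of the same file once base64 encoded
overhead = encoded_pdf / float(single_pdf)  # roughly 4/3

# Sanity check of the formula against the real encoder on a small payload.
sample = b"x" * 1000
assert len(base64.b64encode(sample)) == base64_size(len(sample))
```

Sending only the extracted text avoids both the original binary size and this encoding overhead.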

This is similar to what FSCrawler does. BTW, you can use it and its REST layer to simulate an upload and get back the extracted text... Might help.

mbm-rafal · July 20, 2018, 5:55pm

Currently I'm trying to PARSE the docs on my end to get the text only; 40-100 MB docs actually produce up to 500 kB of text. But having an array still forces me to try a parent-child approach instead of a nested field.
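
A rough sketch of how the parent-child variant could be laid out on ES 5.x, again as plain Python request builders. It assumes the 5.x style `_parent` mapping on the child type, the `parent` URL parameter when indexing a child, and a `has_child` query with `inner_hits` to get highlights from the matching attachments. The type name `attachment`, the field names and the helper names are hypothetical.

```python
from urllib.parse import quote


def child_mapping(parent_type):
    """Mapping for the child (attachment) type, pointing at its parent type."""
    return {
        "_parent": {"type": parent_type},
        "properties": {
            "filename": {"type": "keyword"},
            "content": {"type": "text"},
        },
    }


def child_index_path(index, child_type, attachment_id, parent_id):
    """Path for indexing one attachment as a child of the given profile."""
    return "/{}/{}/{}?parent={}".format(
        index, child_type, quote(attachment_id, safe=""), quote(parent_id, safe="")
    )


def profiles_by_attachment_text(child_type, text):
    """Query on the parent type: profiles with a matching child attachment."""
    return {
        "query": {
            "has_child": {
                "type": child_type,
                "query": {"match": {"content": text}},
                # inner_hits returns the matching children, with highlights.
                "inner_hits": {"highlight": {"fields": {"content": {}}}},
            }
        }
    }
```

With this layout each attachment is its own small request, and the search still returns profiles as the top-level hits.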

dadoonet · July 21, 2018, 5:43am

Maybe you can change the Elasticsearch settings to allow more than 100 MB per request over the wire?

"I have some limitations around my service (using AWS) that limits my HTTP request up to 100MB per single POST."

Are you using the Elasticsearch service provided by AWS?
Or just EC2 instances where you deployed Elasticsearch yourself?

system · August 18, 2018, 5:43am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
