Identifying Attachment that Matches Query

I am searching over emails, each of which has attachments that I've ingested using the ingest attachment pipeline. Many emails have multiple attachments, and sometimes it is the content of an attachment itself (the extracted text) that satisfies the query.

My question is: how do I get the filename of the attachment that matches? I understand how to get the highlighted content of the matched field, but that is different from the filename.

The mapping I'm using for attachments is just the fields that the ingest-attachment plugin provides, applied with a foreach pipeline:

"attachments" : {
  "properties" : {
    "attachment" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "content" : {
          "type" : "text"
        },
        "content_length" : {
          "type" : "long"
        },
        "content_type" : {
          "type" : "keyword"
        },
        "date" : {
          "type" : "date"
        },
        "language" : {
          "type" : "keyword"
        }
      }
    },
    "data" : {
      "type" : "object",
      "enabled" : false
    },
    "filename" : {
      "type" : "keyword"
    }
  }
}

Currently I can get the highlighted content of the attachments like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "_all",
            "query": "\"linkedin\"",
            "_name": "all fields"
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 1,
  "highlight": {
    "fields": {
      "attachments.filename": {},
      "attachments.attachment.content": {}
    },
    "require_field_match": false
  },
  "_source": {
    "excludes": [
      "attachments.attachment.content",
      "attachments.data"
    ]
  }
}

But as stated above, this isn't what I need.

Thanks!

Just to make sure I understand: you index multiple attachments within an array, like this:

{
  "attachments": [
    {
      "attachment": {
        "content": "BASE64"
      },
      "filename": "file1.txt"
    },
    {
      "attachment": {
        "content": "BASE64"
      },
      "filename": "file2.txt"
    }
  ]
}

Am I correct? What is your mapping then?

I'm not sure I completely understand the question. I copied and pasted the attachment part of my mapping above.

Are you using the nested type for the attachments field? I believe you aren't, but I prefer to ask.

If you want to keep a relationship between the file content and the file name, you need to index separate documents in Lucene.

Which means either:

  • using separate documents in Elasticsearch (one per attachment, with all the email details copied into each attachment document)
  • using nested documents: each attachment is indexed internally as a separate document in Lucene, alongside another document that contains all the other fields.
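For the second option, a minimal sketch of what the mapping could look like is shown below. It only builds the mapping body as a Python dict; the helper name nested_attachments_mapping is hypothetical, and the sub-fields are simply repeated from the mapping posted earlier in the thread. The only real change is "type": "nested" on attachments. An existing object field cannot be switched to nested in place, so this would need a new index and a reindex.

```python
import json


def nested_attachments_mapping():
    """Return the "attachments" part of a mapping, declared as nested.

    Hypothetical helper: only "type": "nested" is new; every sub-field
    repeats the mapping posted earlier in the thread.
    """
    return {
        "attachments": {
            "type": "nested",  # each array element becomes its own Lucene document
            "properties": {
                "attachment": {
                    "properties": {
                        "author": {
                            "type": "text",
                            "fields": {
                                "keyword": {"type": "keyword", "ignore_above": 256}
                            },
                        },
                        "content": {"type": "text"},
                        "content_length": {"type": "long"},
                        "content_type": {"type": "keyword"},
                        "date": {"type": "date"},
                        "language": {"type": "keyword"},
                    }
                },
                "data": {"type": "object", "enabled": False},
                "filename": {"type": "keyword"},
            },
        }
    }


if __name__ == "__main__":
    # Print the fragment so it can be pasted into an index-creation request.
    print(json.dumps(nested_attachments_mapping(), indent=2))
```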

It seems the most straightforward thing would be to use an explicitly nested type instead of just an array of attachments. I'll need to figure out how to change my pipeline to handle this.

Given that attachments are nested, how would I change my query above to get the matched filenames?

I believe that with nested fields you have to use a nested query with inner hits. The inner hits tell you which nested object matched, so you can easily access its filename field.
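As a rough sketch of that suggestion, assuming attachments has been remapped as nested: the first function builds a search body with a nested query and inner_hits, and the second pulls the matching filenames out of a response. Both function names are hypothetical. The query targets only attachments.attachment.content rather than _all; searching the other email fields as well would mean combining it with further clauses in a bool query.

```python
import json


def attachment_search_body(phrase):
    """Build a search body that reports which nested attachment matched."""
    return {
        "query": {
            "nested": {
                "path": "attachments",
                "query": {
                    "match_phrase": {"attachments.attachment.content": phrase}
                },
                "inner_hits": {
                    # Skip the (large) nested _source and return only the
                    # filename, read from doc values of the keyword field.
                    "_source": False,
                    "docvalue_fields": ["attachments.filename"],
                    "highlight": {
                        "fields": {"attachments.attachment.content": {}}
                    },
                },
            }
        },
        "_source": {
            "excludes": ["attachments.attachment.content", "attachments.data"]
        },
    }


def matched_filenames(response):
    """Map each email hit's _id to the filenames of its matching attachments."""
    result = {}
    for hit in response["hits"]["hits"]:
        # Inner hits are keyed by the nested path unless a name is given.
        inner = hit.get("inner_hits", {}).get("attachments", {})
        names = []
        for inner_hit in inner.get("hits", {}).get("hits", []):
            names.extend(inner_hit.get("fields", {}).get("attachments.filename", []))
        result[hit["_id"]] = names
    return result


if __name__ == "__main__":
    print(json.dumps(attachment_search_body("linkedin"), indent=2))
```

The body returned by attachment_search_body can be sent to the _search endpoint as-is; matched_filenames expects the parsed JSON response.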

Hmm, this may not be sufficient. I was hoping that if I search _all and the match happens to be in an attachment, I could know which one.

No, that's not possible. _all is basically a single field into which the values of all the other fields have been copied, so it cannot tell you which attachment a match came from.

You can only do that if you create one Elasticsearch document per attachment (which I'd recommend).
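A minimal sketch of that recommendation, done on the client side before indexing: split each email into one document per attachment and copy the email-level fields into each. The function name split_email, the email_id field, and the id scheme are all assumptions for illustration. With data and filename at the top level of each document, the attachment processor could presumably run on data directly, without the foreach.

```python
def split_email(email_id, email):
    """Yield (doc_id, doc) pairs, one per attachment of the given email.

    Hypothetical helper. Email-level fields are copied into every
    attachment document; "email_id" lets the pieces be grouped again.
    Emails without attachments yield nothing in this sketch.
    """
    shared = {key: value for key, value in email.items() if key != "attachments"}
    for position, attachment in enumerate(email.get("attachments", [])):
        doc = dict(shared)
        doc["email_id"] = email_id
        doc["filename"] = attachment.get("filename")
        doc["data"] = attachment.get("data")  # base64 payload for the ingest pipeline
        yield f"{email_id}-{position}", doc


if __name__ == "__main__":
    email = {
        "subject": "Profiles",
        "attachments": [
            {"data": "BASE64", "filename": "file1.txt"},
            {"data": "BASE64", "filename": "file2.txt"},
        ],
    }
    for doc_id, doc in split_email("email-1", email):
        print(doc_id, doc["filename"])
```

Any hit on such an index, including one found through _all, then belongs to exactly one attachment, and its filename is simply a field of the hit.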
