Identifying Attachment that Matches Query

jattenberg · March 16, 2017, 3:05pm

I am searching over emails with attachments that I've ingested using the ingest attachment pipeline. Many emails have multiple attachments, and sometimes it is the content of an attachment (the extracted text) that satisfies the query.

My question is: how do I get the filename of the attachment that matches the query? I understand how to get the highlighted content of the matched field, but that is different from the filename.

The mapping I'm using for attachments is just the fields that the ingest-attachment plugin provides, populated by a foreach pipeline:

"attachments" : {
  "properties" : {
    "attachment" : {
      "properties" : {
        "author" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "content" : {
          "type" : "text"
        },
        "content_length" : {
          "type" : "long"
        },
        "content_type" : {
          "type" : "keyword"
        },
        "date" : {
          "type" : "date"
        },
        "language" : {
          "type" : "keyword"
        }
      }
    },
    "data" : {
      "type" : "object",
      "enabled" : false
    },
    "filename" : {
      "type" : "keyword"
    }
  }
}
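
For reference, a foreach pipeline consistent with this mapping might look like the sketch below. The field names (attachments, data, attachment) come from the mapping above; the description text and the exact processor options are assumptions, and the body is built as a Python dict only so it can be printed as the JSON that would be sent when creating the pipeline.

```python
import json

# Hypothetical pipeline body; only the field names are taken from the mapping.
pipeline = {
    "description": "Extract text from every attachment of an email",
    "processors": [
        {
            "foreach": {
                # Run the inner processor once per element of the array.
                "field": "attachments",
                "processor": {
                    "attachment": {
                        # Base64 payload of the current array element.
                        "field": "_ingest._value.data",
                        # Write the extracted fields next to that payload.
                        "target_field": "_ingest._value.attachment",
                    }
                },
            }
        }
    ],
}

print(json.dumps(pipeline, indent=2))
```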

Currently I can get the highlighted content of the attachments like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "_all",
            "query": "\"linkedin\"",
            "_name": "all fields"
          }
        }
      ]
    }
  },
  "from": 0,
  "size": 1,
  "highlight": {
    "fields": {
      "attachments.filename": {},
      "attachments.attachment.content": {}
    },
    "require_field_match": false
  },
  "_source": {
    "excludes": [
      "attachments.attachment.content",
      "attachments.data"
    ]
  }
}

But as stated above, this isn't what I need.

Thanks!

dadoonet · March 16, 2017, 3:20pm

Just to make sure I understand. You index multiple attachments within an array like:

{
  "attachments": [ {
      "attachment": {
        "content": "BASE64"
      },
      "filename": "file1.txt"
    },{
      "attachment": {
        "content": "BASE64"
      },
      "filename": "file2.txt"
    }
  ]
}

Am I correct? What is your mapping then?

jattenberg · March 16, 2017, 3:33pm

I'm not sure I completely understand the question. I just copied and pasted the attachment part of my mapping above.

dadoonet · March 16, 2017, 3:42pm

Are you using the nested type for the attachments field? I believe you aren't, but I prefer to ask.

If you want to keep a relationship between the file content and the file name, you need to index them as separate documents in Lucene.

Which means either:

- separate documents in Elasticsearch (one per attachment, copying all the email details into each attachment document), or
- nested documents: each attachment is indexed internally as a separate Lucene document, alongside another document that contains all the other fields.
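
As a rough sketch of the nested option, the attachments mapping would declare "type": "nested" at the top level and keep the same sub-fields. Only a few of the sub-fields from the original mapping are repeated here, and the variable name is made up for illustration. Note that an existing object field cannot be switched to nested in place, so this would mean creating a new index and reindexing.

```python
import json

# Sketch of the "attachments" part of a mapping, using the nested type.
# Sub-fields are abbreviated; the rest would stay as in the original mapping.
attachments_mapping = {
    "attachments": {
        # Each array element becomes its own hidden Lucene document.
        "type": "nested",
        "properties": {
            "attachment": {
                "properties": {
                    "content": {"type": "text"},
                    "content_type": {"type": "keyword"},
                }
            },
            "data": {"type": "object", "enabled": False},
            "filename": {"type": "keyword"},
        },
    }
}

print(json.dumps(attachments_mapping, indent=2))
```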

jattenberg · March 16, 2017, 4:10pm

It seems the most straightforward thing would be to use an explicitly nested type instead of just an array of attachments. I'll need to figure out how to change my pipeline to handle this.

Given that attachments are nested, how would I change my query above to get the matched filenames?

dadoonet · March 16, 2017, 4:24pm

I believe that if you have nested fields, you have to use a nested query with nested inner hits. Inner hits tell you which nested object matched, so you can easily access its filename field.
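
Something along these lines, as a minimal sketch. It assumes attachments is mapped as nested and queries attachments.attachment.content directly. The two function names and the sample response are invented for illustration; the sample only mimics the general shape of a search response with inner hits, where each inner hit's _source is the matching nested object.

```python
import json

def nested_attachment_query(text):
    """Build a search body that matches inside individual attachments."""
    return {
        "query": {
            "nested": {
                "path": "attachments",
                "query": {"match": {"attachments.attachment.content": text}},
                # Ask for the nested objects that matched, with highlights.
                "inner_hits": {
                    "highlight": {
                        "fields": {"attachments.attachment.content": {}}
                    }
                },
            }
        },
        "_source": {
            "excludes": ["attachments.attachment.content", "attachments.data"]
        },
    }

def matched_filenames(response):
    """Collect the filename of every nested attachment that matched."""
    names = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get("attachments", {})
        for inner_hit in inner.get("hits", {}).get("hits", []):
            names.append(inner_hit["_source"]["filename"])
    return names

# Hand-written sample response (an assumption, not real output).
sample_response = {
    "hits": {"hits": [{
        "_id": "1",
        "inner_hits": {"attachments": {"hits": {"hits": [
            {"_source": {"filename": "file2.txt"}}
        ]}}},
    }]}
}

print(json.dumps(nested_attachment_query("linkedin"), indent=2))
print(matched_filenames(sample_response))
```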

jattenberg · March 16, 2017, 5:05pm

Hmm, this may not be sufficient. I was hoping that if I search _all and the match happens to be in an attachment, I could find out which one.

dadoonet · March 16, 2017, 5:25pm

No. That's not possible. _all is basically a single field into which all of the document's data has been copied.

You can only do that if you create one Elasticsearch document per attachment (which I'd recommend).
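
A minimal sketch of that split, done on the client before indexing. The function name and the email fields (subject, from) are made up for illustration; only attachments, filename and data come from the mapping earlier in the thread. Each resulting document carries exactly one filename, so any hit on _all identifies its attachment through _source.filename.

```python
def split_email(email):
    """Return one document per attachment, each with the email details copied in."""
    # Everything except the attachments array is shared by all documents.
    shared = {key: value for key, value in email.items() if key != "attachments"}
    docs = []
    for attachment in email.get("attachments", []):
        doc = dict(shared)
        doc["filename"] = attachment["filename"]
        # Base64 payload, to be run through a plain attachment processor.
        doc["data"] = attachment["data"]
        docs.append(doc)
    return docs

# Hypothetical email with two attachments.
email = {
    "subject": "Quarterly report",
    "from": "someone@example.com",
    "attachments": [
        {"filename": "file1.txt", "data": "BASE64"},
        {"filename": "file2.txt", "data": "BASE64"},
    ],
}

for doc in split_email(email):
    print(doc["filename"], doc["subject"])
```

With this layout the pipeline no longer needs foreach, since every document has a single data field.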
