Not able to search through attachment contents


(Suyog Kale) #1

I am new to Elasticsearch and am evaluating PDF file indexing.

I am using the NEST .NET client to index PDF files, following the steps described in this Stack Overflow post: http://stackoverflow.com/questions/25917386/client-net-nest-with-attachment-highlight-feature

I am able to index the PDF contents using the Convert.ToBase64String method, and the document is getting indexed.

I am able to search plain text in the Title field, but I cannot search the text contents of the PDF file; it returns zero hits.

Can someone please help with this?
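For readers following along, here is a minimal sketch of what that encoding step produces, written in Python only to keep it language-neutral. The field names `title` and `file` and the sample bytes are assumptions for illustration, not the poster's actual document shape.

```python
import base64
import json


def build_attachment_doc(title: str, file_bytes: bytes) -> dict:
    """Return a JSON-serializable document whose file content is Base64 text."""
    # Same idea as Convert.ToBase64String in C#: raw bytes become ASCII text.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    # "title" and "file" are hypothetical field names.
    return {"title": title, "file": encoded}


# Placeholder bytes standing in for a real PDF read from disk.
doc = build_attachment_doc("Sample", b"%PDF-1.4 sample bytes")
print(json.dumps(doc))
```

Encoding alone does not make the text searchable: the Base64 string is only parsed when the target field is mapped with the attachment type.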


(David Pilato) #3

Did you install the mapper attachments plugin? What can you see in the logs?


(Suyog Kale) #4

Yes, I have installed the mapper attachments plugin and restarted the cluster. It is displayed in the cluster's plugin list. Please refer to the screenshot:

Below is a screenshot of the uploaded sample index:


(Suyog Kale) #5

Here is the sample C# code I have used:


(David Pilato) #6

Anything in the Elasticsearch logs?


(Suyog Kale) #7

Where can I find the log files? Is there an option for this in ES-Head?


(Suyog Kale) #8

I don't see any errors in the ES logs:


(David Pilato) #9

Can you show the mapping for your type and what a typical JSON document looks like?


(Suyog Kale) #11

Please refer to the C# code below for the mapping. I am uploading the PDF as a file:


(David Pilato) #12

Can you please run the following queries on your cluster?

  • GET /data/doc/1
  • GET /data/doc/_mapping

(Suyog Kale) #13

Yes, here are the results:


(David Pilato) #14

As you can see, your mapping is incorrect.
There is no attachment type in it.

So the file content is not analyzed with the mapper attachments plugin.

Remove your index, create it again, PUT the mapping, check that it has been applied, then index your docs.
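A rough sketch of that sequence as REST calls, written in Python only to build and print the request bodies (nothing is sent to a cluster). The index `data` and type `doc` come from the GET requests earlier in the thread; the field name `file` is an assumption, so substitute whichever field holds the Base64 content.

```python
import json

INDEX = "data"    # index name used in the GET requests above
DOC_TYPE = "doc"  # type name used in the GET requests above
FIELD = "file"    # hypothetical name of the field holding the Base64 content


def attachment_mapping(doc_type: str, field: str) -> dict:
    """Build a mapping body that declares `field` as an attachment."""
    return {doc_type: {"properties": {field: {"type": "attachment"}}}}


# The order matters: the mapping must be in place before any document is indexed.
steps = [
    ("DELETE", f"/{INDEX}", None),                   # remove the old index
    ("PUT", f"/{INDEX}", None),                      # create it again
    ("PUT", f"/{INDEX}/{DOC_TYPE}/_mapping",
     attachment_mapping(DOC_TYPE, FIELD)),           # apply the mapping
    ("GET", f"/{INDEX}/{DOC_TYPE}/_mapping", None),  # check it was applied
]

for method, path, body in steps:
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

If the final GET does not show `"type": "attachment"` for the field, the documents indexed afterwards will not have their content extracted.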


(Suyog Kale) #15

How should I define the field type as an attachment? Is there a reference link for C#?


(David Pilato) #16

You can read the doc: https://github.com/elastic/elasticsearch-mapper-attachments#using-mapper-attachments

I don't know C#, so I can't tell you how to translate that into that language. It might not be hard though.


(Mark Walkom) #17

@Suyog_Kale FYI in all of your pictures we can see your Found cluster ID, which means someone can potentially get access to your data.

I'd strongly suggest that you remove/edit the pictures.


(Suyog Kale) #18

Thank you David,

Now I am able to configure the mapping and index the PDF contents.

The problem now is that when I execute a search it returns records, but I am not able to highlight the actual file contents; it displays the file's binary data:

Any suggestion?


(Suyog Kale) #19

I also observed that even when there is no match in the contents, the search returns all records:


(David Pilato) #20

The Head plugin is buggy. Use POST instead of GET.
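For illustration, a search body that matches on the extracted content and asks for highlights could look like the sketch below (Python is used only to build and print the JSON). The field name `file` is an assumption, and `data` is the index from the earlier GET requests. The mapper attachments README also describes storing the `content` sub-field with term vectors for highlighting, which is worth checking against the README for your plugin version.

```python
import json


def content_search(field: str, text: str) -> dict:
    """Build a search body that queries and highlights the attachment content."""
    content_field = f"{field}.content"  # sub-field produced by the attachment type
    return {
        "query": {"match": {content_field: text}},
        "highlight": {"fields": {content_field: {}}},
    }


# "file" is a hypothetical field name; "abc" is a placeholder search term.
body = content_search("file", "abc")
print("POST /data/_search")
print(json.dumps(body, indent=2))
```

If the request body is not sent, the search has no query and matches every document, which would explain getting all records back.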


(Aj) #21

Hi, I have the same issue: I cannot search the contents of the attached document using the NEST client.
My mapping is:

 {
 "mydocs": {
  "mappings": {
     "indexdocument": {
        "properties": {
           "docLocation": {
              "type": "string",
              "index": "not_analyzed",
              "store": true
           },
           "documentType": {
              "type": "string",
              "store": true
           },
           "file": {
              "type": "attachment",
              "fields": {
                 "content": {
                    "type": "string",
                    "analyzer": "full"
                 },
                 "author": {
                    "type": "string"
                 },
                 "title": {
                    "type": "string",
                    "term_vector": "with_positions_offsets",
                    "analyzer": "full"
                 },
                 "name": {
                    "type": "string"
                 },
                 "date": {
                    "type": "date",
                    "format": "strict_date_optional_time||epoch_millis"
                 },
                 "keywords": {
                    "type": "string"
                 },
                 "content_type": {
                    "type": "string"
                 },
                 "content_length": {
                    "type": "integer"
                 },
                 "language": {
                    "type": "string"
                 }
              }
           },
           "filePermissionInfo": {
              "properties": {
                 "accessControlType": {
                    "type": "string",
                    "store": true
                 },
                 "accountValue": {
                    "type": "string",
                    "store": true
                 },
                 "fileSystemRights": {
                    "type": "string",
                    "store": true
                 },
                 "isInherited": {
                    "type": "string",
                    "store": true
                 }
              }
           },
           "id": {
              "type": "double",
              "store": true
           },
           "lastModifiedDate": {
              "type": "date",
              "store": true,
              "format": "strict_date_optional_time||epoch_millis"
           },
           "otherDetails": {
              "type": "string"
           },
           "title": {
              "type": "string",
              "store": true,
              "term_vector": "with_positions_offsets"
           }
        }
     }
  }
 }
}

My POST query is working fine:

POST /mydocs/_search
{
"query" : {
    "bool" : {
        "must" : [
           
            { "match" : { "filePermissionInfo.accountValue" : "S-1-5-18"}} ,
           { "match":{"otherDetails":"xyz"}},
            { "match":{"file.content":"abc"}}              
           
        ]
    }
}
}

But when I convert it to C#, it does not work. If I remove the File.Content field from the match query, it returns a result set, so I think the problem is with the attachment field. It is Base64 encoded.

var queryResult = client.Search<IndexDocument>(s => s
                            .Index("mydocs")
                            .Query(q => q
                            .Bool(b => b
                            .Must(m =>
                                 m.Match(mt1 => mt1.Field(f1 => f1.DocumentType).Query(queryTerm)) &&
                                 m.Match(mt2 => mt2.Field(f2 => f2.FilePermissionInfo.First().AccountValue).Query(accountName)) &&
                                 m.Match(mt3 => mt3.Field(f3 => f3.OtherDetails).Query(other))
                             ))) );

Can you please help?


(Aj) #22

@dadoonet Can you please look into my issue?