Search - Attachment - Content


(Paco B) #1

Hello,

I want to search the content of an attachment field, but the search returns no documents. (Searches on other, non-attachment fields work properly.) I have the following document in Elasticsearch:

GET /example/fs/LONG_XML_7

{
   "_index": "example",
   "_type": "fs",
   "_id": "LONG_XML_7",
   "_version": 1,
   "found": true,
   "_source": {
      "content": "PD94bWwgdmVyc2lvbj0iMS4wIj8+PGNhdGFsb2c+PGJvb2sgaWQ9ImJrMTAxIj48YXV0aG9yPlBhY288L2F1dGhvcj48dGl0bGU+U3ByaW5nLUJvb3Q8L3RpdGxlPjwvYm9vaz48Ym9vayBpZD0iYmsxMDEiPjxhdXRob3I+QXV0aG9yIFRlc3Q8L2F1dGhvcj48dGl0bGU+VGl0bGUgVGVzdDwvdGl0bGU+PC9ib29rPjxib29rIGlkPSJiazEwMSI+PGF1dGhvcj5QYWNvPC9hdXRob3I+PHRpdGxlPkVsYXN0aWMgU2VhcmNoPC90aXRsZT48L2Jvb2s+PGJvb2sgaWQ9ImJrMTAxIj48YXV0aG9yPk5vYm9keTwvYXV0aG9yPjx0aXRsZT5Ob3RoaW5nPC90aXRsZT48L2Jvb2s+",
      "upload_date": "2016-01-16T22:56:37.419000",
      "md5": "7e8c66a0a97e1fafaef1fc6ce36f4f28"
   }
}

Then I search the content field (the attachment) like this:

GET /example/fs/_search
{
  "query": {
    "match": {
      "content": "Nothing"
    }
  }
}

I don't get any documents back :frowning:

My mapping is the following:

{
   "example": {
      "mappings": {
         "fs": {
            "properties": {
               "chunkSize": {
                  "type": "integer",
                  "store": true
               },
               "content": {
                  "type": "attachment",
                  "path": "full",
                  "fields": {
                     "content": {
                        "type": "string"
                     },
                     "author": {
                        "type": "string"
                     },
                     "title": {
                        "type": "string"
                     },
                     "name": {
                        "type": "string"
                     },
                     "date": {
                        "type": "date",
                        "format": "dateOptionalTime"
                     },
                     "keywords": {
                        "type": "string"
                     },
                     "content_type": {
                        "type": "string"
                     },
                     "content_length": {
                        "type": "integer"
                     },
                     "language": {
                        "type": "string"
                     }
                  }
               },
               "data": {
                  "type": "attachment",
                  "path": "full",
                  "fields": {
                     "data": {
                        "type": "string"
                     },
                     "author": {
                        "type": "string"
                     },
                     "title": {
                        "type": "string"
                     },
                     "name": {
                        "type": "string"
                     },
                     "date": {
                        "type": "date",
                        "format": "dateOptionalTime"
                     },
                     "keywords": {
                        "type": "string"
                     },
                     "content_type": {
                        "type": "string"
                     },
                     "content_length": {
                        "type": "integer"
                     },
                     "language": {
                        "type": "string"
                     }
                  }
               },
               "files_id": {
                  "type": "string",
                  "store": true
               },
               "length": {
                  "type": "integer",
                  "store": true
               },
               "md5": {
                  "type": "string",
                  "store": true
               },
               "n": {
                  "type": "integer",
                  "store": true
               },
               "uploadDate": {
                  "type": "date",
                  "store": true,
                  "format": "dateOptionalTime"
               },
               "upload_date": {
                  "type": "date",
                  "format": "dateOptionalTime"
               }
            }
         }
      }
   }
}

What is wrong?

Many thanks!

Regards,
Paco.


Recommendation about storing too large XML as attachment to be searched
(Mark Walkom) #2

Looks like your content field is just binary, i.e. it doesn't contain the actual text in a readable form.


(Paco B) #3

Hello Warkolm!

I don't think so. If you run my encoded field through any web site that decodes Base64 to a string, you can see that it is not binary but a plain string.

My base64: PD94bWwgdmVyc2lvbj0iMS4wIj8+PGNhdGFsb2c+PGJvb2sgaWQ9ImJrMTAxIj48YXV0aG9yPlBhY288L2F1dGhvcj48dGl0bGU+U3ByaW5nLUJvb3Q8L3RpdGxlPjwvYm9vaz48Ym9vayBpZD0iYmsxMDEiPjxhdXRob3I+QXV0aG9yIFRlc3Q8L2F1dGhvcj48dGl0bGU+VGl0bGUgVGVzdDwvdGl0bGU+PC9ib29rPjxib29rIGlkPSJiazEwMSI+PGF1dGhvcj5QYWNvPC9hdXRob3I+PHRpdGxlPkVsYXN0aWMgU2VhcmNoPC90aXRsZT48L2Jvb2s+PGJvb2sgaWQ9ImJrMTAxIj48YXV0aG9yPk5vYm9keTwvYXV0aG9yPjx0aXRsZT5Ob3RoaW5nPC90aXRsZT48L2Jvb2s+

The decoded string:
<?xml version="1.0"?><catalog><book id="bk101"><author>Paco</author><title>Spring-Boot</title></book><book id="bk101"><author>Author Test</author><title>Title Test</title></book><book id="bk101"><author>Paco</author><title>Elastic Search</title></book><book id="bk101"><author>Nobody</author><title>Nothing</title></book>
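
The same check can be done locally instead of on a web site. This is a minimal Python sketch using only the standard library's base64 module; the variable names are just for illustration:

```python
import base64

# Base64 value copied from the "content" field of the document above.
encoded = "PD94bWwgdmVyc2lvbj0iMS4wIj8+PGNhdGFsb2c+PGJvb2sgaWQ9ImJrMTAxIj48YXV0aG9yPlBhY288L2F1dGhvcj48dGl0bGU+U3ByaW5nLUJvb3Q8L3RpdGxlPjwvYm9vaz48Ym9vayBpZD0iYmsxMDEiPjxhdXRob3I+QXV0aG9yIFRlc3Q8L2F1dGhvcj48dGl0bGU+VGl0bGUgVGVzdDwvdGl0bGU+PC9ib29rPjxib29rIGlkPSJiazEwMSI+PGF1dGhvcj5QYWNvPC9hdXRob3I+PHRpdGxlPkVsYXN0aWMgU2VhcmNoPC90aXRsZT48L2Jvb2s+PGJvb2sgaWQ9ImJrMTAxIj48YXV0aG9yPk5vYm9keTwvYXV0aG9yPjx0aXRsZT5Ob3RoaW5nPC90aXRsZT48L2Jvb2s+"

# Decode the Base64 payload and interpret the bytes as UTF-8 text.
decoded = base64.b64decode(encoded).decode("utf-8")

print(decoded)
# Check that the search term is really part of the decoded payload.
print("Nothing" in decoded)
```

The decoded text contains `<title>Nothing</title>`, which is the value the match query is looking for.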

Could it be some other problem?

Many thanks.

Regards,
Paco.


(Mark Walkom) #4

ES doesn't translate that base64 into proper text though, it will return exactly what is fed into it.


(Paco B) #5

Hello,

I think it is better if I explain my case:

I have XML values in MongoDB that are too large, and the MongoDB team's recommendation was to use GridFS (MongoDB's mechanism for storing data in binary format) together with ES to search this kind of field.

I used the "mongo-connector" plugin to get the MongoDB data into ES. I downloaded the elasticsearch-mapper-attachments plugin to support the GridFS implementation, and I understood that ES, when it receives this kind of data, can index it and search on any string in the content.

So, I store very large XML documents in MongoDB using GridFS so that ES can index and search them. Is it impossible to do text search on this kind of field using the ES plugins I mentioned above?

Thanks again :slight_smile:


(David Pilato) #6

Actually your example looks correct. And if Content is part of your binary document, then it should be extracted.

I'm just wondering how your field is indexed.

Maybe you could store the content.content field, then run a match_all query and ask for that field.

So basically do something like:

DELETE /test
PUT /test
PUT /test/type/_mapping
{
  "type": {
    "properties": {
      "file": {
        "type": "attachment",
        "fields": {
          "content": {
            "type": "string",
            "store": true
          }
        }
      }
    }
  }
}
PUT /test/type/1?refresh=true
{
  "file": "PD94bWwgdmVyc2lvbj0iMS4wIj8+PGNhdGFsb2c+PGJvb2sgaWQ9ImJrMTAxIj48YXV0aG9yPlBhY288L2F1dGhvcj48dGl0bGU+U3ByaW5nLUJvb3Q8L3RpdGxlPjwvYm9vaz48Ym9vayBpZD0iYmsxMDEiPjxhdXRob3I+QXV0aG9yIFRlc3Q8L2F1dGhvcj48dGl0bGU+VGl0bGUgVGVzdDwvdGl0bGU+PC9ib29rPjxib29rIGlkPSJiazEwMSI+PGF1dGhvcj5QYWNvPC9hdXRob3I+PHRpdGxlPkVsYXN0aWMgU2VhcmNoPC90aXRsZT48L2Jvb2s+PGJvb2sgaWQ9ImJrMTAxIj48YXV0aG9yPk5vYm9keTwvYXV0aG9yPjx0aXRsZT5Ob3RoaW5nPC90aXRsZT48L2Jvb2s+"
}
GET /test/type/_search
{
  "fields": [ "file.content" ]
}

I'm wondering if the name content you are using as the field name could conflict with the internal field content.content. The test I propose uses file as the field name instead.

Let me know how it goes.


(Paco B) #7

Many thanks for your reply :slight_smile:

I get the following result:

{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test",
            "_type": "type",
            "_id": "1",
            "_score": 1
         }
      ]
   }
}

But what about searching the content? If I run a search like this, I get no results, even though the Base64 contains the value :frowning:

GET /test/type/_search
{
  "fields": [ "file.content" ],
  "query": {
    "match": {
      "file.content": "Nothing"
    }
  }
}

Result:

{
   "took": 4,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

Many thanks!

Regards,
Paco.


(David Pilato) #8

Please try to format your examples with preformatted text so it will be easier to read.

This is a strange issue indeed. I thought it was caused by the XML formatting.

If I run this:

DELETE /test
PUT /test
PUT /test/type/_mapping
{
  "type": {
    "properties": {
      "file": {
        "type": "string",
        "store": true
      }
    }
  }
}
PUT /test/type/1?refresh=true
{
  "file": "<?xml version=\"1.0\"?><catalog><book id=\"bk101\"><author>Paco</author><title>Spring-Boot</title></book><book id=\"bk101\"><author>Author Test</author><title>Title Test</title></book><book id=\"bk101\"><author>Paco</author><title>Elastic Search</title></book><book id=\"bk101\"><author>Nobody</author><title>Nothing</title></book>"
}
GET /test/type/_search
{
  "fields": [ "file" ]
}

Everything is fine.

If I run it with your content, it does not work, but the mapper attachments plugin does not complain...

I think this is caused by this: https://github.com/elastic/elasticsearch-mapper-attachments/issues/163

We removed support for many formats. And actually XML does not need to be "extracted" as it's not a binary format. I totally forgot about this and that's why it took me some time to figure this out.

I think that the Tika parser in that case simply returns no content at all.

I opened https://github.com/elastic/elasticsearch/issues/16189 to check what we should do in such a case. I think maybe we should warn in the logs or reject the document.
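
A possible workaround, as a sketch only: since XML is already text, the Base64 value could be decoded before indexing and sent to a plain string field like the one in the test above. The Python below uses only the standard library. The helper name `decode_attachment` and the shortened payload are made up for illustration, the payload is assumed to be UTF-8, and where this step would hook into mongo-connector is not covered here.

```python
import base64
import json


def decode_attachment(encoded: str) -> str:
    """Hypothetical helper: turn a Base64 payload back into the XML text."""
    return base64.b64decode(encoded).decode("utf-8")


# Shortened stand-in for the GridFS payload (assumption: UTF-8 encoded XML).
encoded = base64.b64encode(
    b"<book id=\"bk101\"><author>Nobody</author><title>Nothing</title></book>"
).decode("ascii")

# Request body for a plain string field named "file", as in the test mapping above.
body = json.dumps({"file": decode_attachment(encoded)})
print(body)
```

The printed body could then be sent with PUT /test/type/1 against the plain string mapping, so the attachment type and Tika are not involved at all.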


(Paco B) #9

Many thanks for your help @dadoonet :slight_smile:

Yes, it is strange. The Base64 string comes from:

MongoDB (GridFS field) -> mongo-connector plugin -> attachment plugin -> ES

For us, it is impossible to keep the XML as a string in MongoDB because it is too large (> 16 MB ***), so we need to store it as binary (using GridFS) and then use ES to run indexed searches on this XML.

Do you know another way to work with XML that is in Base64? If I can help you implement this converter in your API, let me know :wink:

Many thanks again!

Regards,
Paco.

*** MongoDB has a maximum document size of 16 MB.

