Extracting full-text from an attachment field


(silversun) #1

I am new to Elasticsearch. I'm working on an application that integrates Drupal with Elasticsearch 2.4.

Our current ES index has a field called "file_resource_url" that holds the full URL of a PDF/DOC file. However, the text inside those PDF/DOC files is currently not indexed, so the attachments are not searchable.

I would like users to be able to search against the full text of the documents specified in the "file_resource_url" field.

Please provide guidance on how to go about this.

Also, note that the app with ES is currently deployed on a Cloud Foundry instance, where my ability to install anything new is quite limited.


(Christoph) #2

That's a shame, because there is the Ingest Attachment Processor Plugin (or the Mapper Attachment Plugin for pre-6.0 indices) that should be well suited for the job. I would check whether you can install it anyway.

If not, maybe it's time to look into our Elastic Cloud offering. I'm pretty sure the plugin is available there.


(silversun) #3

Thanks for your response! I can reach out to them about the Mapper Attachment plugin. I have installed that plugin in my local environment, but I am unable to update the field type to attachment.

Here is how my index currently looks -

{
"took": 13,
"timed_out": false,
"_shards": {
"total": 14,
"successful": 14,
"failed": 0
},
"hits": {
"total": 172,
"max_score": 1,
"hits": [
{
"_index": "appindex",
"_type": "appindex",
"_id": "8",
"_score": 1,
"_source": {
"id": 8,
"body:format": "filtered_html",
"body:value": "text text text",
"changed": "1506460443",
 ...
"field_attachment_resource:file:fid": "12",
"field_attachment_resource:file:mime": "application/pdf",
"field_attachment_resource:file:name": "SampleDocument.pdf",

This is the current mapping of "field_attachment_resource":

{
  "appindex" : {
    "mappings" : {
      "appindex" : {
        "properties" : {
          "field_attachment_resource:file:fid" : {
            "type" : "string"
          },
          "field_attachment_resource:file:mime" : {
            "type" : "string"
          },
          "field_attachment_resource:file:name" : {
            "type" : "string"
          },

I tried to update the type of "field_attachment_resource:file:name" from string to "attachment" -

curl -X PUT 'http://localhost:9200/appindex/appindex/_mapping?ignore_conflicts=true' -d \
 '{
   "field_attachment_resource": {
     "properties": {
       "name": {
         "type": "attachment"
       }
     }
   }
}'

It failed with this error -

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"Root mapping definition has unsupported parameters: [field_attachment_resource : {properties={name={type=attachment}}}]"}],"type":"mapper_parsing_exception","reason":"Root mapping definition has unsupported parameters: [field_attachment_resource : {properties={name={type=attachment}}}]"},"status":400}

Please advise on what I might be doing wrong. Thanks for your help!


(Christoph) #4

At first glance this looks like the plugin didn't get installed properly. Have you restarted all nodes? Can you check if the hello-world example from https://www.elastic.co/guide/en/elasticsearch/plugins/5.6/mapper-attachments-helloworld.html (or your ES version) works?
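One way to check the "installed on all nodes" question is to look at the plugin list each node reports through the nodes info API (`GET /_nodes/plugins`). The sketch below is only an illustration: it assumes the response has already been fetched and parsed into a dict, the helper name `nodes_missing_plugin` is made up, and the sample response is a hypothetical, heavily trimmed stand-in for real output.

```python
def nodes_missing_plugin(nodes_info, plugin_name="mapper-attachments"):
    """Return the names of nodes whose reported plugin list lacks plugin_name."""
    missing = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        installed = {plugin.get("name") for plugin in node.get("plugins", [])}
        if plugin_name not in installed:
            # Fall back to the node id when the node has no name field.
            missing.append(node.get("name", node_id))
    return sorted(missing)


# Hypothetical, trimmed response used for illustration only.
sample = {
    "nodes": {
        "id-1": {"name": "node-a", "plugins": [{"name": "mapper-attachments"}]},
        "id-2": {"name": "node-b", "plugins": []},
    }
}

print(nodes_missing_plugin(sample))
```

An empty list would suggest every node reports the plugin; any name in the list is a node worth restarting or reinstalling on.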


(silversun) #5

Sure, let me try that real quick.


(silversun) #6

I just tried the Hello World example for Elasticsearch 2.4. It looks like it ran successfully. Please see below.

curl -X POST "localhost:9200/trying-out-mapper-attachments/person/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "ipsum"
    }
  }
}'

{"took":65,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.095891505,"hits":[{"_index":"trying-out-mapper-attachments","_type":"person","_id":"1","_score":0.095891505,"_source":
{
"cv": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
}]}}
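To see why a search for "ipsum" matches this document, it can help to decode the "cv" value from the response above. This is just a quick local check with Python's standard library; it shows that an attachment field holds the base64-encoded file content (here a tiny RTF document), not a file name or URL.

```python
import base64

# The "cv" value copied from the search response above.
cv = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode the base64 payload back into the original RTF text.
decoded = base64.b64decode(cv).decode("ascii")

print(repr(decoded))
```

The decoded text contains "Lorem ipsum dolor sit amet", which is what the plugin extracted and indexed.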


(silversun) #7

@cbuescher Since the Hello World example appears to work (response copied above), I wonder whether my curl command to update the mapping of an existing field is wrong. Could you please advise on that?


(Christoph) #8

From the docs it looks like the right incantation is

PUT index/_mapping/type

So I guess in your case it might be appindex/_mapping/appindex (assuming appindex is also a type, which is a bit confusing)
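As a rough sketch of that request shape, the snippet below builds the path and JSON body for a put-mapping call. It only constructs strings and sends nothing. The helper name `put_mapping_request` and the field name "file_content" are hypothetical, and the field is assumed not to exist in the mapping yet.

```python
import json


def put_mapping_request(index, doc_type, field, field_type):
    """Build the path and body for PUT {index}/_mapping/{doc_type}."""
    path = "/{}/_mapping/{}".format(index, doc_type)
    # The body is wrapped in the type name, with the field under "properties".
    body = {doc_type: {"properties": {field: {"type": field_type}}}}
    return path, json.dumps(body)


# "file_content" is a hypothetical field name, assumed not to be mapped yet.
path, body = put_mapping_request("appindex", "appindex", "file_content", "attachment")
print("PUT", path)
print(body)
```

The printed body can be passed to curl with -d against the printed path.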


(silversun) #9

This is what I have used to update mapping -

curl -X PUT 'http://localhost:9200/appindex/appindex/_mapping?ignore_conflicts=true' -d \
'{
  "appindex": {
    "properties": {
      "field_attachment_resource:file:name": {
        "type": "attachment"
      }
    }
  }
}'

It gives me this error -

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [field_attachment_resource:file:name] of different type, current_type [string], merged_type [attachment]"}],"type":"illegal_argument_exception","reason":"mapper [field_attachment_resource:file:name] of different type, current_type [string], merged_type [attachment]"},"status":400}


(silversun) #10

@cbuescher

I got the same error for appindex/_mapping/appindex -

{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [field_attachment_resource:file:name] of different type, current_type [string], merged_type [attachment]"}],"type":"illegal_argument_exception","reason":"mapper [field_attachment_resource:file:name] of different type, current_type [string], merged_type [attachment]"},"status":400}

This was the PUT command I used -

curl -X PUT 'http://localhost:9200/appindex/_mapping/appindex?ignore_conflicts=true' -d \
'{
  "appindex": {
    "properties": {
      "field_attachment_resource:file:name": {
        "type": "attachment"
      }
    }
  }
}'

Please advise.


(Christoph) #11

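This is a different kind of error now. Note that before you reported you get

"Root mapping definition has unsupported parameters: [field_attachment_resource : {properties={name={type=attachment}}}]"

while now the error message says:

"mapper [field_attachment_resource:file:name] of different type, current_type [string], merged_type [attachment]"

This means the field field_attachment_resource:file:name was already defined, which makes sense because you mapped it as "string" earlier. You either need to use a different field name now or create the mapping with the "attachment" type to begin with.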


(silversun) #12

Could you please advise me on how to go about changing the type of an existing field? That error is the result of an attempt to change the mapping type from string to attachment.


(Christoph) #13

Unfortunately, you can't change the type of an existing field after you have indexed something. You will have to put the correct mapping in place from the start.


(silversun) #14

Oh, really? That is highly disappointing. So, we cannot update a mapping once indexing has occurred? That is so inflexible of Elasticsearch.


(Christoph) #15

Sorry to disappoint here, but this is a fundamental principle in Lucene, the underlying search library. What would otherwise happen to fields in documents that are already indexed? Should they be dropped? Overwritten? Ignored? Changing them would be the best option, but this requires re-indexing, which we have a convenient API for. Please take a look at that.


(silversun) #16

@cbuescher

I can certainly re-index it using the re-index API. https://www.elastic.co/guide/en/elasticsearch/reference/2.4/docs-reindex.html

But I am confused. I can create a new index, call it new-index, and re-index my data from old-index to new-index. Where in this process do I change the field type from "string" to "attachment"? Please advise.


(Christoph) #17

From that page:

Reindex does not attempt to set up the destination index. It does not copy the settings of the source index. You should set up the destination index prior to running a _reindex action, including setting up mappings, shard counts, replicas, etc.
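In other words, the order is: create the destination index with the mapping you want, then run _reindex. The sketch below only builds the two requests in that order and sends nothing. The helper name `reindex_plan` and the field name "file_content" are hypothetical, and "newindex-attachments" is just an example destination name.

```python
import json


def reindex_plan(source_index, dest_index, doc_type, properties):
    """Return the (method, path, body) requests in the order they should run."""
    # Step 1: create the destination index with its mapping up front.
    create_body = {"mappings": {doc_type: {"properties": properties}}}
    # Step 2: copy the documents from the source into the prepared destination.
    reindex_body = {"source": {"index": source_index}, "dest": {"index": dest_index}}
    return [
        ("PUT", "/" + dest_index, json.dumps(create_body)),
        ("POST", "/_reindex", json.dumps(reindex_body)),
    ]


# "file_content" is a hypothetical new field; it is not part of the old index.
plan = reindex_plan(
    "appindex",
    "newindex-attachments",
    "appindex",
    {"file_content": {"type": "attachment"}},
)
for method, path, body in plan:
    print(method, path, body)
```

Settings such as shard and replica counts would go into the first body as well, since _reindex does not copy them.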


(silversun) #18

@cbuescher

I created a new index, created a mapping with the same field name as in the old index but of type "attachment", and then ran the reindex API. I got this error -

{  
         "index":"newindex-attachments",
         "type":"appindex",
         "id":"119",
         "cause":{  
            "type":"mapper_parsing_exception",
            "reason":"failed to parse",
            "caused_by":{  
               "type":"json_parse_exception",
               "reason":"Failed to decode VALUE_STRING as base64 (MIME-NO-LINEFEEDS): Illegal character '.' (code 0x2e) in base64 content\n at [Source: org.elasticsearch.common.io.stream.InputStreamStreamInput@50ffdf13; line: 1, column: 748]"
            }
         },
         "status":400
      },

Since the type of new field for the destination index is "attachment", it is looking for base64 text from the source index field when copying it over. I think that is why it is erroring out. Please advise how to go about this.
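That reading can be checked locally with a small sketch: a plain file name such as "SampleDocument.pdf" is not valid base64 because of the '.' character, whereas the "cv" value from the Hello World example is. The helper name `is_valid_base64` is made up for this illustration, and Python's strict decoder is only an approximation of the check the server performs.

```python
import base64
import binascii


def is_valid_base64(value):
    """Return True if value decodes as strict base64, False otherwise."""
    try:
        base64.b64decode(value, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


# A plain file name (contains '.') versus real base64 content.
print(is_valid_base64("SampleDocument.pdf"))
print(is_valid_base64("e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="))
```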


(Christoph) #19

This is exactly the reason why we don't allow changing a field's mapping on an existing index. You have documents in your old index where field "fooBar" is a string. Now you want field "fooBar" to be an attachment. You will need to decide what to do with the documents that already have strings in there (and they contain e.g. dots and other characters). Do you want to delete them? Do that either on the source index before reindexing, or filter them out from the reindex action. Do you want to replace the old value with something else? Then re-index won't do all the work for you; you will need to implement the replacement logic in your own client-side script.
Maybe in this case it's easiest to introduce a new field for the attachment and rewrite your application to use that? Lots of options, but it depends on what you want to do.
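A minimal sketch of the "new field" route, under some loudly stated assumptions: the new field is called "file_content" (a hypothetical name) and is already mapped as "attachment", and the file bytes have been fetched separately, for example from the URL in "file_resource_url". The code only base64-encodes the bytes and builds a partial-update request; it sends nothing, and the helper name `attachment_update` is made up.

```python
import base64
import json


def attachment_update(index, doc_type, doc_id, field, file_bytes):
    """Build the path and body of a partial update that sets an attachment field."""
    # Attachment fields expect the file content as base64 text.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    path = "/{}/{}/{}/_update".format(index, doc_type, doc_id)
    body = json.dumps({"doc": {field: encoded}})
    return path, body


# Stand-in bytes; a real script would read the PDF/DOC content here instead.
path, body = attachment_update("appindex", "appindex", "8", "file_content", b"%PDF-1.4 example")
print("POST", path)
print(body)
```

Keeping the original string fields untouched means existing queries continue to work while the application is switched over to the new field.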

