Searching Word/PDF files using the REST High Level Client

I am using the Elasticsearch Java REST High Level Client 6.4 in my project.
I am adding a search option to the Knowledge module. When a knowledge record is created, I index a document with the following properties: Id, Title, Description, State, Article.
Users can also upload the Article as an attachment. When they do, I want the search to look inside those attachments as well when picking the matching knowledge records. How can I achieve this?

Previously my project used an older version of Elasticsearch, with the mapper-attachments plugin handling attachments. In the legacy code I found that the attachment was converted into a base64 string and saved with the other properties in the knowledge document.
In the search code it was searched like any other property.
Now I have upgraded to Elasticsearch 6.4 and the REST High Level Client. How can I do the same thing?
I saw that the mapper-attachments plugin has been replaced by the ingest-attachment plugin, but I couldn't understand how exactly to use it in my use case.
Below is a sample of my previous mapping. After upgrading to the new Elasticsearch, I am no longer allowed to use the type attachment.
{
  "Knowledge" : {
    "_id" : {
      "path" : "Id"
    },
    "properties" : {
      "Id" : {
        "type" : "integer"
      },
      "Title" : {
        "type" : "string"
      },
      "Topic" : {
        "type" : "string"
      },
      "Article" : {
        "type" : "string"
      },
      "State" : {
        "type" : "string"
      },
      "Article Attachment" : {
        "type" : "attachment"
      },
      "Rating" : {
        "type" : "float"
      },
      "Views" : {
        "type" : "integer"
      },
      "Owner" : {
        "type" : "string"
      },
      "Created On" : {
        "type" : "date",
        "format" : "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

The main difference between mapper-attachments and ingest-attachment is that ingest-attachment modifies the _source of the JSON document and then indexes the document.
Mapper-attachments created invisible fields at index time.

Instead of:

"Article Attachment" : {
  "type" : "attachment"
},

You can use something like:

"Article Attachment" : {
  "type" : "text"
},
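
For context, a 6.x-compatible version of the whole mapping might look like the sketch below. It is only an illustration under a few assumptions: the index is named index and uses the _doc type (as in the indexing request in this answer), the string type (no longer accepted in 6.x) is swapped for text throughout, and the _id path setting is dropped because it was removed in Elasticsearch 2.0, so the id would be supplied at index time instead. keyword may suit State or Owner better if you filter on exact values.

```
PUT index
{
  "mappings": {
    "_doc": {
      "properties": {
        "Id":                 { "type": "integer" },
        "Title":              { "type": "text" },
        "Topic":              { "type": "text" },
        "Article":            { "type": "text" },
        "State":              { "type": "text" },
        "Article Attachment": { "type": "text" },
        "Rating":             { "type": "float" },
        "Views":              { "type": "integer" },
        "Owner":              { "type": "text" },
        "Created On":         { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }
      }
    }
  }
}
```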

In your ingest pipeline foo, define an attachment processor that extracts the text from whatever field holds the base64 content (e.g. a field named base64) into the field named Article Attachment, and add a remove processor to remove the base64 field.
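
A minimal sketch of such a pipeline, assuming the ingest-attachment plugin is installed on the ingest nodes and the incoming field is called base64. The attachment processor writes an object (content, content_type and so on) under its target field, which defaults to attachment, so this sketch uses a rename processor to move only the extracted text into Article Attachment and then removes the leftovers. The description text and the exact processor chain are illustrative choices, not the only way to do it.

```
PUT _ingest/pipeline/foo
{
  "description": "Extract attachment text into Article Attachment",
  "processors": [
    { "attachment": { "field": "base64" } },
    { "rename": { "field": "attachment.content", "target_field": "Article Attachment" } },
    { "remove": { "field": "base64" } },
    { "remove": { "field": "attachment" } }
  ]
}
```

As written, the pipeline expects every document sent through it to carry the base64 field; knowledge records without an uploaded attachment could simply be indexed without the pipeline parameter.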

Then just index your documents like:

POST index/_doc?pipeline=foo
{
  "base64": "XYZ12345678"
}

That should do what you expect.
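
Once the extracted text lives in a regular text field, it can be searched like the other properties. A sketch of such a query, assuming the index name index from above and an arbitrary choice of fields to search:

```
GET index/_search
{
  "query": {
    "multi_match": {
      "query": "some search text",
      "fields": ["Title", "Article", "Article Attachment"]
    }
  }
}
```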
