I was able to make the ingest-attachment plugin work with PDFs that contain searchable text. However, when I give it a PDF or PNG with non-searchable text, it fails to extract the text from the binary data.
Unfortunately, I couldn't find anything useful in the documentation or on the forums.
Here are the steps I followed to get the plugin up and running (PHP):
I executed this command in the bin directory of Elasticsearch:
elasticsearch-plugin install ingest-attachment
Next, I created the pipeline:
$client = ClientBuilder::create()->build();
$params = [
    'id' => 'attachment',
    'body' => [
        'description' => 'Extract attachment information',
        'processors' => [
            [
                'attachment' => [
                    'field' => 'data'
                ]
            ]
        ]
    ],
];
return $client->ingest()->putPipeline($params);
Then I read the file, encoded it in Base64, and attached it to an ES document:
$client = ClientBuilder::create()->build();
$myfiles = array_diff(scandir('pdf_files'), array('.', '..'));
$params = [
    'index' => 'candidates',
    'type' => '_doc',
    'id' => 'e9AuBXcBC0zZvKKfMaH9',
    'pipeline' => 'attachment',
    'body' => [
        'data' => base64_encode(file_get_contents('./pdf_files/'.$myfiles[2]))
    ]
];
$response = $client->index($params);
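The encoding step above can also be isolated so it can be checked without a running cluster. This is only a minimal sketch: `buildAttachmentParams` and `listFiles` are hypothetical helper names, not part of elasticsearch-php, and the sketch assumes the same `attachment` pipeline id and `data` field used above.

```php
<?php
// Hypothetical helper (not part of elasticsearch-php): builds the index
// request for a single file so the encoding step can be checked in isolation.
function buildAttachmentParams(string $path, string $index, string $id): array
{
    if (!is_readable($path)) {
        throw new RuntimeException("Cannot read file: {$path}");
    }

    return [
        'index'    => $index,
        'type'     => '_doc',        // kept from the request above
        'id'       => $id,
        'pipeline' => 'attachment',  // pipeline id created earlier
        'body'     => [
            // The attachment processor reads Base64 from the 'data' field.
            'data' => base64_encode(file_get_contents($path)),
        ],
    ];
}

// Hypothetical helper: lists the entries of a directory without '.' and '..'.
// array_values() reindexes from 0, so the first file no longer depends on
// where '.' and '..' happened to sit in the scandir() result.
function listFiles(string $dir): array
{
    return array_values(array_diff(scandir($dir), ['.', '..']));
}
```

With these helpers, the request above would become `$client->index(buildAttachmentParams('./pdf_files/' . $files[0], 'candidates', 'e9AuBXcBC0zZvKKfMaH9'))`, where `$files = listFiles('pdf_files')`.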
When I fetch the document through the Kibana console, I get this response:
"attachment" : {
    "content_type" : "application/pdf",
    "language" : "lt",
    "content" : "",
    "content_length" : 2
}
As you can see, the content property is empty. Any ideas on how to make it work?
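The empty-content case can also be detected from PHP instead of the Kibana console. A minimal sketch, assuming the document's `_source` is available as a PHP array (for example from `$client->get(...)`); `hasExtractedText` is a hypothetical helper name, and the sample array simply mirrors the response fragment shown in this post.

```php
<?php
// Hypothetical helper: returns true when the attachment processor produced
// non-whitespace text for the document.
function hasExtractedText(array $source): bool
{
    $content = $source['attachment']['content'] ?? '';
    return is_string($content) && trim($content) !== '';
}

// The response fragment from this post, rewritten as a PHP array.
$source = [
    'attachment' => [
        'content_type'   => 'application/pdf',
        'language'       => 'lt',
        'content'        => '',
        'content_length' => 2,
    ],
];

var_dump(hasExtractedText($source)); // bool(false)
```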