How to use OCR in Elasticsearch ingest attachment plugin?

I was able to make the plugin work with PDFs that contain searchable text. When I give it a PDF or PNG whose text is not searchable, it fails to extract any text from the binary data.
Unfortunately, I couldn't find anything useful in the documentation or on the forums.

Here are the steps I followed to get the plugin up and running (PHP):

I ran this command in the bin directory of Elasticsearch:

    elasticsearch-plugin install ingest-attachment

Next, I created the pipeline:

    $client = ClientBuilder::create()->build();
    $params = [
        'id' => 'attachment',
        'body' => [
            'description' => 'Extract attachment information',
            'processors' => [
                [
                    'attachment' => [
                        'field' => 'data'
                    ]
                ]
            ]
        ],
    ];
    return $client->ingest()->putPipeline($params);

Then I read the file, encoded it in base64, and attached it to an ES document:

    $client = ClientBuilder::create()->build();
    // array_diff() preserves the original keys, so index 2 is normally
    // the first real file after '.' and '..'
    $myfiles = array_diff(scandir('pdf_files'), array('.', '..'));
    $params = [
        'index' => 'candidates',
        'type'  => '_doc',
        'id'    => 'e9AuBXcBC0zZvKKfMaH9',
        'pipeline' => 'attachment',
        'body'  => [
            // The attachment processor expects the binary as base64
            'data' => base64_encode(file_get_contents('./pdf_files/'.$myfiles[2]))
        ]
    ];
    $response = $client->index($params);

When I fetch the document through the Kibana console, I get this response:

    "attachment" : {
      "content_type" : "application/pdf",
      "language" : "lt",
      "content" : "",
      "content_length" : 2
    }

As you can see, the content property is empty. Any ideas on how to make it work?

Sorry, OCR is not supported by the ingest attachment plugin.

OK, thanks. Do you have a suggestion for the best way to implement OCR with Elasticsearch?

You can use FSCrawler. There's a tutorial to help you get started.

FSCrawler worked like a charm: it pushed every file into the specified index.
The problem is that I want to push only the relevant files, each into a specific document within the index. I couldn't find anything in the documentation that would help me do that. Any suggestions?

Not sure I understand the question.

But did you look at the REST service of FSCrawler? It might be what you want.

It's easier to set up a custom pipeline in Python. I have used OCRmyPDF to process the PDFs, then Tika to extract the text. The result can then be bulk indexed into ES.

I have a list of candidates stored in an ES index and would like to attach each PDF file to a specific candidate. How do I go about doing this? Can it be done with FSCrawler at all?

That might be it. But it's unclear to me how I would direct FSCrawler to update specific ES documents using the REST service.

Not really, or at least not directly.

What you could do is follow the suggestion @russellmenezes made.
Another option is to run FSCrawler as a REST service and use its simulate API (see "REST service" in the FSCrawler 2.7-SNAPSHOT documentation).

That way you can get the JSON back and update the candidate document with that content.
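A minimal sketch of that idea in PHP, assuming the decoded simulate response carries the extracted text under `doc.content` (as in the response posted later in this thread). The `resume_content` and `resume_filename` fields and the `buildCandidateUpdate` helper are hypothetical names chosen for illustration; they are not part of FSCrawler or the Elasticsearch client. The helper only builds the parameter array that could be handed to the client's `update()` call.

```php
<?php
/**
 * Build parameters for a partial update of one candidate document
 * from a decoded FSCrawler simulate response.
 * Hypothetical helper: the field names below are assumptions.
 */
function buildCandidateUpdate(array $simulateResponse, string $candidateId): array
{
    // Refuse responses that do not contain extracted text
    if (empty($simulateResponse['ok']) || !isset($simulateResponse['doc']['content'])) {
        throw new InvalidArgumentException('Unexpected FSCrawler response');
    }

    return [
        'index' => 'candidates',
        'id'    => $candidateId,
        'body'  => [
            // "doc" performs a partial update: other candidate fields are kept
            'doc' => [
                'resume_content'  => trim($simulateResponse['doc']['content']),
                'resume_filename' => $simulateResponse['filename'] ?? null,
            ],
        ],
    ];
}

// Example with a shortened, made-up response body
$response = json_decode(
    '{"ok": true, "filename": "cv.pdf", "doc": {"content": "\n Some text \n"}}',
    true
);
$params = buildCandidateUpdate($response, 'e9AuBXcBC0zZvKKfMaH9');
// $client->update($params);  // with an Elasticsearch client built as above
```

Keeping the request building separate from the HTTP calls makes it possible to check the update body before anything is written to the index.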

Or you can also think of it in another way:

  • Create one index for the candidates
  • Create one index for the resumes

Update the candidate document with just a link to the resume.
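As a rough sketch of the two-index variant in PHP: this assumes a regular (non-simulated) upload returns a `url` field of the form `http://host:port/<index>/_doc/<id>`, like the one in the response posted later in this thread, and that the candidate keeps the link in a hypothetical `resume` field. Both helper names are made up for illustration.

```php
<?php
// Sketch only: the "resume" field and both helper names are hypothetical.

/** Split an FSCrawler "url" value into its index name and document id. */
function resumeReferenceFromUrl(string $url): array
{
    $path = parse_url($url, PHP_URL_PATH);
    $segments = is_string($path) ? explode('/', trim($path, '/')) : [];
    // Expect exactly: <index>/_doc/<id>
    if (count($segments) !== 3) {
        throw new InvalidArgumentException("Unexpected document URL: $url");
    }
    return ['index' => $segments[0], 'id' => $segments[2]];
}

/** Build update() parameters that store only the link on the candidate. */
function buildResumeLinkUpdate(string $resumeUrl, string $candidateId): array
{
    return [
        'index' => 'candidates',
        'id'    => $candidateId,
        'body'  => ['doc' => ['resume' => resumeReferenceFromUrl($resumeUrl)]],
    ];
}

$params = buildResumeLinkUpdate(
    'http://127.0.0.1:9200/resumes/_doc/f39614d4716aed76167498ac4945ed7',
    'e9AuBXcBC0zZvKKfMaH9'
);
// $client->update($params);  // with an Elasticsearch client built as above
```

With this layout the resume text lives only in the resumes index, so a search that should return candidates needs a second lookup from the matching resume back to its candidate.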

It depends heavily on the use case: how do you want to search, and what do you want to search for? (I assume candidates.)

I liked the simulate API approach, and I got it to work via curl:

    curl -F "file=@pdf_files/non-searchable-text.pdf" "http://127.0.0.1:8080/fscrawler/_upload?debug=true&simulate=true"

Which gave me this response:

    {
      "ok": true,
      "filename": "non-searchable-text.pdf",
      "url": "http://127.0.0.1:9200/resumes/_doc/f39614d4716aed76167498ac4945ed7",
      "doc": {
        "content": "\n \n\nTABLE OF CONTENTS\n\n \n\nIntroduction 1\nChapter 1: The ABC of Programming 11\nChapter 2: Basic JavaScript Instructions 53\nChapter 3: Functions, Methods & Objects 85\n(i aY-] 0) (=) ae Sam DY -101 |} (0) aioe\" Mole) 0-) 145\nChapter 5: Document Object Model 183\nChapter 6: Events 243\nChapter 7: jQuery 293\nChapter 8: Ajax & JSON 367\nChapter 9: APIs 409\nChapter 10: Error Handling & Debugging 449\nChapter 11: Content Panels 487\nChapter 12: Filtering, Searching & Sorting 527\nChapter 13: Form Enhancement & Validation 567\nTare toy 623\n\n \n\nTry out & download the code in this book\nwww. javascriptbook.com\n\n \n\n \n\n \n\n\n",
        "meta": {
          "date": "2021-02-02T11:44:56.000+00:00",
          "format": "application/pdf; version=1.6"
        },
        "file": {
          "extension": "pdf",
          "content_type": "application/pdf",
          "indexing_date": "2021-02-04T13:49:29.353+00:00",
          "filename": "non-searchable-text.pdf"
        },
        "path": {
          "virtual": "non-searchable-text.pdf",
          "real": "non-searchable-text.pdf"
        }
      }
    }

I tried making it work through PHP like this:

    $curl = curl_init();

    curl_setopt($curl, CURLOPT_URL, 'http://127.0.0.1:8080/fscrawler/_upload?debug=true&simulate=true');
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_POST, 1);
    $args['file'] = '@/pdf_files/non-searchable-text.pdf';
    curl_setopt($curl, CURLOPT_POSTFIELDS, $args);

    $result = curl_exec($curl);
    if (curl_errno($curl)) {
        echo 'Error:' . curl_error($curl);
    } else {
        echo '<pre> response : ', print_r($result, true) ,'</pre>';
    }
    curl_close($curl);

But this was the response:

[Image: curl response]

Any idea what I'm doing wrong?

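The problem was in the request itself, and it's working now. (Since PHP 5.6 the @ prefix in CURLOPT_POSTFIELDS is no longer treated as a file upload by default, and PHP 7 dropped it entirely, so the file has to be passed as a CURLFile.) Thank you for your help, @dadoonet. FSCrawler is an amazing solution for extracting content from files and indexing it into Elasticsearch. I don't know why there is no official ES support for it yet, but it certainly deserves it.

Here is the PHP curl request for the curious: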

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

    // Wrap the file in a CURLFile so it is sent as a multipart upload
    $mimetype = mime_content_type('FULLPATH GOES HERE');
    $curlFile = new CURLFile('FULLPATH GOES HERE', $mimetype);

    $postFields = array('id' => 'PDFTESTESEARCH', 'file' => $curlFile);

    curl_setopt($curl, CURLOPT_URL, 'http://127.0.0.1:8080/fscrawler/_upload?debug=true&simulate=true');
    curl_setopt($curl, CURLOPT_POST, 1);
    curl_setopt($curl, CURLOPT_POSTFIELDS, $postFields);

    $result = curl_exec($curl);

    if (curl_errno($curl)) {
        echo 'Error:' . curl_error($curl);
    } else {
        // Decode the JSON returned by FSCrawler and print it
        echo '<pre> response : ', print_r(json_decode($result), true), '</pre>';
        var_dump(json_decode($result));
    }
    curl_close($curl);

Thanks for closing the loop, sharing your solution, and for the kind words. :wink: :hugs:
