Index PDF in ES

Can I create an index for a PDF file?
How?

You'll want the ingest-attachment plugin.

Have a look at "Implementing Ingest Attachment Processor Plugin" for an example; I think it will help you.

If you have PDF documents in a folder, you can use the FSCrawler project as well.

Thanks, both of you.

Can I give the path of a file to the ingest plugin?

No.

You need to push the data to ingest yourself. Ingest cannot pull data from a source.
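To illustrate what "pushing" means here, below is a minimal Python sketch, not a definitive implementation. It reads the file on the client side, base64-encodes the bytes, and builds the index request that goes through the pipeline. It assumes a pipeline named "attachment" whose attachment processor reads a field called "data", a cluster at http://localhost:9200, and the index, type, and id names used elsewhere in this thread. The function names and the example.pdf file name are hypothetical.

```python
import base64
import json
import urllib.request


def build_attachment_request(pdf_bytes, field="data",
                             es_url="http://localhost:9200",
                             index="my_index", doc_type="my_type",
                             doc_id="my_id", pipeline="attachment"):
    """Build (but do not send) a PUT request for the ingest pipeline.

    The attachment processor expects the field to contain the
    base64-encoded file content itself, not a path to the file.
    All default values are assumptions for illustration.
    """
    body = {field: base64.b64encode(pdf_bytes).decode("ascii")}
    url = f"{es_url}/{index}/{doc_type}/{doc_id}?pipeline={pipeline}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def index_pdf(path):
    """Read a local PDF and push it to Elasticsearch.

    Needs a running cluster with the ingest-attachment plugin installed
    and the pipeline already created.
    """
    with open(path, "rb") as f:
        request = build_attachment_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (hypothetical file name):
# print(index_pdf("example.pdf"))
```

The point of the sketch is that the client opens the file and sends its content; Elasticsearch never sees the path.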

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "path"
      }
    }
  ]
}
PUT my_index/my_type/my_id?pipeline=attachment
{
  "path": { "type":"{path.home}/data/base64.bin" }
}

Can I use a path like this?

Is there any other plugin I can use with ingest?

No you can't.

But as I already mentioned, you can use the FSCrawler project for that.

FSCrawler is not working, and it is not even showing an error.
The command fscrawler job_name has been stuck for an hour on the CLI.

I am using Windows 10 and FSCrawler 2.2.
Please help me.

Yes. It's because of this:

Use the latest SNAPSHOT from here: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler/2.3-SNAPSHOT/

Thanks, now I am able to create an index in Elasticsearch.

Can I also index the images that are inside a PDF?

Do you mean OCR?

If so, this (https://github.com/dadoonet/fscrawler#ocr-integration-using-tika-and-tesseract) is supposed to work. But someone reported that it might actually be buggy:

I installed Tesseract, but when I execute any of the .exe files inside it, nothing runs.

I really don't know, I'm sorry. I need to find some spare time to play with it, but I haven't had a couple of free hours yet.

Thanks for helping me :)
