Index PDF in ES

wolfghost · March 22, 2017, 6:34am

Can I create an index for a PDF file?
How?

shanec · March 22, 2017, 6:42am

You'll want the ingest-attachment plugin.

Have a look at "Implementing Ingest Attachment Processor Plugin" for an example; I think it will help you.

dadoonet · March 22, 2017, 7:05am

If you have PDF docs in a folder, you can use the FSCrawler project as well.

wolfghost · March 22, 2017, 10:39am

Thanks, both of you.

Can I give the path of a file to the ingest plugin?

dadoonet · March 22, 2017, 12:30pm

No.

You need to push the data to ingest. Ingest cannot pull data from a source.
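As a rough illustration of what "pushing" could look like from a client, the sketch below reads a PDF locally, base64-encodes it, and sends the encoded content in the document body. The field name `data`, the helper names `build_attachment_doc` and `index_pdf`, and the use of Python's standard library are assumptions for illustration only; the field has to match whatever `field` the attachment processor in your pipeline is configured with.

```python
import base64
import json
import urllib.request


def build_attachment_doc(pdf_bytes: bytes, field: str = "data") -> dict:
    """Return a document whose `field` holds the base64-encoded file content.

    `field` is an assumption: it must match the "field" setting of the
    attachment processor in the ingest pipeline.
    """
    return {field: base64.b64encode(pdf_bytes).decode("ascii")}


def index_pdf(path: str, url: str) -> None:
    """Read the file at `path` on the client side and PUT it to `url`.

    `url` is assumed to be a full index request URL that already carries
    the `?pipeline=...` query parameter.
    """
    with open(path, "rb") as f:
        doc = build_attachment_doc(f.read())
    req = urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)
```

A call might look like `index_pdf("report.pdf", "http://localhost:9200/my_index/my_type/my_id?pipeline=attachment")`, where the file name and the local node address are hypothetical. The point is only that the client reads the file and ships its content; the node never opens a path itself.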

wolfghost · March 22, 2017, 12:48pm

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "path"
      }
    }
  ]
}
PUT my_index/my_type/my_id?pipeline=attachment
{
  "path": { "type":"{path.home}/data/base64.bin" }
}

Can I use a path?

Is there any other plugin I can use with ingest?

dadoonet · March 22, 2017, 2:19pm

No, you can't.

But I already mentioned the FSCrawler project.

wolfghost · March 23, 2017, 7:45am

FSCrawler is not working, and it does not even show an error.
The command fscrawler job_name has been stuck for an hour on the CLI.

I am using Windows 10 and FSCrawler 2.2.
Please help me.

dadoonet · March 23, 2017, 10:20am

Yes. It's because of this:

Use the latest SNAPSHOT from here: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler/2.3-SNAPSHOT/

wolfghost · March 23, 2017, 12:59pm

Thanks, now I am able to create an index in Elasticsearch.

Can I index images that are inside a PDF?

dadoonet · March 23, 2017, 1:19pm

Do you mean OCR?

If so, this (https://github.com/dadoonet/fscrawler#ocr-integration-using-tika-and-tesseract) is supposed to work. But someone reported that it might actually be buggy:

wolfghost · March 24, 2017, 9:14am

I installed Tesseract, but none of the .exe files inside it run when I execute them.

dadoonet · March 24, 2017, 9:33am

I really don't know, I'm sorry. I need to find some spare time to play with it, but I haven't found a couple of free hours yet.

wolfghost · March 27, 2017, 11:10am

Thanks for helping me

