Is it necessary to use Ingest Attachment Processor to index pdf files


(Rahul Nama) #1

Hello All

I'm confused about whether to use the Ingest Attachment processor to index PDF files or not.

I'm already converting the PDFs to text and extracting the metadata using Python.
Now I guess I can send the text directly to Elasticsearch.

What's the use of the Ingest Attachment processor then?


(David Pilato) #2

In that case, you don't need it.


(Rahul Nama) #3

@dadoonet

Are there any disadvantages to not using it?

Now when I index, in Kibana I can see the indexed content as:

This is because I haven't formatted/parsed the text after converting it from PDF. Would using the ingest processor ease my work here?


(David Pilato) #4

I don't think it will.


(Rahul Nama) #5

@dadoonet

Okay. If possible, could you please explain the advantages of using the Ingest Attachment processor over converting to text myself and posting the JSON into Elasticsearch?

Thank you :slight_smile:


(David Pilato) #6

Well, Ingest Attachment is an already-working product that you don't have to write and maintain.
On the other hand, it can be limited to certain formats and can consume node memory.
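For reference, a minimal sketch of what using the attachment processor involves. The JSON bodies below are illustrative (the pipeline name and sample bytes are made up); the `attachment` processor itself is real and runs Tika inside the Elasticsearch node on a base64-encoded field:

```python
import base64
import json

# Illustrative body for PUT _ingest/pipeline/attachment.
# The "attachment" processor reads a base64-encoded file from the
# "data" field and extracts text/metadata with Tika on the node.
pipeline = {
    "description": "Extract text from base64-encoded files",
    "processors": [
        {"attachment": {"field": "data"}}
    ],
}

# A document as it would be sent with ?pipeline=attachment:
# the raw file bytes go base64-encoded into the "data" field.
doc = {"data": base64.b64encode(b"%PDF-1.4 fake file bytes").decode("ascii")}

print(json.dumps(pipeline, indent=2))
```

Since the whole encoded file travels through the pipeline and is parsed on the node, this is where the memory cost mentioned above comes from.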

That's one of the reasons I wrote the FSCrawler project, which runs outside Elasticsearch. It uses Tika as well, but the full Tika, which means more file types are supported, including OCR.

But if you are happy with the extraction you did on your side, then just use it. That's perfectly fine IMO.

My 0.0.5 cents.


(Rahul Nama) #7

@dadoonet

Thanks for FSCrawler. I just went through it.

Can you give me some insight into how FSCrawler indexes docs into Elasticsearch? Page by page, or in some other way? And how are the indexes created?

It would also be helpful to know which libraries you used in FSCrawler (I saw Tesseract and Tika).
I understand you have very little time, so please respond as you can. I appreciate your contributions. Thank you so much :slight_smile:


(David Pilato) #8

FSCrawler just embeds Tika; it does what Tika does.
There is no page-by-page extraction in Tika yet. All the content is flattened.

FSCrawler then sends what has been collected to elasticsearch.


(Rahul Nama) #9

Thank you @dadoonet

I will download and go through FSCrawler :slight_smile:


(Rahul Nama) #10

@dadoonet

I've indexed a few files with FSCrawler. Now I would like to improve document relevancy.

As per the documentation, I need to perform an analysis phase (a kind of NLP) to improve relevancy.

How should I apply analyzers/tokenizers once I've indexed the files?

I saw that we need to PUT a mapping with the required analyzer or tokenizer. Could you please explain when to apply these analyzers/tokenizers?


(David Pilato) #11

How should I apply analysers/tokenizers once I've indexed the files.

I guess it's too late. This must be done at index time.
So you define your mapping, with whatever analyzer you want to use on the content field for example, before indexing the first document.
The analyzer will then be used at both index time and search time.
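As a sketch, the index-creation body described above might look like this. The index name `docs` and the choice of the built-in `english` analyzer are illustrative, and the type-less mapping shape assumes Elasticsearch 7+ (older versions nest `properties` under a type name):

```python
import json

# Illustrative body for PUT /docs — create the index with the analyzer
# already set on the content field, BEFORE indexing the first document.
index_body = {
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "english"}
        }
    }
}

# Note: an existing field's analyzer cannot be changed in place later;
# changing it means creating a new index and reindexing into it.
print(json.dumps(index_body, indent=2))
```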


(Rahul Nama) #12

@dadoonet

Thank you. I will work on this :slight_smile:

In the meantime, we have indexed PDF files, but this is not satisfying our requirement. We have to index the PDFs page by page, but neither Tika nor any other library we've found supports page-by-page extraction.

Requirement: whenever a user searches for something, the relevant page/data (not the whole PDF) should be displayed.

1. Is there any other way to index PDF files page by page?

2. Can we achieve this using the Ingest Attachment processor?

Please let me know; any suggestions will be helpful.

Thanks for your time :slight_smile:


(David Pilato) #13
  1. I don't know.
  2. No, as far as I know, unless you write your own plugin that does that.

(Rahul Nama) #14

@dadoonet

okay :slight_smile:

I will work on it and will let you know If I find any.

Thank you :smile:


(Rahul Nama) #15

Hi @dadoonet

One last question.

If we index a file, is it possible to return only the part of the file (like 5-10 sentences) that matches the user's query?


(David Pilato) #16

Have a look at the highlighting feature.
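A minimal sketch of a search body using highlighting, which returns short matching fragments instead of the whole document. The field name `content`, the query text, and the fragment settings are illustrative; `fragment_size` and `number_of_fragments` are standard highlight options:

```python
import json

# Illustrative search body: match on "content", and ask Elasticsearch
# to return short highlighted fragments alongside each hit.
search_body = {
    "query": {"match": {"content": "user query"}},
    "highlight": {
        "fields": {
            "content": {"fragment_size": 150, "number_of_fragments": 3}
        }
    },
}

print(json.dumps(search_body, indent=2))
```

Each hit then carries a `highlight` section with the matching snippets, which is exactly the "5-10 sentences" behaviour asked about above.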


(Rahul Nama) #17

sure @dadoonet

thanks :slight_smile:


(Tim Allison) #18

If you get traditional XHTML from Tika, it shouldn't be too hard to scrape out the "<div class="page">...</div>" elements. @dadoonet is right, though, that per-page extraction doesn't currently exist in off-the-shelf Tika.
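A rough sketch of that scraping step, assuming Tika's XHTML output wraps each page in a `<div class="page">` element (the sample string below only mimics that shape; it is not real Tika output):

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect the text of each <div class="page"> as a separate string."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._depth = 0   # div-nesting depth while inside a page div
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._depth == 0:
            if tag == "div" and ("class", "page") in attrs:
                self._depth = 1
                self._buf = []
        elif tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self.pages.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


# Sample shaped like Tika's XHTML output for a two-page PDF.
xhtml = (
    '<html><body>'
    '<div class="page"><p>First page text.</p></div>'
    '<div class="page"><p>Second page text.</p></div>'
    '</body></html>'
)
splitter = PageSplitter()
splitter.feed(xhtml)
print(splitter.pages)  # ['First page text.', 'Second page text.']
```

Each entry of `pages` could then be indexed as its own Elasticsearch document (with the file name and a page number field), which gives the per-page search results described earlier in the thread.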


(Rahul Nama) #19

Hey @Tim_Allison

Thanks for the suggestion.

I will work on it and will share what I find. :slight_smile:

-Rahul


(Rahul Nama) #20

hey @Tim_Allison

Thank you so much. It worked and is very close to my exact requirement. :smiley:

Thanks again :wink: