Is it necessary to use Ingest Attachment Processor to index pdf files


(Rahul Nama) #1

Hello All

I'm confused about whether to use the Ingest Attachment processor to index PDF files or not.

I'm already converting the PDFs to text and extracting the metadata using Python.
Now I guess I can send the text directly to Elasticsearch.

What's the use of the Ingest Attachment processor then?


(David Pilato) #2

In that case, you don't need it.


(Rahul Nama) #3

@dadoonet

Are there any disadvantages to not using it?

Now when I index, in Kibana I can see the indexed content as:

This is because I haven't formatted/parsed the text after converting it from PDF. Would using the ingest processor ease my work here?


(David Pilato) #4

I don't think it will.


(Rahul Nama) #5

@dadoonet

Okay. If possible, could you please explain the advantages of using the Ingest Attachment processor over converting to text myself and posting the JSON into Elasticsearch?

Thank you :slight_smile:


(David Pilato) #6

Well, Ingest Attachment is an already-working product that you don't have to write and maintain.
On the other hand, it can be limited to certain formats and can consume node memory.
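For reference, a minimal sketch of what using the attachment processor involves. The JSON bodies below are illustrative (the pipeline name and sample bytes are made up); the `attachment` processor itself is real and runs Tika inside the Elasticsearch node on a base64-encoded field:

```python
import base64
import json

# Illustrative body for PUT _ingest/pipeline/attachment.
# The "attachment" processor reads a base64-encoded file from the
# "data" field and extracts text/metadata with Tika on the node.
pipeline = {
    "description": "Extract text from base64-encoded files",
    "processors": [
        {"attachment": {"field": "data"}}
    ],
}

# A document as it would be sent with ?pipeline=attachment:
# the raw file bytes go base64-encoded into the "data" field.
doc = {"data": base64.b64encode(b"%PDF-1.4 fake file bytes").decode("ascii")}

print(json.dumps(pipeline, indent=2))
```

Since the whole encoded file travels through the pipeline and is parsed on the node, this is where the memory cost mentioned above comes from.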

That's one of the reasons I wrote the FSCrawler project, which runs outside Elasticsearch. It uses Tika as well, but the full Tika, which means more file types are supported, including OCR.

But if you are happy with the extraction you did on your side, then just use it. That's perfectly fine IMO.

My 0.0.5 cents.


(Rahul Nama) #7

@dadoonet

Thanks for FSCrawler. I just went through it.

Can you give me some insight into how FSCrawler indexes docs into Elasticsearch? Page by page, or in some other way? And how are the indexes created?

It would also be helpful to know which libraries you used in FSCrawler (I saw Tesseract and Tika).
I understand you have very little time, so please respond as you can. I appreciate your contributions. Thank you so much :slight_smile:


(David Pilato) #8

FSCrawler just embeds Tika; it does what Tika does.
There is no page-by-page extraction in Tika yet. All the content is flattened.

FSCrawler then sends what has been collected to elasticsearch.


(Rahul Nama) #9

Thank you @dadoonet

I will download and go through FSCrawler :slight_smile:


(Rahul Nama) #10

@dadoonet

I've indexed a few files with FSCrawler. Now I would like to improve document relevancy.

As per the documentation, I need to perform an analysis phase (a kind of NLP) to improve relevancy.

How should I apply analyzers/tokenizers once I've indexed the files?

I saw that we need to PUT a mapping with the required analyzer or tokenizer. Could you please explain when to apply these analyzers/tokenizers?


(David Pilato) #11

How should I apply analysers/tokenizers once I've indexed the files.

I guess it's too late. This must be done at index time.
So you define your mapping, with whatever analyzer you want to use on the content field for example, before indexing the first document.
The analyzer will then be used at both index time and search time.
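As a sketch, the index-creation body described above might look like this. The index name `docs` and the choice of the built-in `english` analyzer are illustrative, and the type-less mapping shape assumes Elasticsearch 7+ (older versions nest `properties` under a type name):

```python
import json

# Illustrative body for PUT /docs — create the index with the analyzer
# already set on the content field, BEFORE indexing the first document.
index_body = {
    "mappings": {
        "properties": {
            "content": {"type": "text", "analyzer": "english"}
        }
    }
}

# Note: an existing field's analyzer cannot be changed in place later;
# changing it means creating a new index and reindexing into it.
print(json.dumps(index_body, indent=2))
```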


(Rahul Nama) #12

@dadoonet

Thank you. I will work on this :slight_smile:

In the meantime, we have indexed PDF files, but this is not satisfying our requirement. We have to index the PDFs page by page, but neither Tika nor any other library we've found supports page-by-page extraction.

Requirement: whenever a user searches for something, the relevant page/data (not the whole PDF) should be displayed.

1. Is there any other way to index PDF files page by page?

2. Can we achieve this using the Ingest Attachment processor?

Please let me know; any suggestions will be helpful.

Thanks for your time :slight_smile:


(David Pilato) #13
  1. I don't know.
  2. No, as far as I know, unless you write your own plugin that does that.

(Rahul Nama) #14

@dadoonet

okay :slight_smile:

I will work on it and will let you know If I find any.

Thank you :smile:


(Rahul Nama) #15

Hi @dadoonet

One last question.

If we index a file, is it possible to return only the part of the file (like 5-10 sentences) that matches the user's query?


(David Pilato) #16

Have a look at the highlighting feature.
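A minimal sketch of a search body using highlighting, which returns short matching fragments instead of the whole document. The field name `content`, the query text, and the fragment settings are illustrative; `fragment_size` and `number_of_fragments` are standard highlight options:

```python
import json

# Illustrative search body: match on "content", and ask Elasticsearch
# to return short highlighted fragments alongside each hit.
search_body = {
    "query": {"match": {"content": "user query"}},
    "highlight": {
        "fields": {
            "content": {"fragment_size": 150, "number_of_fragments": 3}
        }
    },
}

print(json.dumps(search_body, indent=2))
```

Each hit then carries a `highlight` section with the matching snippets, which is exactly the "5-10 sentences" behaviour asked about above.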


(Rahul Nama) #17

sure @dadoonet

thanks :slight_smile:


(Tim Allison) #18

If you get traditional XHTML from Tika, it shouldn't be too hard to scrape out the "<div class="page">...</div>" elements. @dadoonet is right, though, that per-page extraction doesn't currently exist in off-the-shelf Tika.
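A rough sketch of that scraping step, assuming Tika's XHTML output wraps each page in a `<div class="page">` element (the sample string below only mimics that shape; it is not real Tika output):

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect the text of each <div class="page"> as a separate string."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._depth = 0   # div-nesting depth while inside a page div
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self._depth == 0:
            if tag == "div" and ("class", "page") in attrs:
                self._depth = 1
                self._buf = []
        elif tag == "div":
            self._depth += 1

    def handle_endtag(self, tag):
        if self._depth and tag == "div":
            self._depth -= 1
            if self._depth == 0:
                self.pages.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


# Sample shaped like Tika's XHTML output for a two-page PDF.
xhtml = (
    '<html><body>'
    '<div class="page"><p>First page text.</p></div>'
    '<div class="page"><p>Second page text.</p></div>'
    '</body></html>'
)
splitter = PageSplitter()
splitter.feed(xhtml)
print(splitter.pages)  # ['First page text.', 'Second page text.']
```

Each entry of `pages` could then be indexed as its own Elasticsearch document (with the file name and a page number field), which gives the per-page search results described earlier in the thread.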


(Rahul Nama) #19

Hey @Tim_Allison

Thanks for the suggestion.

I will work on it and will share what I find. :slight_smile:

-Rahul


(Rahul Nama) #20

hey @Tim_Allison

Thank you so much. It worked and is very close to my exact requirement. :smiley:

Thanks again :wink: