Need some help with Ingest Attachment plugin

kanon123 · April 29, 2018, 8:29am

Hi,
I got the plugin to work: I can take a PDF file and index it as base64.
Is there any way to analyze the PDF text and get the word count or the keywords?

Thanks.

dadoonet · April 30, 2018, 7:09am

Keywords might be extracted if they are available in the document itself as metadata.
Word counts are not available. Hopefully at some point you will be able to get them from metadata as well, once this is merged:

kanon123 · April 30, 2018, 7:48am

When I use Ingest Attachment I don't get keywords or anything else that can help me.
I just get the file content as a string, so I can search the text in it.
I need to know if there is any way to get keywords or to split all the text into words, sentences, etc.
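
For reference, splitting already-extracted text into words and sentences can also be done on the client side, outside Elasticsearch. The following is a minimal sketch using only the Python standard library. The function names and the regular expressions are illustrative assumptions (the sentence splitting in particular is naive), and none of this is something the plugin does itself.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive split: break on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(text):
    # Lowercase the text and keep runs of letters only (no digits, no underscores).
    return re.findall(r"[^\W\d_]+", text.lower())


def word_stats(text, top_n=5):
    # Summarise the text: total words, total sentences, most frequent words.
    words = split_words(text)
    return {
        "word_count": len(words),
        "sentence_count": len(split_sentences(text)),
        "top_words": Counter(words).most_common(top_n),
    }
```

Calling `word_stats` on the `attachment.content` string returned by a search would give a rough word count and frequency list. Common words such as "the" would still need to be filtered out.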

dadoonet · April 30, 2018, 8:07am

Maybe you could share the binary document that you are indexing somewhere? If the document has no keywords in its metadata, no keywords can be extracted.

But once you have the extracted text, you could set fielddata to true on that field and run a terms aggregation to get the most frequent terms. This is going to use a lot of memory, though, and you will probably get terms like "the", "a", "one"...
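
The fielddata and terms aggregation idea can be sketched as two request bodies, built here as plain Python dictionaries. This is only a sketch under assumptions: the aggregation name `frequent_terms` and the stop-word set are hypothetical, and `attachment.content` is assumed to be the field holding the extracted text (the attachment processor's default target). Note that `doc_count` in a terms aggregation is the number of documents containing the term, not the number of occurrences.

```python
import json

# Assumed location of the extracted text (default target of the attachment processor).
CONTENT_FIELD = "attachment.content"

# Mapping fragment that enables fielddata on the extracted text field.
mapping_body = {
    "properties": {
        "attachment": {
            "properties": {
                "content": {"type": "text", "fielddata": True}
            }
        }
    }
}

# Search body: no hits, only the 20 most frequent terms of the content field.
search_body = {
    "size": 0,
    "aggs": {
        "frequent_terms": {
            "terms": {"field": CONTENT_FIELD, "size": 20}
        }
    },
}

# Illustrative stop-word set; a real list would be much longer.
STOP_WORDS = {"the", "a", "one", "and", "of", "to"}


def drop_stop_words(buckets):
    # Filter aggregation buckets ({"key": ..., "doc_count": ...}) client-side.
    return [b for b in buckets if b["key"] not in STOP_WORDS]


if __name__ == "__main__":
    print(json.dumps(mapping_body, indent=2))
    print(json.dumps(search_body, indent=2))
```

The mapping body would go to the index's mapping endpoint and the search body to `_search`. Filtering stop words on the client is just one option; using an analyzer with a stop filter on the field would avoid indexing them in the first place.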

kanon123 · April 30, 2018, 8:12am

Let's say I'm working on this PDF file:
http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

I'm following this guide: https://github.com/rahulsinghai/elasticsearch-ingest-attachment-plugin-example
It looks like the only thing it does is take the base64 data and insert it into Elasticsearch.
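
Roughly, what that amounts to can be sketched as follows. This is a hedged illustration, not the guide's actual code: the field name `data` and the sample bytes are assumptions, and the dictionaries are only the JSON bodies that would be sent to `_ingest/pipeline/<name>` and to the index endpoint with `?pipeline=<name>`.

```python
import base64

# Pipeline definition: a single attachment processor that reads base64 from "data".
pipeline_body = {
    "description": "Extract attachment information",
    "processors": [{"attachment": {"field": "data"}}],
}


def build_document(pdf_bytes: bytes) -> dict:
    # The file content travels as base64 text in the field the processor reads.
    return {"data": base64.b64encode(pdf_bytes).decode("ascii")}


# Stand-in bytes; in practice these would come from open(path, "rb").read().
doc = build_document(b"%PDF-1.4 sample bytes")
```

After ingestion, the extracted text and any metadata found in the file end up under the `attachment` field of the indexed document.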

dadoonet · April 30, 2018, 8:29am

AFAICS there is no keyword in this document. That's why it's not extracted.

("Mots-clés" is French for "keywords".)
