Need some help with Ingest Attachment plugin

I got the plugin to work: I can take a PDF file and index it as base64.
I wanted to know if there is any way to analyze the PDF text and get the word count or the keywords.


Keywords can be extracted if they are available in the document itself as metadata.
The word count is not available. Hopefully at some point you'll be able to get it from metadata as well, once this is merged:

When I use Ingest Attachment I don't get keywords or anything else that can help me.
I just get the file content as a string, so I can search the text in it.
I need to know if there is any way to get keywords or to split the text into words, sentences, etc.

Maybe you could share the binary document you are indexing somewhere? I mean, if the document doesn't have keywords in its metadata, no keywords can be extracted.

But once you have the extracted text, you could set fielddata to true on that field and run a terms aggregation to get the most frequent terms. This is going to use a lot of memory, though, and you will probably get terms like the, a, one...
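
A rough sketch of what that could look like, written as Python dicts that build the request bodies. The assumptions here are mine and not from the thread: the extracted text lives in the processor's default attachment.content field, the index is created with typeless mappings (Elasticsearch 7 or later), and the helper names and the analyzer name content_no_stopwords are made up for the example. The standard analyzer with English stopwords is one possible way to reduce terms like the and a.

```python
import json

# Assumption: default target field of the attachment processor.
FIELD = "attachment.content"


def index_body():
    """Body for creating the index (PUT /<index>), hypothetical helper.

    Enables fielddata on the extracted text so a terms aggregation can run
    on it, and uses a standard analyzer with English stopwords removed.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Hypothetical analyzer name chosen for this example.
                    "content_no_stopwords": {
                        "type": "standard",
                        "stopwords": "_english_",
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "attachment": {
                    "properties": {
                        "content": {
                            "type": "text",
                            "analyzer": "content_no_stopwords",
                            # Memory hungry: terms are loaded onto the heap.
                            "fielddata": True,
                        }
                    }
                }
            }
        },
    }


def top_terms_query(size=20):
    """Body for POST /<index>/_search returning the most frequent terms."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {"top_terms": {"terms": {"field": FIELD, "size": size}}},
    }


if __name__ == "__main__":
    print(json.dumps(index_body(), indent=2))
    print(json.dumps(top_terms_query(10), indent=2))
```

Note that the doc_count of each bucket is the number of documents containing the term, not the number of times the word occurs in the text, so this gives frequent terms rather than a true word count.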

Let's say I'm working on this PDF file:

I'm following this guide:
It looks like the only thing it's doing is taking the base64 data and inserting it into the Elasticsearch DB.
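
For reference, that flow can be sketched roughly like this in Python. It is not the guide's own code: the helper names are made up, and the source field name data is an assumption. It only builds the two JSON bodies, the pipeline definition and the document carrying the base64 string, without sending anything.

```python
import base64


def attachment_pipeline(field="data"):
    """Body for PUT /_ingest/pipeline/<id>, hypothetical helper.

    The attachment processor reads base64 from `field` and writes the
    extracted text and metadata under the `attachment` field by default.
    """
    return {
        "description": "Extract attachment information",
        "processors": [{"attachment": {"field": field}}],
    }


def document_body(pdf_bytes, field="data"):
    """Body for indexing one document with ?pipeline=<id>.

    The binary file is sent as a base64-encoded string in `field`.
    """
    return {field: base64.b64encode(pdf_bytes).decode("ascii")}


if __name__ == "__main__":
    # Placeholder bytes instead of a real PDF, just to show the shape.
    print(attachment_pipeline())
    print(document_body(b"%PDF-1.4 example bytes"))
```

After indexing, the extracted text ends up in attachment.content, and metadata such as attachment.keywords is only present when the document itself carries it.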

AFAICS there are no keywords in this document. That's why none are extracted.

(Mots-clés means keywords)
