Getting substrings from indexed PDF files using termvector's position offsets

(apanimesh061) #1

I have indexed some PDF files as Base64 into ES using the attachment plugin. Search and the Termvector API are working as expected. However, I would like to get some text from the PDF file using the start and end offsets of the tokens returned by the Termvector API.
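To illustrate what I am after, here is a minimal sketch. It assumes the extracted plain text is available as a string, which is exactly the part I do not have. The field name `file`, the sample response, and the helper `slice_by_offsets` are hypothetical; the response shape follows the `term_vectors` / field / `terms` / term / `tokens` layout that the API returns when offsets are enabled.

```python
def slice_by_offsets(termvector_response, field, term, text, window=0):
    """Return the substrings of `text` covered by each token of `term`.

    `window` optionally widens each slice by that many characters on both
    sides. Offsets are assumed to be character offsets into `text`.
    """
    tokens = termvector_response["term_vectors"][field]["terms"][term]["tokens"]
    snippets = []
    for token in tokens:
        start = max(0, token["start_offset"] - window)
        end = min(len(text), token["end_offset"] + window)
        snippets.append(text[start:end])
    return snippets


# Hypothetical extracted text and a hand-written response in the Termvector shape.
sample_text = "The quick brown fox jumps over the quick dog"
sample_response = {
    "term_vectors": {
        "file": {  # assumed field name
            "terms": {
                "quick": {
                    "term_freq": 2,
                    "tokens": [
                        {"position": 1, "start_offset": 4, "end_offset": 9},
                        {"position": 7, "start_offset": 35, "end_offset": 40},
                    ],
                }
            }
        }
    }
}

matches = slice_by_offsets(sample_response, "file", "quick", sample_text)
with_context = slice_by_offsets(sample_response, "file", "quick", sample_text, window=4)
```

The slicing itself is trivial; what I am missing is a way to get the equivalent of `sample_text` back out of the index.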

The PDFs are quite big, so I want to avoid using external libraries like PDFBox. Is there any built-in utility in ES that could help me get some text from the indexed Base64 content?

(system) #2