Getting substrings from indexed PDF files using termvector's position offsets

apanimesh061 · July 24, 2015, 10:10pm

I have indexed some PDF files as Base64 in ES using the attachment plugin. The Search and Termvector APIs are working as expected. However, I would like to get some text from a PDF file using the position offsets of the tokens returned by the Termvector API.

The PDFs are quite large, so I want to avoid external libraries like PDFBox. Is there any built-in utility in ES that could help me get some text out of the indexed Base64 content?
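
To illustrate the idea, here is a minimal Python sketch of the slicing step. It assumes the extracted plain text of the PDF is already available as a string, since the `start_offset`/`end_offset` values are character offsets into the text that was analyzed, not into the Base64 data. The function name `snippets_for_term`, the field name `file`, and the sample response are hypothetical and only mirror the general shape of a term vectors response.

```python
def snippets_for_term(termvectors_response, field, term, text, context=0):
    """Return one substring of `text` per occurrence of `term`.

    Assumes `text` is the same extracted text the analyzer saw, so that
    the character offsets in the response line up with it.
    """
    terms = (
        termvectors_response.get("term_vectors", {})
        .get(field, {})
        .get("terms", {})
    )
    tokens = terms.get(term, {}).get("tokens", [])
    snippets = []
    for token in tokens:
        # Widen the slice by `context` characters, clamped to the text bounds.
        start = max(0, token["start_offset"] - context)
        end = min(len(text), token["end_offset"] + context)
        snippets.append(text[start:end])
    return snippets


# Hypothetical sample data, not output from a real cluster.
text = "Elasticsearch indexes the extracted PDF text."
response = {
    "term_vectors": {
        "file": {
            "terms": {
                "pdf": {
                    "term_freq": 1,
                    "tokens": [
                        {"position": 4, "start_offset": 36, "end_offset": 39}
                    ],
                }
            }
        }
    }
}

print(snippets_for_term(response, "file", "pdf", text))
print(snippets_for_term(response, "file", "pdf", text, context=4))
```

What I am missing is a way to get that extracted text (or a slice of it) back from ES itself instead of re-parsing the PDF on the client.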
