Lucene Documents

tsp · November 5, 2020, 10:04am

Hi all,

I am not sure this is the right place to ask, but we are new to building our own full-text engine. We are planning to index files on the fly using the Lucene Java API.

That would be an easy requirement as such, but (there always is a but, right?) the raw text files we receive are paged using custom page splits.

The requirement is to find the page on which the searched text occurs. A document is therefore a page, not a file. I could write some custom code to split the files into pages before indexing and pass those pages to the indexer. However, since the files are already being scanned during indexing, I would rather customize the parser so that it creates a new document each time a page split is found.
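To make the idea concrete, here is a rough sketch of the splitting step in plain Java, with no Lucene calls in it. It assumes the page split is a line containing only a marker string; `PAGE_BREAK`, `PageSplitter` and `Page` are hypothetical names, and the real marker would be whatever the files actually use. Each `Page` it returns is meant to become one Lucene document, for example with a `TextField` for the text and stored fields for the file name and page number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PageSplitter {

    /** One page of a file; each instance is meant to become one indexed document. */
    static final class Page {
        final String fileName;
        final int pageNumber; // 1-based page number within the file
        final int firstLine;  // 1-based file line on which the page text starts
        final String text;

        Page(String fileName, int pageNumber, int firstLine, String text) {
            this.fileName = fileName;
            this.pageNumber = pageNumber;
            this.firstLine = firstLine;
            this.text = text;
        }
    }

    // Hypothetical marker: a line consisting only of this string ends the current page.
    static final String PAGE_BREAK = "<<PAGE>>";

    /** Walks the lines once and emits a Page every time a page break is seen. */
    static List<Page> split(String fileName, Iterable<String> lines) {
        List<Page> pages = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int lineNo = 0;
        int pageNo = 1;
        int firstLine = 1;
        for (String line : lines) {
            lineNo++;
            if (line.equals(PAGE_BREAK)) {
                // Close the current page (even if empty, so page numbers stay aligned).
                pages.add(new Page(fileName, pageNo, firstLine, current.toString()));
                current.setLength(0);
                pageNo++;
                firstLine = lineNo + 1;
            } else {
                current.append(line).append('\n');
            }
        }
        // Text after the last page break forms the final page.
        if (current.length() > 0) {
            pages.add(new Page(fileName, pageNo, firstLine, current.toString()));
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("alpha", "beta", PAGE_BREAK, "gamma");
        for (Page p : split("sample.txt", raw)) {
            System.out.println(p.fileName + " page " + p.pageNumber
                    + " starts at line " + p.firstLine);
        }
    }
}
```

The lines could come from `BufferedReader.lines()` or `Files.lines(...)`, so the file is still read only once. Keeping `firstLine` on each page is what would later allow a hit to be mapped back to a line in the original file.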

The ideal scenario would be to retrieve the file, the page, or even the line in which the searched text appears.
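For the line part, one option would be to resolve it after the search rather than in the index. The sketch below is plain Java and assumes each page document stores its text together with the file line number on which the page starts; `HitLocator` and `linesContaining` are hypothetical names, and the case-insensitive substring match is only a stand-in for whatever matching the real analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HitLocator {

    /**
     * Returns the 1-based file line numbers of all lines on a page that
     * contain the term, given the file line on which the page starts.
     */
    static List<Integer> linesContaining(String pageText, int firstLineOfPage, String term) {
        List<Integer> hits = new ArrayList<>();
        String needle = term.toLowerCase(Locale.ROOT);
        // Keep trailing empty strings so line offsets stay exact.
        String[] lines = pageText.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(firstLineOfPage + i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // A page that starts at line 4 of its file.
        String page = "gamma\ndelta lucene\n";
        System.out.println(linesContaining(page, 4, "Lucene"));
    }
}
```

With this approach a search hit would identify the file and page from the document's stored fields, and the line would be computed from the stored page text only for the hits that are actually displayed.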

Is that doable? And if yes, would someone have a walkthrough?

It would be highly appreciated if someone has an answer.

warkolm · November 5, 2020, 9:59pm

Welcome to our community!

While we have expertise in Lucene, building something yourself directly on top of it is outside the scope of what we can help with here, sorry to say.

tsp · November 6, 2020, 8:00am

That's ok.

Thanks.

