I am indexing documents created from PDF files. These files are
processed and broken down into pages, each with the text it contains.
A sample document looks like this:
When I search for text with highlighting, I get a response with the
field :highlight => {"attachments.pages.text" => array_of_fragments}.
The problem is that I lose the information about which page/attachment
each highlight belongs to (or any other page fields).
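To make the gap concrete, here is a minimal client-side sketch of a workaround: strip the highlight tags from each fragment and look for the plain text in the pages returned in _source. It assumes the hit includes _source, that highlighting uses the default <em> tags, and that attachments and pages carry "id" and "number" fields. Those two field names and the helper locate_fragments are hypothetical, since the sample document is not shown here. It is not an Elasticsearch feature, and it stays ambiguous when the same text appears on several pages.

```ruby
# Hypothetical helper: map each highlight fragment back to the page(s)
# whose text contains it. Assumes default <em>...</em> highlight tags and
# assumed field names "id" (attachment) and "number" (page).
def locate_fragments(hit, field = "attachments.pages.text")
  fragments = hit.fetch("highlight", {}).fetch(field, [])

  fragments.map do |fragment|
    # Remove the highlight markup to recover the original page text.
    plain = fragment.gsub(%r{</?em>}, "")
    locations = []

    hit["_source"]["attachments"].each do |attachment|
      attachment["pages"].each do |page|
        next unless page["text"].include?(plain)
        locations << { attachment_id: attachment["id"], page: page["number"] }
      end
    end

    { fragment: fragment, locations: locations }
  end
end
```

Each returned entry pairs a fragment with every page that contains it, so an empty or multi-entry :locations array signals a fragment that could not be placed unambiguously.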
I've also tried indexing pages as separate documents, where I get all
the fields back. However, by doing so I cannot find a way to group them
by attachment_id/document_id, and thus I lose the ability to score
attachments/documents with multiple matching pages higher.
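The grouping I am after can be sketched on the client side as follows. This is only an illustration of the desired result, not something Elasticsearch does for me: it assumes each page hit carries "attachment_id" and "page" in _source (the "page" field name and the helper group_page_hits are hypothetical), and it sums the page scores so that attachments with several matching pages rank higher. It only sees the hits that were actually fetched, so it does not combine well with paging through results.

```ruby
# Hypothetical helper: group per-page hits by attachment and rank the
# attachments by the sum of their page scores (highest first).
def group_page_hits(hits)
  hits
    .group_by { |hit| hit["_source"]["attachment_id"] }
    .map do |attachment_id, page_hits|
      {
        attachment_id: attachment_id,
        score: page_hits.sum { |hit| hit["_score"] },
        pages: page_hits.map { |hit| hit["_source"]["page"] }
      }
    end
    .sort_by { |group| -group[:score] }
end
```

Summing is just one choice of how to combine page scores. Taking the maximum, or the maximum plus a small bonus per extra matching page, would fit the same structure.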