Nested objects and fragment highlighting


(johno-2) #1

Hi there,

I am indexing documents created from pdf files. These files are
processed and broken down to pages with text contained on them. Sample
document looks like this:

{:name => "A name",
 :attachments => [
   {:name => "first attachment", :pages => [
     {:number => 1, :text => "page 1 contents"},
     {:number => 2, :text => "page 2 contents"}
   ]}
 ]}

When I search for text with highlighting, I get a response with the
field :highlight => {"attachments.pages.text" => array_of_fragments}.

The problem is that I lose the information about which page and
attachment each highlight belongs to (as well as any other page fields).
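
For illustration, a search body of roughly this shape produces the highlight response described above. This is a minimal sketch: the query_string query and the search term "contents" are assumptions, not taken from the actual setup; only the field name attachments.pages.text comes from the post.

```ruby
require "json"

# Search body built as a Ruby hash and serialized to JSON.
# The query type and the term "contents" are placeholders (assumptions).
search_body = {
  :query => {
    :query_string => {
      :default_field => "attachments.pages.text",
      :query => "contents"
    }
  },
  # Ask for highlighted fragments from the nested page text field.
  :highlight => {
    :fields => { "attachments.pages.text" => {} }
  }
}

puts JSON.pretty_generate(search_body)
```

The fragments come back keyed only by the field path, which is why the page number and attachment name are not recoverable from the highlight alone.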

I've also tried indexing pages as separate documents, in which case I
get all the fields back. However, I then cannot find a way to group
them by attachment_id/document_id, so I lose the ability to give
attachments/documents with multiple matching pages a better score.
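
To make the grouping requirement concrete, here is a sketch of doing it on the client side, after the page hits have been fetched. The hit structure, the field names and the choice to sum page scores are all assumptions for illustration; this is not something the search engine does for me.

```ruby
# Hypothetical page hits, one per matching page document.
# Field names (:document_id, :attachment_id, :number, :score) are assumptions.
page_hits = [
  {:document_id => 1, :attachment_id => 10, :number => 1, :score => 1.0},
  {:document_id => 2, :attachment_id => 20, :number => 1, :score => 1.25},
  {:document_id => 1, :attachment_id => 10, :number => 2, :score => 0.5}
]

# Group page hits by their parent document and rank documents by the
# sum of their page scores, so several matching pages outrank one.
def group_by_document(hits)
  ranked = hits.group_by { |hit| hit[:document_id] }.map do |document_id, pages|
    {
      :document_id => document_id,
      :score => pages.inject(0.0) { |sum, page| sum + page[:score] },
      :pages => pages.sort_by { |page| -page[:score] }
    }
  end
  ranked.sort_by { |doc| -doc[:score] }
end

group_by_document(page_hits).each do |doc|
  puts "document #{doc[:document_id]}: score #{doc[:score]}, " \
       "pages #{doc[:pages].map { |page| page[:number] }.join(', ')}"
end
```

The drawback is that this only regroups the hits that were actually returned, so the ranking is wrong whenever matching pages fall outside the requested result size.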

Any ideas how to solve this?

johno


(David Pilato) #2

Perhaps you could index individual pages instead of a full document containing many pages, since pages are what you search for.

Using a parent/child relationship (_parent), you could perhaps link each page to its document.

I haven't tried it myself yet, only thought about it, so I don't know whether it will work.
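
A parent/child lookup of this kind might be expressed with Elasticsearch's has_child query. This is only an untested sketch: the child type name "page", the field "text" and the search term are assumptions, and availability and scoring behaviour depend on the Elasticsearch version.

```ruby
require "json"

# Find parent documents that have at least one matching child page.
# "page" (child type), "text" (field) and "contents" (term) are assumptions.
search_body = {
  :query => {
    :has_child => {
      :type => "page",
      :query => {
        :query_string => {
          :default_field => "text",
          :query => "contents"
        }
      }
    }
  }
}

puts JSON.pretty_generate(search_body)
```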

Hope this helps
David :wink:

In reply to johno (johno@jsmf.net), 27 June 2011 at 13:39.


(johno-2) #3

Hi David, thanks for the response, but as I wrote in my first post,
I've already tried that, and it leads to another problem with scoring.

In reply to David Pilato (da...@pilato.fr), Jun 27, 1:53 pm.


(David Pilato) #4

Sorry, I didn't read the end of your post carefully :frowning: Shame on me!

So I have no other idea, but I would love to see a solution for that.

On my project, we have built a "group"-like solution. We have two fields:
Field1
Field2

We create an EsField which contains the content of Field1 + a separator + the content of Field2,
such as content1KKKKcontent2

So if I want documents containing a line with content1 AND content2, I can search for content1KKKKcontent2.
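
As a rough sketch of how such a combined value can be built at indexing time (the helper name combined_field and the input hash are hypothetical; only the KKKK separator and the sample values come from the example above):

```ruby
# Separator placed between the two field values; "KKKK" is the one used above.
SEPARATOR = "KKKK"

# Hypothetical helper: join the two field values into one searchable string.
def combined_field(field1, field2)
  [field1, field2].join(SEPARATOR)
end

line = {:field1 => "content1", :field2 => "content2"}
es_field = combined_field(line[:field1], line[:field2])
puts es_field
```

The separator has to be a string that never occurs in the real field values, and the combined field should be indexed so that it is matched as a whole rather than split into separate terms.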

I'm not convinced that it would be useful for your use case.

Cheers
David :wink:

In reply to johno (johno@jsmf.net), 27 June 2011 at 14:15.

