Highlighting with parent/child queries

I am quite interested in using the top_children query, but I would need to get information about the matching children, mainly their ids and some highlighting. I know this is not implemented, and I would like to provide an implementation.

First, is anybody already working on it?

If not, here is a suggested API.
On the query side, we would specify that we want the children returned along with their parent's data, optionally with highlighting on the children's fields.
{
  "query" : {
    "top_children" : {
      "type" : "subcontent",
      "query" : {
        "term" : {
          "name" : "bike"
        }
      },
      "score" : "max",
      "factor" : 5,
      "incremental_factor" : 2
    }
  },
  "children" : { // just having this tag means we want at least the ids of the children
    "size" : 2, // maximum number of children to return per parent
    "full_data" : false, // if true, return not just the id of the child but also its data
    "highlight" : { // same syntax as for normal queries
      "fields" : {
        "name" : {}
      }
    }
  }
}

The results would then look something like this:
"hits" : {
  "total" : 7,
  "max_score" : 3.366573,
  "hits" : [
    {
      "_index" : "es",
      "_type" : "twitter",
      "_id" : 8001,
      "_score" : 3.366573,
      "children" : {
        "total" : 4,
        "hits" : [ // hits on the children of the current parent, ordered by score
          {
            "_id" : 654,
            "highlight" : {
              "name" : [ "my lovely bike" ]
            }
          },
          {
            "_id" : 987,
            "highlight" : {
              "name" : [ "my nice bike" ]
            }
          }
        ]
      }
    },
    {
      "_index" : "es",
      "_type" : "twitter",
      "_id" : 8004,
      .... etc....
}

I think this structure covers everything.

Now about the implementation.

First, the information about the children needs to be kept between the time the query is executed and the time the highlight phase happens. The simplest way of doing that seems to be to have the query implementation hold it. Later on, since the highlight phase has access to the context and thus to the query, the parent/child association is accessible there, ordered by score.
I am starting to think about an interface, implemented by both TopChildrenQuery and BlockJoinQuery, which would provide the child info for a parent doc, ordered by score.
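
To make the interface idea concrete, here is a minimal Java sketch. All names in it (ChildHitsSketch, ChildHit, ChildHitsProvider, RecordingProvider) are hypothetical and are not taken from the Elasticsearch code base; the sketch uses plain string ids instead of Lucene doc ids so that it runs on its own. It only illustrates the contract: record child hits while the query runs, then hand back the best ones per parent.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChildHitsSketch {

    /** One matching child: its id and the score given by the child query. */
    static final class ChildHit {
        final String id;
        final float score;

        ChildHit(String id, float score) {
            this.id = id;
            this.score = score;
        }
    }

    /** Hypothetical interface that both parent/child query types could implement. */
    interface ChildHitsProvider {
        /** Total number of matching children recorded for this parent. */
        int totalChildren(String parentId);

        /** At most 'size' children of this parent, best score first. */
        List<ChildHit> topChildren(String parentId, int size);
    }

    /** Simple in-memory implementation: the query records hits as it executes. */
    static final class RecordingProvider implements ChildHitsProvider {
        private final Map<String, List<ChildHit>> byParent = new HashMap<>();

        void record(String parentId, String childId, float score) {
            byParent.computeIfAbsent(parentId, k -> new ArrayList<>())
                    .add(new ChildHit(childId, score));
        }

        @Override
        public int totalChildren(String parentId) {
            List<ChildHit> hits = byParent.get(parentId);
            return hits == null ? 0 : hits.size();
        }

        @Override
        public List<ChildHit> topChildren(String parentId, int size) {
            List<ChildHit> hits = byParent.get(parentId);
            if (hits == null) {
                return new ArrayList<>();
            }
            // Sort a copy so the recorded order is left untouched.
            List<ChildHit> sorted = new ArrayList<>(hits);
            sorted.sort(Comparator.comparingDouble((ChildHit h) -> h.score).reversed());
            return new ArrayList<>(sorted.subList(0, Math.min(size, sorted.size())));
        }
    }

    public static void main(String[] args) {
        RecordingProvider provider = new RecordingProvider();
        provider.record("8001", "987", 1.5f);
        provider.record("8001", "654", 2.0f);
        for (ChildHit hit : provider.topChildren("8001", 2)) {
            System.out.println(hit.id + " " + hit.score);
        }
    }
}
```

Note that this naive version keeps every recorded child hit in memory until the results are read, so its memory cost grows with the number of matching children.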

We would also need an implementation of FieldQuery that only looks into TopChildrenQuery and BlockJoinQuery in order to decompose their sub-queries. Likewise, an interface would expose on both queries an accessor to the query on the children.
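
As a rough illustration of that accessor and of the decomposition step, here is a self-contained Java sketch. The Query, TermQuery, BooleanQuery and TopChildren types below are toy stand-ins written for this example, not the real Lucene or Elasticsearch classes, and HasChildQuery and flattenChildTerms are hypothetical names. The sketch walks a query tree and keeps only the terms found inside a child query, which are the ones child highlighting would care about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChildQueryFlattenSketch {

    /** Toy stand-in for a query node; not org.apache.lucene.search.Query. */
    interface Query {}

    static final class TermQuery implements Query {
        final String field;
        final String text;

        TermQuery(String field, String text) {
            this.field = field;
            this.text = text;
        }
    }

    static final class BooleanQuery implements Query {
        final List<Query> clauses;

        BooleanQuery(Query... clauses) {
            this.clauses = Arrays.asList(clauses);
        }
    }

    /** Hypothetical accessor that both parent/child query types would expose. */
    interface HasChildQuery {
        Query childQuery();
    }

    /** Toy stand-in for a parent/child query wrapping a query on the children. */
    static final class TopChildren implements Query, HasChildQuery {
        final String childType;
        final Query child;

        TopChildren(String childType, Query child) {
            this.childType = childType;
            this.child = child;
        }

        @Override
        public Query childQuery() {
            return child;
        }
    }

    /** Collects the term queries that sit inside a child query, ignoring parent-side terms. */
    static List<TermQuery> flattenChildTerms(Query query) {
        List<TermQuery> out = new ArrayList<>();
        collect(query, false, out);
        return out;
    }

    private static void collect(Query q, boolean insideChild, List<TermQuery> out) {
        if (q instanceof HasChildQuery) {
            // Descend into the query that runs against the children.
            collect(((HasChildQuery) q).childQuery(), true, out);
        } else if (q instanceof BooleanQuery) {
            for (Query clause : ((BooleanQuery) q).clauses) {
                collect(clause, insideChild, out);
            }
        } else if (q instanceof TermQuery && insideChild) {
            out.add((TermQuery) q);
        }
    }

    public static void main(String[] args) {
        Query query = new BooleanQuery(
                new TermQuery("title", "shop"),
                new TopChildren("subcontent", new TermQuery("name", "bike")));
        for (TermQuery term : flattenChildTerms(query)) {
            System.out.println(term.field + ":" + term.text);
        }
    }
}
```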

As for highlighting the children, the current code in HighlightPhase seems to cover almost 80% of what is needed. I guess a refactoring is needed to extract from that code an ESHighlighter, which would be used for both the parent highlighting and the child highlighting.

Since everything is done document by document, I don't see any issue regarding sharding (though I am far from an expert in that area).

I don't know yet how to properly get the data of the children, but I would probably take a lot of inspiration from the fetch phase.

I haven't started coding yet and I don't know the elasticsearch code that well, so: does this at least make sense?

Nicolas

No news so I'll start working on it.

Nicolas

(In reply to Nicolas Lalevée's message of 22 August 2011 at 17:17.)

As always, I was busy before I could actually start working on it. But here it is: I have a working prototype.
You can see it in the "children" branch on my github [1].

It works nicely with TopChildrenQuery, but not with BlockJoinQuery. While it is pretty simple to work with real Lucene documents in the parent/child model, it is not in the nested document model.
And since I am only interested in TopChildrenQuery myself, I have not dug further. I would be happy to implement it correctly if I were given some hints.

Any comments on the code I wrote would be greatly appreciated.

If there is interest, I can turn it into a pull request against master and also work on patching the docs.

Nicolas

[1] Commits · hibnico/elasticsearch · GitHub

(In reply to Nicolas Lalevée's message of 29 August 2011 at 13:20.)

Heya, I'll have a look at the implementation, can you squash the commits so
it will be simpler to review it?

(In reply to the message of 2011/9/22 from Nicolas Lalevée, nicolas.lalevee@hibnet.org.)

Done, on my master.

cheers,
Nicolas

(In reply to Shay Banon's message of 24 September 2011 at 20:16.)

Hey Nicolas, I compiled your 0.18 snapshot and got it to work in principle.

If I curl a query like the one you proposed and the search string doesn't match anything, everything is fine and I just get an empty result.
But if I search for something that exists, ES throws a NullPointerException instead of returning any result:

"error" : "SearchPhaseExecutionException[Failed to execute phase [query_fetch], total failure; shardFailures {NullPointerException[null]}]",
"status" : 500

This looks like a bug to me, because the rest is working fine.

I can attach the query if you want it.

Thanks for your help!

Sorry for the late answer, I didn't see your mail before today.

I regularly rebase my work on the elasticsearch master. I did more testing today and it works fine. I am becoming confident in the current implementation and have started integrating it into our webapp, so it will probably go to production.

Nicolas

(In reply to Frifri's message of 25 October 2011 at 23:55.)

Is this likely to make it into the official version? (looks very useful)

It is very useful. The implementation looks good; the main problem is that it's very memory intensive. That's not to say I have a better solution that wouldn't take a lot of work, but for now it's not good enough for the general use case because of its overhead.

(In reply to the message of Sun, Jan 8, 2012 at 5:08 PM from Nick Sellen, talktome@nicksellen.co.uk.)

To anyone who was following my branch: after some intense "marketing" debate, it turns out this feature won't be useful for me.
I'll still continue to rebase it regularly on master and then on the upcoming 0.19.x branch, but that will probably be all.

cheers,
Nicolas

(In reply to Shay Banon's message of 9 January 2012 at 21:03.)
