Large (stored) fields, json source and highlighting


(Tomislav Poljak) #1

Hi,
I really like all the features that stored JSON (enabled _source) provides
in both the Java API (I used it for indexing/updating) and the REST API
(I used it for searching).

I do however have one possible request for improvement regarding
documents with large textual fields and overall highlighting.

When there is a requirement to index/search documents with large textual
fields (like 'content' with megabytes of text, which is not unusual),
returning the whole JSON for each result in the result set can be
impossible (if each JSON document has a few MB in 'content', returning
30-50 results to a client doesn't sound realistic, or even possible in
acceptable time, with usual 'Internet' bandwidth).

But usually it's acceptable (or even a requirement) to display/return
only highlighting snippets for the 30-50 matching results and to retrieve
the whole document (JSON source) only for a single document (when
requested by exact ID).

To be able to provide highlighting snippets for a large textual
('content') field, that field needs to be stored.

Now we are in a situation where, because the textual field is too big
(which makes it impossible to return the JSON source for 30-50 results to
a client), we need to store it twice in the index (once as part of the
original JSON source in _source, and a second time as a 'content' field
with "store" : "yes" for highlighting). This makes the index a lot bigger.
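
To make the "stored twice" cost concrete, here is a rough Python sketch. It
is only an illustration: the function name stored_bytes and the sample
document are made up, and it ignores compression and every other kind of
index overhead, so it does not reflect real on-disk sizes.

```python
import json


def stored_bytes(doc, stored_fields, source_enabled=True):
    """Rough byte count kept for one document.

    Hypothetical helper for illustration only: it ignores compression,
    term vectors and all other index overhead.
    """
    total = 0
    if source_enabled:
        # The whole original JSON is kept once as _source.
        total += len(json.dumps(doc).encode("utf-8"))
    for name in stored_fields:
        # Each separately stored field keeps its own copy of the value.
        total += len(str(doc[name]).encode("utf-8"))
    return total


doc = {"title": "Report", "content": "x" * 1000}
only_source = stored_bytes(doc, stored_fields=[])
source_plus_field = stored_bytes(doc, stored_fields=["content"])
print(only_source, source_plus_field)
```

The second number is larger by exactly the size of 'content', which is the
duplication described above.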

Also, if there is a requirement to display highlighting for all fields
(separate highlight snippets for each field where a match occurred,
without mixing snippets from different fields, so a stored _all field
cannot be used for highlighting in such a case), then the whole document
(all fields) needs to be stored twice.

In this case it seems only logical to disable the _source field (since all
fields are stored anyway) and, when the whole document needs to be
retrieved, use the (newly added) fields=* feature (I've read a similar
discussion thread which led to this enhancement).

Here is my question/proposal:

Would it be possible to enable use of the JSON _source field for 'field
specific' highlighting, where the matching snippet needs to be returned
separately for each field name?

Maybe have a term_vector for each field, but somehow 'adjust' or
recalculate the positions_offsets to point to the text snippet in _source
instead of the stored field?

I know this is not a simple requirement, but if 'field specific'
highlighting could somehow use the stored JSON instead of requiring each
individual field to be (separately) stored, that would make great
use/reuse of the stored JSON _source (and no one would ever think of
disabling it) :)

Tomislav


(Shay Banon) #2

Agreed. A bit tricky to implement, but possible. Also, note that this will
require loading the full JSON from the index and parsing it in order to get
the relevant parts from it. It won't be returned, but it is still loaded.
So, it might not make sense when you have several big fields in the index
and you want to get fragments for only one of them. But it does make sense
when there is one big field.
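
A minimal Python sketch of that idea, not the actual implementation: the
function name highlight_from_source, the radius parameter and the plain
regex matching are all assumptions made for illustration. It shows that the
whole source is parsed even though only one field is used.

```python
import json
import re


def highlight_from_source(source_json, field, term, radius=20,
                          pre="<em>", post="</em>"):
    """Build snippets for one field out of the full JSON source.

    Hypothetical helper: real highlighting works on analyzed terms,
    this sketch just does a case-insensitive substring match.
    """
    doc = json.loads(source_json)   # the full source is loaded and parsed...
    text = doc.get(field, "")       # ...even though only one field is used
    snippets = []
    for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, m.start() - radius)
        end = min(len(text), m.end() + radius)
        snippets.append(
            text[start:m.start()] + pre + m.group(0) + post + text[m.end():end]
        )
    return snippets


source = json.dumps({"title": "Intro", "content": "We use Java daily."})
print(highlight_from_source(source, "content", "java", radius=3))
```

With several big fields in the source, json.loads still pays for all of
them, which is the cost mentioned above.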

Also, if this is implemented, it should also be possible to get specific
fields out of the JSON in the response (similar to asking for specific
fields in the search request, maybe call them source_fields).
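
As a sketch of what such a "source fields" option could do (the function
name source_fields and its behavior are assumptions, not an existing API),
in Python:

```python
import json


def source_fields(source_json, wanted):
    """Return only the requested top-level fields from the JSON source.

    Hypothetical helper: the full source is still parsed, only the
    response gets smaller.
    """
    doc = json.loads(source_json)
    return {name: doc[name] for name in wanted if name in doc}


source = json.dumps({"name": "Ann", "address": "Prague", "bio": "long text"})
print(source_fields(source, ["name", "address"]))
```

Fields that are not present in the source are simply skipped in this
sketch; a real option might treat that case differently.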

Open an issue for this?

-shay.banon

On Thu, Aug 12, 2010 at 6:06 PM, Tomislav Poljak tpoljak@gmail.com wrote:



(Lukáš Vlček) #3

If I read it correctly then I think it partly overlaps with
http://github.com/elasticsearch/elasticsearch/issues/issue/308

Lukas

On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon shay.banon@elasticsearch.com wrote:



(Lukáš Vlček) #4

One of the differences is that issue 308 was meant to return the whole
content of the _source or some of its fields (or stored fields if not using
"_source"). But the point is that the user should be able to specify the
Fragmenter type (or provide a custom implementation of Fragmenter).

On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:



(Shay Banon) #5

Not sure if it overlaps: the fragmenter controls how to break up the
highlighted data, while this relates to how to fetch the data to highlight.

-shay.banon

On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:



(Lukáš Vlček) #6

Actually, that ticket has two parts. One is Fragmenter related and the other
is the possibility to say that I want to highlight some portion of the
_source data. Imagine I am using only the REST API and, for example, _source
is a person with name, address and bio fields. Then I would like to say that
I want to highlight just the bio field (and I think the NullFragmenter would
be needed for this if I want to display the whole content of bio
highlighted, not just fragments). The other possibility would be to define
the mapping for person in such a way that bio is a stored field; then I
could query for stored fields (not pulling the _source field) and say that I
want to apply NullFragmenter to this data while highlighting. But this gets
back to Tomislav's situation, because it would mean that bio is probably
stored twice, once as a part of the source and then separately as a stored
bio field.
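
The difference between getting fragments and getting the whole field back
highlighted can be sketched in Python. This is only a stand-in for the idea:
the function names are made up, and the fixed-size splitting is a
simplification that can cut a word (or a match) at a chunk boundary.

```python
import re


def simple_fragmenter(text, size=30):
    """Split text into fixed-size chunks (simplified, may cut words)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def null_fragmenter(text):
    """Treat the whole field value as a single fragment."""
    return [text]


def highlight(text, term, fragmenter, pre="<em>", post="</em>"):
    """Return the fragments that contain the term, with matches marked."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    marked = []
    for fragment in fragmenter(text):
        if pattern.search(fragment):
            marked.append(
                pattern.sub(lambda m: pre + m.group(0) + post, fragment)
            )
    return marked


bio = "Ten years of Java. Also some Python and Go experience."
print(highlight(bio, "java", null_fragmenter))
print(highlight(bio, "java", lambda t: simple_fragmenter(t, 20)))
```

With null_fragmenter the whole bio comes back with the match marked; with
the chunking fragmenter only the chunk around the match is returned.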

Lukas

On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon shay.banon@elasticsearch.com wrote:



(Shay Banon) #7

So, what you want is to be able to get just the bio field, without the full
source, and without the bio field being stored? If so, then the response I
gave applies here: the same logic might also be used to get fields through
something like a "source_field" notion. It does mean that the full source
will need to be retrieved and parsed. Not sure how highlighting comes into
play here...

-shay.banon

On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:



(Lukáš Vlček) #8

Imagine a search app for HR: a candidate catalog (cool name!).
The entities stored in the index are as follows: person: { id, name,
address, bio }
Now I am using just the REST API. Say I search for "Java" and I would like
to display a list of names and allow users to click an individual name,
which would display the whole bio with Java highlighted in it (this is where
highlighting comes into play!). Right now I can display the bio (just using
the GET REST API with the given document ID), but not highlighted. So I was
thinking that it would be cool to have this function.
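
The two-step flow described here can be sketched in Python with an in-memory
dict standing in for the index. Everything in it is hypothetical (the
CATALOG sample data, the function names, the substring matching); it only
illustrates "list names first, then show one whole bio highlighted".

```python
import re

# Made-up sample data standing in for indexed person documents.
CATALOG = {
    "1": {"name": "Ann", "address": "Prague",
          "bio": "Ten years of Java and Scala."},
    "2": {"name": "Bob", "address": "Zagreb",
          "bio": "Python developer."},
}


def search_names(catalog, term):
    """Step 1: return only (id, name) pairs for people whose bio matches."""
    needle = term.lower()
    return [(pid, person["name"])
            for pid, person in sorted(catalog.items())
            if needle in person["bio"].lower()]


def bio_highlighted(catalog, pid, term, pre="<em>", post="</em>"):
    """Step 2: fetch one person by id and mark every match in the whole bio."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: pre + m.group(0) + post, catalog[pid]["bio"])


print(search_names(CATALOG, "java"))
print(bio_highlighted(CATALOG, "1", "java"))
```

Step 2 needs the original search term again, which is why a plain GET by
document ID alone is not enough for this use case.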

Lukas

2010/8/12 Shay Banon shay.banon@elasticsearch.com

individual field to be (separately) stored, that would make a great
use/reuse of stored json _source (and no one would ever think of
disabling it :slight_smile:

Tomislav


(Shay Banon) #9

Ahh, I see. So you would still need to provide a query to the GET API in
order to do the highlighting, right?


(Lukáš Vlček) #10

If I want to display the whole bio highlighted, then I can either get "_source"
and cut bio from it on the client side, but in that case I need to tell ES to
highlight it first. Or I need to specify in the mapping that bio is
also stored and use a fields query
http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/ but
again I need to tell ES to highlight it. And in neither case do I want only
fragments; I want the WHOLE content of the field. The first approach is not
possible now; the latter is possible but requires bio to be explicitly stored
(even though it is already stored in _source).
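
The second option (bio stored as its own field) could be set up with a mapping along these lines. This is only a sketch: the type name person and the field names come from the example in this thread, "store" : "yes" is the setting Tomislav quoted, and the "string" type and the "term_vector" value are assumptions about a typical setup, so check them against the mapping docs for your version.

```python
import json

# Sketch of a mapping in which "bio" is stored separately from _source so
# that it can be returned through "fields" and highlighted on its own.
# Assumptions: the "string" type and the "term_vector" value are illustrative;
# only "store": "yes" is taken from the thread.
person_mapping = {
    "person": {
        "properties": {
            "name": {"type": "string", "store": "yes"},
            "bio": {
                "type": "string",
                "store": "yes",
                "term_vector": "with_positions_offsets",
            },
        }
    }
}

mapping_json = json.dumps(person_mapping, indent=2)
print(mapping_json)
```

The cost is the one Tomislav described: bio then lives in the index twice, once inside _source and once as a stored field.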

Hope this makes it clear. (Sorry if I confused you).

Lukas


(Lukáš Vlček) #11

Oh, and one more note, see below:

On Thu, Aug 12, 2010 at 9:22 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

If I want to display the whole bio highlighted, then I can either get "_source"
and cut bio from it on the client side, but in that case I need to tell ES to
highlight it first. Or I need to specify in the mapping that bio is
also stored and use a fields query
http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/ but
again I need to tell ES to highlight it. And in neither case do I want only
fragments; I want the WHOLE content of the field. The first approach is not
possible now; the latter is possible but requires bio to be explicitly stored
(even though it is already stored in _source).

And the latter also requires specifying a Fragmenter that returns the whole
body, not fragments, hence my reference to NullFragmenter, which is not
implemented in the FastVectorHighlighter API (as far as I understand it); it can
be found in the older Highlighting API, which is why I also opened
http://github.com/elasticsearch/elasticsearch/issues/issue/307

Maybe it would be better if the NullFragmenter-like functionality were
contributed directly to the Lucene FastVectorHighlighter API. I was looking at
the FVH API today and I think I can try to implement such a Fragmenter.
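
To make the NullFragmenter idea concrete, here is a minimal client-side sketch of the behaviour: the whole field text comes back with every matching term wrapped in tags, and nothing is cut into fragments. It is plain Python, not Lucene's Fragmenter API; the function name highlight_whole_field and the <em> tags are made up for illustration, and matching is a naive case-insensitive whole-word match rather than real analysis.

```python
import re


def highlight_whole_field(text, terms, pre_tag="<em>", post_tag="</em>"):
    """Return the full text with each whole-word occurrence of a term tagged.

    Hypothetical helper: it imitates "no fragmentation" highlighting on the
    client and does not use the analyzer the index would use.
    """
    if not terms:
        return text
    # Longest terms first so that overlapping alternatives prefer the longer one.
    alternatives = "|".join(
        re.escape(t) for t in sorted(terms, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, text)


bio = "The Dude writes Java and more Java."
print(highlight_whole_field(bio, ["dude", "java"]))
```

This only works on text the client already has, so it does not remove the need to ship the whole bio over the wire; it just shows what "highlight everything, fragment nothing" means.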

Hope this makes it clear. (Sorry if I confused you).

Lukas


(Shay Banon) #12

OK, so you want to get the whole bio field highlighted, so you would need to
pass the query to the GET API as well; otherwise, there is no way to
highlight it (aside from the other things you need, like the option to do no
fragmentation and getting the actual data).

On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček lukas.vlcek@gmail.com wrote:

Oh, and one more note, see below:

And the latter also requires specifying a Fragmenter that returns the whole
body, not fragments, hence my reference to NullFragmenter, which is not
implemented in the FastVectorHighlighter API (as far as I understand it); it can
be found in the older Highlighting API, which is why I also opened
http://github.com/elasticsearch/elasticsearch/issues/issue/307

Maybe it would be better if the NullFragmenter-like functionality were
contributed directly to the Lucene FastVectorHighlighter API. I was looking at
the FVH API today and I think I can try to implement such a Fragmenter.

Hope this makes it clear. (Sorry if I confused you).

Lukas


(Lukáš Vlček) #13

Yes, I did not realize it earlier, but you are right that I will need to
pass the query into the highlight section as well.
Take the following example:

I need to display all candidates that match the "dude java" query, and then I
want to let the user click on an individual name and get the whole bio
highlighted.

Here is how I can go about this:
First, I can get the relevant documents using a simple "query_string" query for
"dude java". I can now display the names of candidates without highlights, plus
highlighted fragments from the bio for each name: the kind of basic search
interface that already works now. But if I wanted to display a highlighted name
I would get something like "..e <em>Dude</em> Abid...", which is not what I want
(sure, I can play with the fragment size, but that is just a workaround and does
not fit all situations). So when using that "query_string" query I would
like to specify in the highlight section that person.name should be
highlighted with no fragments.
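
To make the difference concrete, here is a minimal client-side sketch in Python. It is only an illustration of the two behaviours, not how Lucene or ES implement highlighting; the function names and the `context` parameter are made up for this example, and matching is plain case-insensitive whole-word matching with no analysis.

```python
import re


def _term_pattern(terms):
    """Build a case-insensitive whole-word pattern for the given terms (hypothetical helper)."""
    if not terms:
        raise ValueError("at least one term is required")
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
                      re.IGNORECASE)


def highlight_whole(text, terms, pre="<em>", post="</em>"):
    """NullFragmenter-like behaviour: return the whole text with every match wrapped."""
    return _term_pattern(terms).sub(lambda m: pre + m.group(0) + post, text)


def highlight_fragment(text, terms, context=2, pre="<em>", post="</em>"):
    """Fragment behaviour: return only a small window around the first match, or None."""
    m = _term_pattern(terms).search(text)
    if m is None:
        return None
    start = max(0, m.start() - context)
    end = min(len(text), m.end() + context)
    prefix = ".." if start > 0 else ""
    suffix = ".." if end < len(text) else ""
    return (prefix + text[start:m.start()] + pre + m.group(0) + post
            + text[m.end():end] + suffix)


if __name__ == "__main__":
    name = "The Dude Abides"
    print(highlight_fragment(name, ["dude", "java"]))  # cut-off fragment
    print(highlight_whole(name, ["dude", "java"]))     # whole name, highlighted
```

For a short field like the name, the second form is what I am after: the complete value comes back and only the matched words are wrapped.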

Second, when the user clicks an individual name, I want to get the whole
bio highlighted.
So I need to get a specific document (by ID) and have the bio
field highlighted (and the name field as well).
An example of a query that could be used:

curl -XGET http://localhost:9200/_all/_search -d '
{ "query" : { "term" : { "person-id" : "1234" } },
  "highlight" : {
    "fields" : {
      "_source" : {
        "path" : "person.bio,person.name",
        "fragmenter" : "classpath.to.NullFragmenter",
        "query" : {
          "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
        }
      }
    }
  }
}'
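
The "path" value above is proposed syntax, not something ES supports today. As a rough sketch of what resolving such a comma-separated list of dotted paths against the parsed _source could look like, here is a small Python example. The function name and the sample document are assumptions made up for illustration. Note that the whole _source still has to be loaded and parsed before anything can be picked out of it.

```python
import json

_MISSING = object()  # sentinel so that a legitimate null value is not confused with "absent"


def extract_paths(source, paths):
    """Return {path: value} for each dotted path that exists in the parsed _source."""
    result = {}
    for path in (p.strip() for p in paths.split(",")):
        node = source
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = _MISSING
                break
        if node is not _MISSING:
            result[path] = node
    return result


if __name__ == "__main__":
    # Hypothetical stored _source for one candidate.
    raw = '{"person": {"id": "1234", "name": "The Dude", "bio": "Writes Java."}}'
    source = json.loads(raw)  # the full json is parsed even if only two fields are wanted
    print(extract_paths(source, "person.bio,person.name"))
```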

or I could use fields query:

curl -XGET http://localhost:9200/_all/_search -d '
{ "query" : { "term" : { "person-id" : "1234" } },
  "fields" : ["bio","name"],
  "highlight" : {
    "fields" : {
      "bio" : {
        "query" : {
          "query_string" : { "fields" : ["bio"], "query" : "dude java" }
        }
      },
      "name" : {
        "query" : {
          "query_string" : { "fields" : ["name"], "query" : "dude java" }
        }
      }
    }
  }
}'

The latter query requires both bio and name to be stored (and this is where
it gets back to Tomislav's original point, I think).
Ugh! I am complicating it way too much... but I hope the request is clear now
:slight_smile:

Regards,
Lukas


(Lukáš Vlček) #14

Sure I am complicating it too much, because in the latter query example I
forgot to specify the NullFragmenter :slight_smile:

Regards,
Lukas


(Shay Banon) #15

No problem, I understand the general idea of the requirement.

-shay.banon


(Tomislav Poljak) #16

Hi,
I'm not sure I fully understand what will be implemented as a
result of the discussion here, but I think I can define pretty clearly what I
would like to see implemented (from my point of view):

It would be great if ES, besides returning the whole document source (in json
format) in search results, supported returning a json-like structure with only
the 'matching fields' and/or the requested fields. Only fields that are matched
by the query or explicitly requested would be returned (taken from the _source
json), and this would be possible without storing each field separately. The
value of each returned field would be either the whole field value (with
highlighting applied) or a highlighting snippet (for large textual fields).
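
To show the shape of the response I have in mind, here is a minimal Python sketch that works on an already parsed, flat _source. It is only a client-side illustration under my own assumptions, not a proposal for how ES should implement it: the function name, the `snippet_threshold` and `context` parameters and the sample document are all made up, matching is plain case-insensitive whole-word matching, and a snippet highlights the first match only.

```python
import re


def matching_fields(source, terms, requested=(), snippet_threshold=200, context=40,
                    pre="<em>", post="</em>"):
    """Return only the fields that match one of the terms or were explicitly requested.

    Short matching fields come back whole with every match wrapped; fields longer
    than snippet_threshold come back as one snippet around the first match.
    """
    if not terms:
        raise ValueError("at least one term is required")
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
                         re.IGNORECASE)
    out = {}
    for field, value in source.items():
        match = pattern.search(value) if isinstance(value, str) else None
        if match is None:
            if field in requested:
                out[field] = value  # requested but not matched: returned as-is
            continue
        if len(value) <= snippet_threshold:
            out[field] = pattern.sub(lambda m: pre + m.group(0) + post, value)
        else:
            start = max(0, match.start() - context)
            end = min(len(value), match.end() + context)
            out[field] = (value[start:match.start()] + pre + match.group(0) + post
                          + value[match.end():end])
    return out


if __name__ == "__main__":
    # Hypothetical document with one large 'content' field.
    doc = {
        "title": "Java developer",
        "address": "Prague",
        "content": "x" * 300 + " dude " + "y" * 300,
    }
    print(matching_fields(doc, ["dude", "java"], requested=("address",)))
```

The large 'content' value never reaches the client in full; only the short snippet does, while the short 'title' is returned whole with the match highlighted.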

Will something like that be possible?

Thanks,
Tomislav

On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:

No problem, I understand the general idea of the requirement.

-shay.banon

On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček lukas.vlcek@gmail.com
wrote:

    On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček
    <lukas.vlcek@gmail.com> wrote:
            Yes, I did not realize that earlier but you are right
            that I will need to pass query into the highlight
            section as well.
            Take the following example:
            
            
            I need to display all candidates that match "dude
            java" query and then I want to allow user to click on
            individual name and get whole bio highlighted.
            
            
            So how I can go about this:
            First, I can get relevant documents using simple
            "query_string" query for "dude java". I can now
            display names of candidates without highlights and
            highlighted fragments from bio for each name, kind of
            basic search interface that already works now. But if
            I wanted to display highlighted name I would get
            something like "..e <em>Dude</em> Abid..." which is
            not what I want (sure, I can work with fragment size
            but that is just workaround and does not fit all
            situations). So when using that "query_string" query I
            would like to specify in the highlight section that
            the person.name should be highlighted with no
            fragments.
            
            
            Second, now, when the user clicks individual name,
            then I want to get whole bio highlighted.
            So I need to get specific document (by ID) and have
            the bio field highlighted (and the name field as well)
            The example of the query that could be used:
            
            curl -XGET http://localhost:9200/_all/_search -d '
            { "query" : { "term" : { "person-id" : "1234" } },
              "highlight" : {
                "fields" : {
                  "_source" : {
                    "path" : "person.bio,person.name",
                    "fragmenter" : "classpath.to.NullFragmenter",
                    "query" : {
                      "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
                    }
                  }
                }
              }
            }'
            
            
            or I could use fields query:
            
            
            curl -XGET http://localhost:9200/_all/_search -d '
            { "query" : { "term" : { "person-id" : "1234" } },
              "fields" : ["bio","name"],
              "highlight" : {
                "fields" : {
                  "bio" : {
                    "query" : {
                      "query_string" : { "fields" : ["bio"], "query" : "dude java" }
                    }
                  },
                  "name" : {
                    "query" : {
                      "query_string" : { "fields" : ["name"], "query" : "dude java" }
                    }
                  }
                }
              }
            }'
            
            
            The latter query requires both bio and name to be
            stored (and this is where it gets back to Tomislav's
            original point I think).
            Ugh! I am complicating it way too much... but hope the
            request is clear now :-)
    
    
    Sure I am complicating it too much because in the latter query
    example I forgot to specify NullFragmenter :-)
    
     
            
            
            Regards,
            Lukas
            
            
            
            2010/8/13 Shay Banon <shay.banon@elasticsearch.com>
            
            
                    ok, so you want to get the whole bio field
                    highlighted, so you would need to pass the
                    query to the get API as well, otherwise, there
                    is no way to highlight it (aside from other
                    things you need, like the option to do no
                    fragmentation and getting the actual data).
                    
                    
                    
                    On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček
                    <lukas.vlcek@gmail.com> wrote:
                            Oh, and one more note, see below:
                            
                            On Thu, Aug 12, 2010 at 9:22 PM, Lukáš
                            Vlček <lukas.vlcek@gmail.com> wrote:
                                    If I want to display whole bio highlighted then I can either get "_source" and cut bio from it on the client side, but in this case I need to tell ES to use highlighting on it first.
                                    Or I need to specify in mapping that bio is also stored and use fields query http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/ but again I need to tell ES to highlight it.
                                    And in neither case do I want only fragments, I want the WHOLE content of the field. The first approach is not possible now; the latter is possible but requires bio to be explicitly stored (and it is already stored in _source).
                            
                            
                            And the latter also requires
                            specification of Fragmenter that
                            returns whole body, not fragments,
                            thus my reference to NullFragmenter,
                            which is not implemented in
                            FastVectorHighlighter API (as far as I
                            understand it), it can be found in the
                            older Highlighting API, thus I opened
                            also http://github.com/elasticsearch/elasticsearch/issues/issue/307 
                            
                            
                            Maybe it would be better if the
                            NullFragmenter-like functionality is
                            contributed directly into Lucene
                            FastVectorHighlighter API. I was
                            looking at the FVH API today and I
                            think I can try to implement such
                            Fragmenter.
                            
                            
                            
                                    
                                    
                                    Hope this makes it clear.
                                    (Sorry if I confused you).
                                    
                                    
                                    
                                    Lukas
                                    
                                    2010/8/12 Shay Banon
                                    <shay.banon@elasticsearch.com>
                                            Ahh, I see. So you
                                            would still need to
                                            provide a query to the
                                            GET api in order to do
                                            the highlighting,
                                            right?
                                            
                                            
                                            
                                            On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
                                                    Imagine search app for HR: Candidate catalog (cool name!).
                                                    The entities stored in the index are as follows:
                                                    person: { id, name, address, bio }
                                                    Now I am using just the REST API. Say I search for "Java" and I would like to display list of Names and allow users to click individual name which would display whole bio with Java highlighted in it (here comes the highlighting in the play!).
                                                    Now I can display bio (just using GET REST API with given document ID) but not highlighted. So I was thinking that it would be cool to have this function.
                                                    
                                                    
                                                    Lukas
                                                    
                                                    2010/8/12 Shay Banon <shay.banon@elasticsearch.com>


                                                            So, what you want is to be able to get just the bio field, without the full source, and without the bio field being stored? If so, then the response I gave, where the logic might apply also to get fields using something like "source_field" notion applies here. It does mean that the full source will need to be retrieved and parsed. Not sure how highlighting comes into play here...
                                                            
                                                            
                                                            -shay.banon
                                                            
                                                            
                                                            
                                                            On
                                                            Thu,
                                                            Aug
                                                            12,
                                                            2010
                                                            at
                                                            9:53
                                                            PM,
                                                            Lukáš
                                                            Vlček
                                                            <lukas.vlcek@gmail.com> wrote:
                                                                    Actually, that ticket has two parts. One is Fragmenter related and the other one is possibility to tell, that I want to highlight some portion of _source data. Imagine I am using only REST API and for example if _source is a person with name, address and bio fields then I would like to tell that I want to highlight just the bio field (and I think the NullFragmenter would be needed for this if I want to display whole content of bio highlighted, not just fragments). The other possibility would be to define mapping for person in such a way that bio would be a stored field, then I could query for stored fields (not pulling the _source field) and tell the I want to apply NullFragmenter to this data while highlighting. But this gets back to the Tomislav's situation, because this would mean that bio is probably stored twice, once as a part of source and then separately as a stored bio field.
                                                                    
                                                                    
                                                                    Lukas
                                                                    
                                                                    
                                                                    
                                                                    On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:
                                                                            Not sure if it overlaps, fragmenter controls how to break the highlighted data, this relates to how to fetch that data to highlight.
                                                                            
                                                                            
                                                                            -shay.banon
                                                                            
                                                                            
                                                                            
                                                                            On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
                                                                                    One of differences is that the 308 issue was meant to return whole content of the _source or some of its fields (or stored fields if not using "_source"). But the point is that the user should be able to specify Fragmenter type (or provide custom implementation of Fragmenter).
                                                                                    
                                                                                    
                                                                                    
                                                                                    On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
                                                                                            If I read it correctly then I think it partly overlaps with http://github.com/elasticsearch/elasticsearch/issues/issue/308
                                                                                            
                                                                                            
                                                                                            Lukas
                                                                                            
                                                                                            
                                                                                            
                                                                                            On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:
                                                                                                    Agreed. A bit tricky to implement, but possible. Also, note that this will require loading the full json from the index, and parse it in order to get the relevant parts from it. It won't be returned, but still loaded. So, it might not make sense when you have several big fields in the index, and you want to get fragments for one of them. But does make sense when having one big field.
                                                                                                    
                                                                                                    
                                                                                                    Also, if this is implemented, it should also be possible to get specific fields out of the json as a response as well (similar to asking for specific fields in the search request, maybe call them source_fields).
                                                                                                    
                                                                                                    
                                                                                                    Open an issue for this?
                                                                                                    
                                                                                                    
                                                                                                    -shay.banon
                                                                                                    
                                                                                                    
                                                                                                    

(Tomislav Poljak) #17

Hi,
actually this requirement can be summarized even more briefly:

It would be really (really) great if ES could provide the 'highlighting' and
'fields' features of the search API without the need for each field to
be stored separately (by reusing the stored json _source field).
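
For the 'fields' half of this, a minimal Python sketch of what I mean by reusing _source could look like the following. The function name `extract_fields` and the dotted-path notation (as in "person.bio") are assumptions for illustration only, not an existing ES feature. As Shay pointed out, the whole json still has to be loaded and parsed; only the response gets smaller.

```python
import json


def extract_fields(source_json, paths):
    """Parse the stored _source JSON and return only the requested
    dotted paths (for example "person.bio")."""
    # The complete document is parsed here, even if only one field is wanted.
    source = json.loads(source_json)
    result = {}
    for path in paths:
        node = source
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                break  # path does not exist in this document
            node = node[key]
        else:
            # Reached only when every key of the path was found.
            result[path] = node
    return result
```

So for a person document, asking for `["person.bio", "person.name"]` would return just those two values and skip the address, without bio or name being stored as separate fields.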

Do you think this would be possible?

Tomislav

On Fri, 2010-08-13 at 16:30 +0200, Tomislav Poljak wrote:

Hi,
I'm not sure I fully understand what will be implement as a
result/conclusion of discussion here, but I think I can define what I
would like to be implemented (from my point of view) pretty clearly as:

It would be great if ES, beside returning whole document source (in json
format) in search results, supported returning json type structure with
'matching fields' and/or requested fields. Only fields which are matched
by a query or requested would be returned (from _source json) and this
would be possible without storing each field separately. Value in these
fields would be either a whole field value (with highlighting applied)
or a highlighting snippet (for large textual fields).

Will something like that be possible?

Thanks,
Tomislav

On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:

No problem, I understand the general idea of the requirement.

-shay.banon

On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček lukas.vlcek@gmail.com
wrote:

    On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček
    <lukas.vlcek@gmail.com> wrote:
            Yes, I did not realize that earlier but you are right
            that I will need to pass query into the highlight
            section as well.
            Take the following example:
            
            
            I need to display all candidates that match "dude
            java" query and then I want to allow user to click on
            individual name and get whole bio highlighted.
            
            
            So how I can go about this:
            First, I can get relevant documents using simple
            "query_string" query for "dude java". I can now
            display names of candidates without highlights and
            highlighted fragments from bio for each name, kind of
            basic search interface that already works now. But if
            I wanted to display highlighted name I would get
            something like "..e <em>Dude</em> Abid..." which is
            not what I want (sure, I can work with fragment size
            but that is just workaround and does not fit all
            situations). So when using that "query_string" query I
            would like to specify in the highlight section that
            the person.name should be highlighted with no
            fragments.
            
            
            Second, now, when the user clicks individual name,
            then I want to get whole bio highlighted.
            So I need to get specific document (by ID) and have
            the bio field highlighted (and the name field as well)
            The example of the query that could be used:
            
            curl -XGET http://localhost:9200/_all/_search -d '
            { "query" : { "term" : { "person-id" : "1234" } },
              "highlight" : {
                "fields" : {
                  "_source" : {
                    "path" : "person.bio,person.name",
                    "fragmenter" : "classpath.to.NullFragmenter",
                    "query" : {
                      "query_string" : { "fields" :
            ["bio","name"], "query" : "dude java" }
                    }
                  }
                }
              }
            }'
            
            
            or I could use fields query:
            
            
            curl -XGET http://localhost:9200/_all/_search -d '
            { "query" : { "term" : { "person-id" : "1234" } },
              "fields" : ["bio","name"],
              "highlight" : {
                "fields" : {
                  "bio" : {
                    "query" : {
                      "query_string" : { "fields" : ["bio"], "query" : "dude java" }
                    }
                  },
                  "name" : {
                    "query" : {
                      "query_string" : { "fields" : ["name"], "query" : "dude java" }
                    }
                  }
                }
              }
            }'
            
            
            The latter query requires both bio and name to be
            stored (and this is where it gets back to Tomislav's
            original point, I think).
            Ugh! I am complicating it way too much... but I hope the
            request is clear now :-)


    Sure I am complicating it too much, because in the latter query
    example I forgot to specify NullFragmenter :-)
    
     
            
            
            Regards,
            Lukas
            
            
            
            2010/8/13 Shay Banon <shay.banon@elasticsearch.com>
            
            
                    ok, so you want to get the whole bio field
                    highlighted, so you would need to pass the
                    query to the get API as well, otherwise, there
                    is no way to highlight it (aside from other
                    things you need, like the option to do no
                    fragmentation and getting the actual data).
                    
                    
                    
                    On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček
                    <lukas.vlcek@gmail.com> wrote:
                            Oh, and one more note, see below:
                            
                            On Thu, Aug 12, 2010 at 9:22 PM, Lukáš
                            Vlček <lukas.vlcek@gmail.com> wrote:
                                    If I want to display the whole bio
                                    highlighted then I can either
                                    get "_source" and cut bio from
                                    it on the client side, but in
                                    this case I need to tell ES to
                                    use highlighting on it first.
                                    Or I need to specify in the
                                    mapping that bio is also
                                    stored and use a fields
                                    query http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/ but again I need to tell ES to highlight it. And in neither case do I want only fragments, I want the WHOLE content of the field. The first approach is not possible now; the latter is possible but requires bio to be explicitly stored (and it is already stored in _source).
                            
                            
                            And the latter also requires
                            specification of a Fragmenter that
                            returns the whole body, not fragments,
                            thus my reference to NullFragmenter,
                            which is not implemented in the
                            FastVectorHighlighter API (as far as I
                            understand it); it can be found in the
                            older Highlighting API, thus I also
                            opened http://github.com/elasticsearch/elasticsearch/issues/issue/307


                            Maybe it would be better if the
                            NullFragmenter-like functionality were
                            contributed directly into the Lucene
                            FastVectorHighlighter API. I was
                            looking at the FVH API today and I
                            think I can try to implement such a
                            Fragmenter.
                            
                            
                            
                                    
                                    
                                    Hope this makes it clear.
                                    (Sorry if I confused you).
                                    
                                    
                                    
                                    Lukas
                                    
                                    2010/8/12 Shay Banon
                                    <shay.banon@elasticsearch.com>
                                            Ahh, I see. So you
                                            would still need to
                                            provide a query to the
                                            GET api in order to do
                                            the highlighting,
                                            right?
                                            
                                            
                                            
                                            On Thu, Aug 12, 2010
                                            at 10:08 PM, Lukáš
                                            Vlček
                                            <lukas.vlcek@gmail.com> wrote:
                                                    Imagine a search app for HR: Candidate catalog (cool name!).
                                                    The entities stored in the index are as follows:
                                                    person: { id, name, address, bio }
                                                    Now I am using just the REST API. Say I search for "Java" and I would like to display a list of names and allow users to click an individual name, which would display the whole bio with Java highlighted in it (this is where highlighting comes into play!). Now I can display the bio (just using the GET REST API with a given document ID) but not highlighted. So I was thinking that it would be cool to have this function.


                                                    Lukas

                                                    2010/8/12 Shay Banon <shay.banon@elasticsearch.com>
                                                    
                                                    
                                                            So, what you want is to be able to get just the bio field, without the full source, and without the bio field being stored? If so, then the response I gave applies here as well: the same logic could also be used to get fields, using something like a "source_field" notion. It does mean that the full source will need to be retrieved and parsed. Not sure how highlighting comes into play here...
                                                            
                                                            
                                                            -shay.banon
                                                            
                                                            
                                                            
                                                            On
                                                            Thu,
                                                            Aug
                                                            12,
                                                            2010
                                                            at
                                                            9:53
                                                            PM,
                                                            Lukáš
                                                            Vlček
                                                            <lukas.vlcek@gmail.com> wrote:
                                                                    Actually, that ticket has two parts. One is Fragmenter related and the other one is the possibility to tell ES that I want to highlight some portion of the _source data. Imagine I am using only the REST API and, for example, _source is a person with name, address and bio fields; then I would like to tell ES that I want to highlight just the bio field (and I think the NullFragmenter would be needed for this if I want to display the whole content of bio highlighted, not just fragments). The other possibility would be to define the mapping for person in such a way that bio would be a stored field; then I could query for stored fields (not pulling the _source field) and tell ES that I want to apply NullFragmenter to this data while highlighting. But this gets back to Tomislav's situation, because this would mean that bio is probably stored twice, once as a part of the source and then separately as a stored bio field.
                                                                    
                                                                    
                                                                    Lukas
                                                                    
                                                                    
                                                                    
                                                                    On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:
                                                                            Not sure if it overlaps: the fragmenter controls how to break up the highlighted data, while this relates to how to fetch the data to highlight.
                                                                            
                                                                            
                                                                            -shay.banon
                                                                            
                                                                            
                                                                            
                                                                            On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
                                                                                    One of the differences is that issue 308 was meant to return the whole content of the _source or some of its fields (or stored fields if not using "_source"). But the point is that the user should be able to specify the Fragmenter type (or provide a custom implementation of Fragmenter).
                                                                                    
                                                                                    
                                                                                    
                                                                                    On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
                                                                                            If I read it correctly then I think it partly overlaps with http://github.com/elasticsearch/elasticsearch/issues/issue/308
                                                                                            
                                                                                            
                                                                                            Lukas
                                                                                            
                                                                                            
                                                                                            
                                                                                            On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:
                                                                                                    Agreed. A bit tricky to implement, but possible. Also, note that this will require loading the full json from the index and parsing it in order to get the relevant parts from it. It won't be returned, but it will still be loaded. So, it might not make sense when you have several big fields in the index and you want to get fragments for one of them. But it does make sense when you have one big field.


                                                                                                    Also, if this is implemented, it should also be possible to get specific fields out of the json as a response (similar to asking for specific fields in the search request, maybe call them source_fields).
                                                                                                    
                                                                                                    
                                                                                                    Open an issue for this?
                                                                                                    
                                                                                                    
                                                                                                    -shay.banon
                                                                                                    
                                                                                                    
                                                                                                    

(Shay Banon) #18

Yeah, the discussion took a different turn from the original request. Yes, it
is possible, with the mentioned downsides: the full source needs to be loaded
and parsed in the "fetch" phase on the specific node. If that causes
performance problems, it's nothing that can't be solved by adding more
replicas, and you can do that dynamically in the upcoming 0.9.1. Can you open a
feature request for it?
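
To make the cost described above concrete, here is a minimal Python sketch of what "load the full source and parse it" could look like when only a few fields are wanted. It is not Elasticsearch code: the function name `extract_source_fields`, the dotted-path notation and the sample document are all assumptions made up for illustration.

```python
import json

def extract_source_fields(source_json, paths):
    """Parse the full stored _source and keep only the requested dotted paths.

    Hypothetical helper: the whole JSON document is still loaded and parsed,
    even though only a few fields end up in the result.
    """
    source = json.loads(source_json)
    result = {}
    for path in paths:
        node = source
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                node = None  # path does not exist in this document
                break
            node = node[key]
        if node is not None:  # missing (or null) fields are simply skipped
            result[path] = node
    return result

# Made-up sample document shaped like the person example in this thread.
doc = '{"person": {"id": "1234", "name": "Dude Abid", "address": "n/a", "bio": "Java dude"}}'
fields = extract_source_fields(doc, ["person.name", "person.bio", "person.missing"])
print(fields)
```

The point of the sketch is only that the large field is read and parsed on the node even when it is not returned, which is the downside mentioned above.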

On Fri, Aug 13, 2010 at 5:47 PM, Tomislav Poljak tpoljak@gmail.com wrote:

Hi,
actually this requirement can be summarized even further:

It would be really (really) great if ES could provide 'highlighting' and
'fields' features from the search API without the need for each field to
be stored separately (by reusing stored json _source field).

Do you think this would be possible?

Tomislav

On Fri, 2010-08-13 at 16:30 +0200, Tomislav Poljak wrote:

Hi,
I'm not sure I fully understand what will be implemented as a
result/conclusion of the discussion here, but I think I can define what I
would like to be implemented (from my point of view) pretty clearly as:

It would be great if ES, besides returning the whole document source (in json
format) in search results, supported returning a json-type structure with
'matching fields' and/or requested fields. Only fields which are matched
by a query or requested would be returned (from the _source json) and this
would be possible without storing each field separately. The value in these
fields would be either the whole field value (with highlighting applied)
or a highlighting snippet (for large textual fields).

Will something like that be possible?

Thanks,
Tomislav
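
The "whole field value with highlighting applied, or a snippet for large fields" behaviour described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses plain regular expressions instead of Lucene's highlighter, and the names `highlight_whole`, `highlight_snippet`, `highlight_field` and the `max_whole`/`context` thresholds are invented here.

```python
import re

def _pattern(terms):
    # Whole-word, case-insensitive match on any term (assumes terms is non-empty).
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
                      re.IGNORECASE)

def highlight_whole(text, terms, pre="<em>", post="</em>"):
    """Highlight every match and return the full text, with no fragmenting."""
    return _pattern(terms).sub(lambda m: pre + m.group(0) + post, text)

def highlight_snippet(text, terms, context=20, pre="<em>", post="</em>"):
    """Return one snippet around the first match, or None if nothing matches."""
    m = _pattern(terms).search(text)
    if m is None:
        return None
    start = max(0, m.start() - context)
    end = min(len(text), m.end() + context)
    return ("..." + text[start:m.start()] + pre + m.group(0) + post
            + text[m.end():end] + "...")

def highlight_field(text, terms, max_whole=200):
    """Short values come back whole; large values come back as a snippet."""
    if len(text) <= max_whole:
        return highlight_whole(text, terms)
    return highlight_snippet(text, terms)

name = highlight_field("Dude Abid", ["dude", "java"])
long_bio = "x" * 300 + " Java " + "y" * 300  # stand-in for a large 'content' field
bio = highlight_field(long_bio, ["dude", "java"])
print(name)
print(bio)
```

The short `name` value is returned whole with the match wrapped in tags, while the large `bio` value is cut down to a window around the first match, so neither needs to be shipped to the client in full.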

On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:

No problem, I understand the general idea of the requirement.

-shay.banon

On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček lukas.vlcek@gmail.com
wrote:

    On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček
    <lukas.vlcek@gmail.com> wrote:
            Yes, I did not realize that earlier but you are right
            that I will need to pass query into the highlight
            section as well.
            Take the following example:


            I need to display all candidates that match "dude
            java" query and then I want to allow user to click on
            individual name and get whole bio highlighted.


            So here is how I can go about this:
            First, I can get relevant documents using a simple
            "query_string" query for "dude java". I can now
            display names of candidates without highlights and
            highlighted fragments from bio for each name, a kind of
            basic search interface that already works now. But if
            I wanted to display a highlighted name I would get
            something like "..e <em>Dude</em> Abid..." which is
            not what I want (sure, I can work with fragment size
            but that is just a workaround and does not fit all
            situations). So when using that "query_string" query I
            would like to specify in the highlight section that
            the person.name should be highlighted with no
            fragments.


            Second, when the user clicks an individual name,
            I want to get the whole bio highlighted.
            So I need to get a specific document (by ID) and have
            the bio field highlighted (and the name field as well).
            An example of the query that could be used:

            curl -XGET http://localhost:9200/_all/_search -d '
            { "query" : { "term" : { "person-id" : "1234" } },
              "highlight" : {
                "fields" : {
                  "_source" : {
                    "path" : "person.bio,person.name",
                    "fragmenter" : "classpath.to.NullFragmenter",
                    "query" : {
                      "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
                    }
                  }
                }
              }
            }'


            or I could use fields query:


            curl -XGET http://localhost:9200/_all/_search -d '
            { "query" : { "term" : { "person-id" : "1234" } },
              "fields" : ["bio","name"],
              "highlight" : {
                "fields" : {
                  "bio" : {
                    "query" : {
                      "query_string" : { "fields" : ["bio"], "query" : "dude java" }
                    }
                  },
                  "name" : {
                    "query" : {
                      "query_string" : { "fields" : ["name"], "query" : "dude java" }
                    }
                  }
                }
              }
            }'


            The latter query requires both bio and name to be
            stored (and this is where it gets back to Tomislav's
            original point, I think).
            Ugh! I am complicating it way too much... but I hope the
            request is clear now :-)


    Sure I am complicating it too much, because in the latter query
    example I forgot to specify NullFragmenter :-)




            Regards,
            Lukas



            2010/8/13 Shay Banon <shay.banon@elasticsearch.com>


                    ok, so you want to get the whole bio field
                    highlighted, so you would need to pass the
                    query to the get API as well, otherwise, there
                    is no way to highlight it (aside from other
                    things you need, like the option to do no
                    fragmentation and getting the actual data).



                    On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček
                    <lukas.vlcek@gmail.com> wrote:
                            Oh, and one more note, see below:

                            On Thu, Aug 12, 2010 at 9:22 PM, Lukáš
                            Vlček <lukas.vlcek@gmail.com> wrote:
                                    If I want to display the whole bio
                                    highlighted then I can either
                                    get "_source" and cut bio from
                                    it on the client side, but in
                                    this case I need to tell ES to
                                    use highlighting on it first.
                                    Or I need to specify in the
                                    mapping that bio is also
                                    stored and use a fields
                                    query http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/ but again I need to tell ES to highlight it. And in neither case do I want only fragments, I want the WHOLE content of the field. The first approach is not possible now; the latter is possible but requires bio to be explicitly stored (and it is already stored in _source).

                            And the latter also requires specifying a Fragmenter
                            that returns the whole body, not fragments, thus my
                            reference to NullFragmenter, which is not implemented
                            in the FastVectorHighlighter API (as far as I
                            understand it). It can be found in the older
                            Highlighting API, thus I also opened
                            http://github.com/elasticsearch/elasticsearch/issues/issue/307

                            Maybe it would be better if the NullFragmenter-like
                            functionality were contributed directly to the Lucene
                            FastVectorHighlighter API. I was looking at the FVH
                            API today and I think I can try to implement such a
                            Fragmenter.





                                    Hope this makes it clear.
                                    (Sorry if I confused you).



                                    Lukas

                                    2010/8/12 Shay Banon
                                    <shay.banon@elasticsearch.com>
                                            Ahh, I see. So you
                                            would still need to
                                            provide a query to the
                                            GET api in order to do
                                            the highlighting,
                                            right?



                                            On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček
                                            <lukas.vlcek@gmail.com> wrote:

                                                    Imagine a search app for HR: Candidate catalog (cool name!).
                                                    The entities stored in the index are as follows:
                                                    person: { id, name, address, bio }
                                                    Now I am using just the REST API. Say I search for "Java"
                                                    and I would like to display a list of names and allow users
                                                    to click an individual name, which would display the whole
                                                    bio with Java highlighted in it (here highlighting comes
                                                    into play!).
                                                    Now I can display the bio (just using the GET REST API
                                                    with a given document ID), but not highlighted.
                                                    So I was thinking that it would be cool to have this
                                                    function.


                                                    Lukas

                                                    2010/8/12 Shay Banon <shay.banon@elasticsearch.com>

                                                            So, what you want is to be able to get just the bio field,
                                                            without the full source, and without the bio field being
                                                            stored? If so, then the response I gave, where the logic
                                                            might apply also to get fields using something like
                                                            "source_field" notion applies here. It does mean that the
                                                            full source will need to be retrieved and parsed. Not sure
                                                            how highlighting comes into play here...

                                                            -shay.banon

                                                            On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček
                                                            <lukas.vlcek@gmail.com> wrote:

Actually, that ticket has two parts. One is Fragmenter related and the other
is the ability to say that I want to highlight some portion of the _source
data. Imagine I am using only the REST API: if _source is a person with
name, address and bio fields, then I would like to say that I want to
highlight just the bio field (and I think the NullFragmenter would be needed
for this if I want to display the whole content of bio highlighted, not just
fragments). The other possibility would be to define the mapping for person
in such a way that bio would be a stored field; then I could query for
stored fields (not pulling the _source field) and say that I want to apply
NullFragmenter to this data while highlighting. But this gets back to
Tomislav's situation, because this would mean that bio is probably stored
twice, once as a part of the source and then separately as a stored bio
field.

Lukas

On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon shay.banon@elasticsearch.com
wrote:

    Not sure if it overlaps: the fragmenter controls how to break up the
    highlighted data, while this relates to how to fetch the data to highlight.

    -shay.banon
    On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <

lukas.vlcek@gmail.com> wrote:

            One of the differences is that the 308 issue was meant to return the
            whole content of the _source or some of its fields (or stored fields if
            not using "_source"). But the point is that the user should be able to
            specify the Fragmenter type (or provide a custom implementation of
            Fragmenter).

            On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <

lukas.vlcek@gmail.com> wrote:

                    If I read it correctly then I think it partly overlaps with
                    http://github.com/elasticsearch/elasticsearch/issues/issue/308

                    Lukas
                    On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <

shay.banon@elasticsearch.com> wrote:

                            Agreed. A bit tricky to implement, but possible. Also, note
                            that this will require loading the full json from the index
                            and parsing it in order to get the relevant parts from it.
                            It won't be returned, but it is still loaded. So, it might
                            not make sense when you have several big fields in the
                            index and you want to get fragments for one of them. But it
                            does make sense when having one big field.

                            Also, if this is implemented, it should also be possible to
                            get specific fields out of the json as a response as well
                            (similar to asking for specific fields in the search
                            request, maybe call them source_fields).

                            Open an issue for this?
                            -shay.banon
                            On Thu, Aug 12, 2010 at 6:06 PM, Tomislav

Poljak tpoljak@gmail.com wrote:

                                    Hi,
                                    I really like all the features stored json (enabled _source) provides in
                                    both Java API (used it in indexing/updating) and REST API (used it for
                                    searching).

                                    I do however have one possible request for improvement regarding
                                    documents with large textual fields and overall highlighting.

                                    When there is a requirement to index/search documents with large textual
                                    fields (like 'content' with text in Mb, which is not unusual), returning
                                    a whole json for each result in result set can be impossible (if each
                                    json document has a few Mb in 'content' returning 30-50 results to a
                                    client doesn't sound realistic or even possible in acceptable time with
                                    usual 'Internet' bandwidth).

                                    But, usually it's acceptable (or even requirement) to display/return
                                    only highlighting snippets for 30-50 results matches and retrieve whole
                                    document (json source) only for a single document (when requested by
                                    exact ID).

                                    To be able to provide highlighting snippets for large textual
                                    ('content') field, it needs to be stored.

                                    Now we are in situation where because textual field is too big (makes
                                    impossible to return json source for 30-50 results to a client) we need
                                    to store it twice in index (once as a part of original json source in
                                    _source and second time as a 'content' field "store" : "yes" for
                                    highlighting). This makes index a lot bigger.

                                    Also, if there is a requirement to display highlighting for all fields
                                    (separate highlight snippets for each field where match occurred,
                                    without mixing fields snippets -> stored _all field can not be used for
                                    highlighting in such case) then whole document (all fields) needs to be
                                    stored twice.

                                    In this case seems only logical to disable _source field (since all
                                    fields are stored anyway) and when whole document needs to be retrieved
                                    use (newly added) fields=* feature (I've read similar discussion thread
                                    which led to this enhancement)

                                    Here is my question/proposal:

                                    would it be possible to enable use of json _source field for 'field
                                    specific' highlighting, where matching snippet needs to be returned
                                    separately for each field name?

                                    Maybe to have term_vector for each field, but to somehow 'adjust' or
                                    recalculate positions_offsets to point to text snippet in _source
                                    instead of stored field?

                                    I know this is not a simple requirement, but if 'field specific'
                                    highlighting could somehow use stored json instead of requiring an
                                    individual field to be (separately) stored, that would make a great
                                    use/reuse of stored json _source (and no one would ever think of
                                    disabling it :)
                                    Tomislav

(Tomislav Poljak) #19

Hi,
I've opened issue 319 for it here:

http://github.com/elasticsearch/elasticsearch/issues/issue/319

Hope this is fine,

 Tomislav

On Fri, 2010-08-13 at 18:00 +0300, Shay Banon wrote:

Yea, the discussion took a different turn from the original request.
Yes, it is possible, with the mentioned downside of needing to load
the full source and parse it in the "fetch" phase on the specific
node. If that causes performance problems, it can be solved by adding
more replicas, which you can do dynamically in the upcoming 0.9.1.
Can you open a feature request for it?
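
As a rough picture of the "load the full source and parse it" cost mentioned above, here is a minimal Python sketch. It is not how Elasticsearch implements anything: the function name `extract_source_fields`, the dotted-path convention, and the sample document are all assumptions for illustration. It shows that even when only one or two fields are returned, the whole stored JSON has to be parsed first.

```python
import json


def extract_source_fields(source_json, paths):
    """Return only the requested dotted paths from a stored JSON document.

    Hypothetical helper: the entire document is parsed even if a single
    small field is requested, which is the downside discussed in the thread.
    Paths that are missing (or whose value is null) are left out.
    """
    source = json.loads(source_json)  # full parse of the stored source
    result = {}
    for path in paths:
        node = source
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = None
                break
        if node is not None:
            result[path] = node
    return result


# Made-up document shaped like the person example used in this thread.
doc = (
    '{"person": {"id": "1234", "name": "The Dude Abides", '
    '"address": "LA", "bio": "Java dude"}}'
)
fields = extract_source_fields(doc, ["person.name", "person.bio", "person.phone"])
print(fields)
```

With one big field per document this extra parse is probably acceptable; with several big fields it is paid for all of them even when only one is wanted.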

On Fri, Aug 13, 2010 at 5:47 PM, Tomislav Poljak tpoljak@gmail.com
wrote:
Hi,
actually, this requirement can be summarized even further:

    It would be really (really) great if ES could provide the
    'highlighting' and 'fields' features of the search API without
    the need for each field to be stored separately (by reusing the
    stored json _source field).
    
    Do you think this would be possible?
    
    Tomislav
    
    
    
    On Fri, 2010-08-13 at 16:30 +0200, Tomislav Poljak wrote:
    > Hi,
    > I'm not sure I fully understand what will be implemented as a
    > result/conclusion of the discussion here, but I think I can define
    > pretty clearly what I would like to be implemented (from my point
    > of view):
    >
    > It would be great if ES, besides returning the whole document source
    > (in json format) in search results, supported returning a json-type
    > structure with 'matching fields' and/or requested fields. Only fields
    > which are matched by a query or requested would be returned (from the
    > _source json), and this would be possible without storing each field
    > separately. The value in these fields would be either the whole field
    > value (with highlighting applied) or a highlighting snippet (for
    > large textual fields).
    >
    > Will something like that be possible?
    >
    > Thanks,
    >       Tomislav
    >
    >
    > On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:
    > > No problem, I understand the general idea of the requirement.
    > >
    > >
    > > -shay.banon
    > >
    > > On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
    > >
    > >
    > >
    > >         On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček
    > >         <lukas.vlcek@gmail.com> wrote:
    > >                 Yes, I did not realize that earlier, but you are right
    > >                 that I will need to pass the query into the highlight
    > >                 section as well.
    > >                 Take the following example:
    > >
    > >
    > >                 I need to display all candidates that match the "dude
    > >                 java" query, and then I want to allow the user to click
    > >                 on an individual name and get the whole bio highlighted.
    > >
    > >
    > >                 So here is how I can go about this:
    > >                 First, I can get relevant documents using a simple
    > >                 "query_string" query for "dude java". I can now
    > >                 display names of candidates without highlights and
    > >                 highlighted fragments from bio for each name, the kind
    > >                 of basic search interface that already works now. But
    > >                 if I wanted to display a highlighted name I would get
    > >                 something like "..e <em>Dude</em> Abid...", which is
    > >                 not what I want (sure, I can work with fragment size,
    > >                 but that is just a workaround and does not fit all
    > >                 situations). So when using that "query_string" query I
    > >                 would like to specify in the highlight section that
    > >                 person.name should be highlighted with no fragments.
    > >
    > >
    > >                 Second, when the user clicks an individual name, I
    > >                 want to get the whole bio highlighted.
    > >                 So I need to get a specific document (by ID) and have
    > >                 the bio field highlighted (and the name field as well).
    > >                 An example of a query that could be used:
    > >
    > >                 curl -XGET http://localhost:9200/_all/_search -d '
    > >                 { "query" : { "term" : { "person-id" : "1234" } },
    > >                   "highlight" : {
    > >                     "fields" : {
    > >                       "_source" : {
    > >                         "path" : "person.bio,person.name",
    > >                         "fragmenter" : "classpath.to.NullFragmenter",
    > >                         "query" : {
    > >                           "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
    > >                         }
    > >                       }
    > >                     }
    > >                   }
    > >                 }'
    > >
    > >
    > >                 or I could use fields query:
    > >
    > >
    > >                 curl -XGET http://localhost:9200/_all/_search -d '
    > >                 { "query" : { "term" : { "person-id" : "1234" } },
    > >                   "fields" : ["bio","name"],
    > >                   "highlight" : {
    > >                     "fields" : {
    > >                       "bio" : {
    > >                         "query" : {
    > >                           "query_string" : { "fields" : ["bio"], "query" : "dude java" }
    > >                         }
    > >                       },
    > >                       "name" : {
    > >                         "query" : {
    > >                           "query_string" : { "fields" : ["name"], "query" : "dude java" }
    > >                         }
    > >                       }
    > >                     }
    > >                   }
    > >                 }'
    > >
    > >
    > >                 The latter query requires both bio and name to be
    > >                 stored (and this is where it gets back to Tomislav's
    > >                 original point, I think).
    > >                 Ugh! I am complicating it way too much... but I hope
    > >                 the request is clear now :-)
    > >
    > >
    > >         Sure I am complicating it too much, because in the latter
    > >         query example I forgot to specify NullFragmenter :-)
    > >
    > >
    > >
    > >
    > >                 Regards,
    > >                 Lukas
    > >
    > >
    > >
    > >                 2010/8/13 Shay Banon <shay.banon@elasticsearch.com>
    > >
    > >
    > >                         ok, so you want to get the whole bio field
    > >                         highlighted, so you would need to pass the
    > >                         query to the get API as well, otherwise, there
    > >                         is no way to highlight it (aside from other
    > >                         things you need, like the option to do no
    > >                         fragmentation and getting the actual data).
    > >
    > >
    > >
    > >                         On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček
    > >                         <lukas.vlcek@gmail.com> wrote:
    > >                                 Oh, and one more note, see below:
    > >
    > >                                 On Thu, Aug 12, 2010 at 9:22 PM, Lukáš
    > >                                 Vlček <lukas.vlcek@gmail.com> wrote:
    > >                                         If I want to display the whole bio
    > >                                         highlighted, then I can either get
    > >                                         "_source" and cut bio from it on the
    > >                                         client side, but in this case I need
    > >                                         to tell ES to apply highlighting to
    > >                                         it first. Or I need to specify in the
    > >                                         mapping that bio is also stored and
    > >                                         use a fields query
    > >                                         (http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/),
    > >                                         but again I need to tell ES to
    > >                                         highlight it. And in neither case do
    > >                                         I want only fragments, I want the
    > >                                         WHOLE content of the field. The first
    > >                                         approach is not possible now; the
    > >                                         latter is possible but requires bio
    > >                                         to be explicitly stored (and it is
    > >                                         already stored in _source).
    > >
    > >
    > >                                 And the latter also requires specifying a
    > >                                 Fragmenter that returns the whole body, not
    > >                                 fragments, thus my reference to
    > >                                 NullFragmenter, which is not implemented in
    > >                                 the FastVectorHighlighter API (as far as I
    > >                                 understand it). It can be found in the older
    > >                                 Highlighting API, thus I also opened
    > >                                 http://github.com/elasticsearch/elasticsearch/issues/issue/307
    > >
    > >
    > >                                 Maybe it would be better if the
    > >                                 NullFragmenter-like functionality were
    > >                                 contributed directly to the Lucene
    > >                                 FastVectorHighlighter API. I was looking at
    > >                                 the FVH API today and I think I can try to
    > >                                 implement such a Fragmenter.
    > >
    > >
    > >
    > >
    > >
    > >                                         Hope this makes it clear.
    > >                                         (Sorry if I confused you).
    > >
    > >
    > >
    > >                                         Lukas
    > >
    > >                                         2010/8/12 Shay Banon
    > >                                         <shay.banon@elasticsearch.com>
    > >                                                 Ahh, I see. So you would
    > >                                                 still need to provide a
    > >                                                 query to the GET api in
    > >                                                 order to do the
    > >                                                 highlighting, right?
    > >
    > >
    > >
    On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček
    <lukas.vlcek@gmail.com> wrote:
    > >
    Imagine a search app for HR: Candidate catalog (cool name!).
    The entities stored in the index are as follows:
    person: { id, name, address, bio }
    Now I am using just the REST API. Say I search for "Java" and I
    would like to display a list of names and allow users to click an
    individual name, which would display the whole bio with Java
    highlighted in it (here highlighting comes into play!).
    Now I can display the bio (just using the GET REST API with a
    given document ID), but not highlighted.
    So I was thinking that it would be cool to have this function.
    > >
    > >
    > >
    Lukas
    > >
    > >
    2010/8/12 Shay Banon <shay.banon@elasticsearch.com>
    > >
    So, what you want is to be able to get just the bio field,
    without the full source, and without the bio field being
    stored? If so, then the response I gave, where the logic might
    apply also to get fields using something like "source_field"
    notion applies here. It does mean that the full source will
    need to be retrieved and parsed. Not sure how highlighting
    comes into play here...
    > >
    > >
    > >
    -shay.banon
    > >
    > >
    > >
    > >
    On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček
    <lukas.vlcek@gmail.com> wrote:
    > >
    Actually, that ticket has two parts. One is Fragmenter related
    and the other is the ability to say that I want to highlight
    some portion of the _source data. Imagine I am using only the
    REST API: if _source is a person with name, address and bio
    fields, then I would like to say that I want to highlight just
    the bio field (and I think the NullFragmenter would be needed
    for this if I want to display the whole content of bio
    highlighted, not just fragments). The other possibility would
    be to define the mapping for person in such a way that bio
    would be a stored field; then I could query for stored fields
    (not pulling the _source field) and say that I want to apply
    NullFragmenter to this data while highlighting. But this gets
    back to Tomislav's situation, because this would mean that bio
    is probably stored twice, once as a part of the source and then
    separately as a stored bio field.
    > >
    > >
    > >
    Lukas
    > >
    > >
    > >
    > >
    On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon
    <shay.banon@elasticsearch.com> wrote:
    > >
    Not sure if it overlaps: the fragmenter controls how to break up
    the highlighted data, while this relates to how to fetch the data
    to highlight.
    > >
    > >
    > >
    -shay.banon
    > >
    > >
    > >
    > >
    On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček
    <lukas.vlcek@gmail.com> wrote:
    > >
    One of the differences is that the 308 issue was meant to return
    the whole content of the _source or some of its fields (or stored
    fields if not using "_source"). But the point is that the user
    should be able to specify the Fragmenter type (or provide a custom
    implementation of Fragmenter).
    > >
    > >
    > >
    > >
    On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček
    <lukas.vlcek@gmail.com> wrote:
    > >
    If I read it correctly then I think it partly overlaps with
    http://github.com/elasticsearch/elasticsearch/issues/issue/308
    > >
    > >
    > >
    Lukas
    > >
    > >
    > >
    > >
    On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon
    <shay.banon@elasticsearch.com> wrote:
    > >
    Agreed. A bit tricky to implement, but possible. Also, note
    that this will require loading the full json from the index
    and parsing it in order to get the relevant parts from it. It
    won't be returned, but it is still loaded. So, it might not make
    sense when you have several big fields in the index and you
    want to get fragments for one of them. But it does make sense
    when having one big field.
    > >
    > >
    > >
    Also, if this is implemented, it should also be possible to
    get specific fields out of the json as a response as well
    (similar to asking for specific fields in the search request,
    maybe call them source_fields).
    > >
    > >
    > >
    Open an issue for this?
    > >
    > >
    > >
    -shay.banon
    > >
    > >
    > >
    > >
    On Thu, Aug 12, 2010 at 6:06 PM, Tomislav Poljak
    <tpoljak@gmail.com> wrote:
    Hi,

    I really like all the features that stored JSON (enabled _source)
    provides in both the Java API (which I used for indexing/updating)
    and the REST API (which I used for searching).

    I do, however, have one request for improvement regarding documents
    with large text fields and highlighting in general.

    When there is a requirement to index/search documents with large
    text fields (like 'content' with several MB of text, which is not
    unusual), returning the whole JSON for each result in the result
    set can be impossible (if each JSON document has a few MB in
    'content', returning 30-50 results to a client doesn't sound
    realistic, or even possible in acceptable time with usual Internet
    bandwidth).

    But usually it's acceptable (or even required) to return only
    highlighting snippets for the 30-50 matching results and to
    retrieve the whole document (JSON source) only for a single
    document (when requested by exact ID).

    To be able to provide highlighting snippets for a large text field
    ('content'), the field needs to be stored.

    Now we are in a situation where, because the text field is too big
    (which makes it impossible to return the JSON source for 30-50
    results to a client), we need to store it twice in the index (once
    as part of the original JSON source in _source, and a second time
    as a 'content' field with "store" : "yes" for highlighting). This
    makes the index a lot bigger.

    Also, if there is a requirement to display highlighting for all
    fields (separate highlight snippets for each field where a match
    occurred, without mixing snippets from different fields, so the
    stored _all field cannot be used for highlighting in this case),
    then the whole document (all fields) needs to be stored twice.

    In this case it seems only logical to disable the _source field
    (since all fields are stored anyway) and, when the whole document
    needs to be retrieved, use the (newly added) fields=* feature (I've
    read a similar discussion thread which led to this enhancement).

    Here is my question/proposal:

    Would it be possible to use the JSON _source field for 'field
    specific' highlighting, where the matching snippet needs to be
    returned separately for each field name?

    Maybe have a term_vector for each field, but somehow 'adjust' or
    recalculate the positions_offsets to point to the text snippet in
    _source instead of the stored field?

    I know this is not a simple requirement, but if 'field specific'
    highlighting could somehow use the stored JSON instead of requiring
    each individual field to be (separately) stored, that would make
    great use/reuse of the stored JSON _source (and no one would ever
    think of disabling it :)

    Tomislav
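To make the proposal and Shay's caveat above a bit more concrete, here is a rough Python sketch of 'field specific' highlighting driven only by the stored JSON source. It is not how Elasticsearch implements highlighting; the function name `highlight_from_source`, its parameters, and the simple regex matching are all assumptions made for illustration. Note that the whole document is parsed even though only one field is used, which is exactly the cost Shay describes.

```python
import json
import re


def highlight_from_source(source_json, field, terms, fragment_size=60,
                          pre_tag="<em>", post_tag="</em>"):
    """Return highlighted fragments for one field of a JSON source string.

    Hypothetical helper: plain regex matching stands in for real analysis.
    """
    # The full document is parsed even though a single field is needed.
    doc = json.loads(source_json)
    text = doc.get(field)
    if not isinstance(text, str) or not terms:
        return []

    # Match any query term as a whole word, case-insensitively.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    fragments = []
    covered_until = 0  # end offset of the last emitted fragment
    for match in pattern.finditer(text):
        if match.end() <= covered_until:
            continue  # this hit already sits inside the previous fragment
        # Center a fixed-size window on the hit, clamped to the text bounds.
        start = max(0, match.start() - fragment_size // 2)
        end = min(len(text), max(match.end(), start + fragment_size))
        window = text[start:end]
        fragments.append(
            pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, window)
        )
        covered_until = end
    return fragments
```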
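Shay's second suggestion in the quoted reply, returning only specific fields taken out of the JSON source, could look roughly like the sketch below. The function name `filter_source` and the dotted-path convention for nested fields are assumptions for illustration, and `source_fields` is only the name floated in this thread, not a confirmed request parameter.

```python
import json


def filter_source(source_json, source_fields):
    """Return only the requested fields from a JSON source string.

    Hypothetical helper: nested fields are addressed with dotted paths
    such as "meta.author", and the result is keyed by those paths.
    """
    doc = json.loads(source_json)  # the full source is still parsed
    result = {}
    for path in source_fields:
        node = doc
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                break  # path does not exist in this document
            node = node[key]
        else:
            result[path] = node
    return result
```

With a document like `{"title": "Report", "meta": {"author": "tp"}, "content": "..."}`, asking for `["title", "meta.author"]` would return just those two values, so the large 'content' field never has to be sent to the client.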

(system) #20