Yeah, the discussion took a different turn from the original request. Yes, it
is possible, with the downsides already mentioned: the full source has to be
loaded and parsed in the "fetch" phase on the specific node. If that causes
performance problems, adding more replicas should solve it, and you will be
able to do that dynamically in the upcoming 0.9.1. Can you open a
feature request for it?
Hi,
actually this requirement can be summarized even further:
It would be really (really) great if ES could provide the 'highlighting' and
'fields' features of the search API without the need for each field to
be stored separately (by reusing the stored json _source field).
Do you think this would be possible?
Tomislav
On Fri, 2010-08-13 at 16:30 +0200, Tomislav Poljak wrote:
Hi,
I'm not sure I fully understand what will be implemented as a
result/conclusion of the discussion here, but I think I can define what I
would like to be implemented (from my point of view) pretty clearly as:
It would be great if ES, besides returning the whole document source (in json
format) in search results, supported returning a json type structure with
'matching fields' and/or requested fields. Only fields which are matched
by a query or requested would be returned (from the _source json) and this
would be possible without storing each field separately. The value in these
fields would be either the whole field value (with highlighting applied)
or a highlighting snippet (for large textual fields).
Will something like that be possible?
Thanks,
Tomislav
On Fri, 2010-08-13 at 12:50 +0300, Shay Banon wrote:
No problem, I understand the general idea of the requirement.
-shay.banon
On Fri, Aug 13, 2010 at 11:21 AM, Lukáš Vlček lukas.vlcek@gmail.com
wrote:
On Fri, Aug 13, 2010 at 10:20 AM, Lukáš Vlček
<lukas.vlcek@gmail.com> wrote:
Yes, I did not realize that earlier but you are right
that I will need to pass query into the highlight
section as well.
Take the following example:
I need to display all candidates that match "dude
java" query and then I want to allow user to click on
individual name and get whole bio highlighted.
So how I can go about this:
First, I can get relevant documents using simple
"query_string" query for "dude java". I can now
display names of candidates without highlights and
highlighted fragments from bio for each name, kind of
basic search interface that already works now. But if
I wanted to display highlighted name I would get
something like "..e <em>Dude</em> Abid..." which is
not what I want (sure, I can work with fragment size
but that is just workaround and does not fit all
situations). So when using that "query_string" query I
would like to specify in the highlight section that
the person.name should be highlighted with no
fragments.
Second, now, when the user clicks individual name,
then I want to get whole bio highlighted.
So I need to get specific document (by ID) and have
the bio field highlighted (and the name field as well)
The example of the query that could be used:
curl -XGET http://localhost:9200/_all/_search -d '
{
  "query" : { "term" : { "person-id" : "1234" } },
  "highlight" : {
    "fields" : {
      "_source" : {
        "path" : "person.bio,person.name",
        "fragmenter" : "classpath.to.NullFragmenter",
        "query" : {
          "query_string" : { "fields" : ["bio","name"], "query" : "dude java" }
        }
      }
    }
  }
}'
or I could use fields query:
curl -XGET http://localhost:9200/_all/_search -d '
{
  "query" : { "term" : { "person-id" : "1234" } },
  "fields" : ["bio","name"],
  "highlight" : {
    "fields" : {
      "bio" : {
        "query" : {
          "query_string" : { "fields" : ["bio"], "query" : "dude java" }
        }
      },
      "name" : {
        "query" : {
          "query_string" : { "fields" : ["name"], "query" : "dude java" }
        }
      }
    }
  }
}'
The latter query requires both bio and name to be
stored (and this is where it gets back to Tomislav's
original point I think).
Ugh! I am complicating it way too much... but hope the
request is clear now :-)
Sure I am complicating it too much, because in the latter query
example I forgot to specify the NullFragmenter :-)
Regards,
Lukas
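To make the difference between fragment highlighting and the NullFragmenter-style behaviour requested above concrete, here is a minimal Python sketch. It is not ES or Lucene code: the function names, the sample name "Joe Dude Abid" and the two-character window are all made up for illustration, and the matching is a plain case-insensitive word match rather than a real analyzed query.

```python
import re


def _pattern(terms):
    # Case-insensitive whole-word match on any of the query terms.
    return re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )


def highlight_whole(text, terms, pre="<em>", post="</em>"):
    """NullFragmenter-like behaviour: return the full text with matches tagged."""
    if not terms:
        return text
    return _pattern(terms).sub(lambda m: pre + m.group(0) + post, text)


def highlight_fragment(text, terms, pad=2, pre="<em>", post="</em>"):
    """Fragment behaviour: a small window around the first match only."""
    m = _pattern(terms).search(text) if terms else None
    if m is None:
        return None
    start = max(0, m.start() - pad)
    end = min(len(text), m.end() + pad)
    fragment = text[start:m.start()] + pre + m.group(0) + post + text[m.end():end]
    # Mark the places where the window cut the original text.
    return (".." if start > 0 else "") + fragment + (".." if end < len(text) else "")


name = "Joe Dude Abid"  # hypothetical candidate name
print(highlight_whole(name, ["dude", "java"]))     # Joe <em>Dude</em> Abid
print(highlight_fragment(name, ["dude", "java"]))  # ..e <em>Dude</em> A..
```

For a short field such as person.name the whole-text variant is what a result list would want to show, while the fragment variant still suits a long bio in the list view.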
2010/8/13 Shay Banon <shay.banon@elasticsearch.com>
ok, so you want to get the whole bio field
highlighted, so you would need to pass the
query to the get API as well, otherwise, there
is no way to highlight it (aside from other
things you need, like the option to do no
fragmentation and getting the actual data).
On Thu, Aug 12, 2010 at 10:34 PM, Lukáš Vlček
<lukas.vlcek@gmail.com> wrote:
Oh, and one more note, see below:
On Thu, Aug 12, 2010 at 9:22 PM, Lukáš
Vlček <lukas.vlcek@gmail.com> wrote:
If I want to display the whole bio highlighted then I can either
get "_source" and cut bio from it on the client side, but in
this case I need to tell ES to use highlighting on it first.
Or I need to specify in the mapping that bio is also
stored and use a fields query
http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/fields/
but again I need to tell ES to highlight it. And in neither case do I want
only fragments, I want the WHOLE content of the field. The first approach is
not possible now; the latter is possible but requires bio to be explicitly
stored (and it is already stored in _source).
And the latter also requires specifying a Fragmenter that
returns the whole body, not fragments, thus my reference to
NullFragmenter, which is not implemented in the
FastVectorHighlighter API (as far as I understand it). It can be
found in the older Highlighting API, thus I also opened
Highlighting API: Add support for custom FragmentsBuilder · Issue #307 · elastic/elasticsearch · GitHub
Maybe it would be better if the NullFragmenter-like functionality
were contributed directly to the Lucene FastVectorHighlighter API.
I was looking at the FVH API today and I think I can try to
implement such a Fragmenter.
Hope this makes it clear.
(Sorry if I confused you).
Lukas
2010/8/12 Shay Banon <shay.banon@elasticsearch.com>
Ahh, I see. So you would still need to provide a query to the
GET api in order to do the highlighting, right?
On Thu, Aug 12, 2010 at 10:08 PM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
Imagine a search app for HR: Candidate catalog (cool name!).
The entities stored in the index are as follows:
person: { id, name, address, bio }
Now I am using just the REST API. Say I search for "Java" and I would like to
display a list of names and allow users to click an individual name, which
would display the whole bio with Java highlighted in it (here highlighting
comes into play!).
Now I can display the bio (just using the GET REST API with the given document
ID) but not highlighted. So I was thinking that it would be cool to have
this function.
Lukas
2010/8/12 Shay Banon <shay.banon@elasticsearch.com>
So, what you want is to be able to get just the bio field, without the full
source, and without the bio field being stored? If so, then the response I
gave applies here, where the logic might also apply to getting fields using
something like a "source_field" notion. It does mean that the
full source will need to be retrieved and parsed. Not sure how highlighting
comes into play here...
-shay.banon
On Thu, Aug 12, 2010 at 9:53 PM, Lukáš Vlček <lukas.vlcek@gmail.com> wrote:
Actually, that ticket has two parts. One is Fragmenter related and the other
one is possibility to tell, that I want to highlight some portion of _source
data. Imagine I am using only REST API and for example if _source is a
person with name, address and bio fields then I would like to tell that I
want to highlight just the bio field (and I think the NullFragmenter would
be needed for this if I want to display whole content of bio highlighted,
not just fragments). The other possibility would be to define the mapping for
person in such a way that bio would be a stored field, then I could query
for stored fields (not pulling the _source field) and tell that I want to
apply NullFragmenter to this data while highlighting. But this gets back to
Tomislav's situation, because this would mean that bio is probably
stored twice, once as a part of the source and then separately as a stored bio
field.
Lukas
On Thu, Aug 12, 2010 at 8:41 PM, Shay Banon shay.banon@elasticsearch.com
wrote:
Not sure if it overlaps. The fragmenter controls how to break up the
highlighted data; this relates to how to fetch that data to highlight.
-shay.banon
On Thu, Aug 12, 2010 at 9:08 PM, Lukáš Vlček <
lukas.vlcek@gmail.com> wrote:
One of the differences is that issue 308 was meant to return the
whole content of the _source or some of its fields (or stored fields if not
using "_source"). But the point is that the user should be able to specify
the Fragmenter type (or provide a custom implementation of Fragmenter).
On Thu, Aug 12, 2010 at 8:03 PM, Lukáš Vlček <
lukas.vlcek@gmail.com> wrote:
If I read it correctly then I think it partly
overlaps with
Highlighting API: Apply highlighting to REST GET operation · Issue #308 · elastic/elasticsearch · GitHub
Lukas
On Thu, Aug 12, 2010 at 5:53 PM, Shay Banon <
shay.banon@elasticsearch.com> wrote:
Agreed. A bit tricky to implement, but
possible. Also, note that this will require loading the full json from the
index and parsing it in order to get the relevant parts from it. It won't be
returned, but it is still loaded. So, it might not make sense when you have
several big fields in the index, and you want to get fragments for one of
them. But it does make sense when having one big field.
Also, if this is implemented, it should also
be possible to get specific fields out of the json as a response as well
(similar to asking for specific fields in the search request, maybe call
them source_fields).
Open an issue for this?
-shay.banon
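As a rough illustration of the cost described above, the following Python sketch pulls requested fields out of a stored JSON source. It only mimics the idea outside of ES: the function name extract_source_fields, the dotted-path convention and the sample document are assumptions made for this example, not an existing ES API.

```python
import json

_MISSING = object()  # sentinel so that explicit null values are still returned


def extract_source_fields(raw_source, paths):
    """Return only the requested dotted paths from a stored JSON document."""
    # The whole document is parsed even when a single field is requested;
    # this is the load-and-parse cost mentioned in the message above.
    doc = json.loads(raw_source)
    result = {}
    for path in paths:
        node = doc
        for key in path.split("."):
            if isinstance(node, dict) and key in node:
                node = node[key]
            else:
                node = _MISSING
                break
        if node is not _MISSING:
            result[path] = node
    return result


# Hypothetical stored source for one candidate.
raw = '{"person": {"id": "1234", "name": "Joe Dude Abid", "bio": "Java dude"}}'
print(extract_source_fields(raw, ["person.bio", "person.zip"]))
# {'person.bio': 'Java dude'}
```

Only the small result dictionary would travel back to the client, but the parse step touches every field, which is why this pays off mainly when a document has one big field rather than several.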
On Thu, Aug 12, 2010 at 6:06 PM, Tomislav
Poljak tpoljak@gmail.com wrote:
Hi,
I really like all the features
stored json (enabled _source) provides in
both Java API (used it in
indexing/updating) and REST API (used it for
searching).
I do however have one possible
request for improvement regarding
documents with large textual fields
and overall highlighting.
When there is a requirement to
index/search documents with large textual
fields (like 'content' with text in
Mb, which is not unusual), returning
a whole json for each result in
result set can be impossible (if each
json document has a few Mb in
'content' returning 30-50 results to a
client doesn't sound realistic or
even possible in acceptable time with
usual 'Internet' bandwidth).
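A quick back-of-the-envelope calculation shows the scale of the problem. All figures here are assumptions picked for illustration (3 MB per document, reading 'Mb' above as megabytes, 1 KB of snippets per hit, a 10 Mbit/s link), and the helper name transfer_seconds is made up.

```python
def transfer_seconds(results, kb_per_result, link_mbit_per_s):
    """Seconds needed to send `results` payloads of `kb_per_result` kilobytes each."""
    total_bits = results * kb_per_result * 1000 * 8
    return total_bits / (link_mbit_per_s * 1_000_000)


# 50 full documents of 3 MB each over an assumed 10 Mbit/s link.
print(transfer_seconds(50, 3000, 10))  # 120.0

# 50 results carrying about 1 KB of highlight snippets each, same link.
print(transfer_seconds(50, 1, 10))     # 0.04
```

Under these assumptions a page of full documents takes about two minutes to transfer, while a page of snippets is effectively instant.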
But, usually it's acceptable (or
even requirement) to display/return
only highlighting snippets for 30-50
results matches and retrieve whole
document (json source) only for a
single document (when requested by
exact ID).
To be able to provide highlighting
snippets for large textual
('content') field, it needs to be
stored.
Now we are in a situation where,
because the textual field is too big (which makes it
impossible to return the json source for
30-50 results to a client), we need
to store it twice in the index (once as
a part of the original json source in
_source and a second time as a
'content' field with "store" : "yes" for
highlighting). This makes the index a lot
bigger.
Also, if there is a requirement to
display highlighting for all fields
(separate highlight snippets for
each field where match occurred,
without mixing fields snippets ->
stored _all field can not be used for
highlighting in such case) then
whole document (all fields) needs to be
stored twice.
In this case seems only logical to
disable _source field (since all
fields are stored anyway) and when
whole document needs to be retrieved
use (newly added) fields=* feature
(I've read similar discussion thread
which led to this enhancement)
Here is my question/proposal:
would it be possible to enable use
of json _source field for 'field
specific' highlighting, where
matching snippet needs to be returned
separately for each field name?
Maybe to have term_vector for each
field, but to somehow 'adjust' or
recalculate positions_offsets to
point to text snippet in _source
instead of stored field?
I know this is not a simple
requirement, but if 'field specific'
highlighting could somehow use
stored json instead of requiring an
individual field to be (separately)
stored, that would make a great
use/reuse of stored json _source
(and no one would ever think of
disabling it :)
Tomislav