I'm trying to do something that I thought would have been a common
enough use case, but I can't find a way. I've played around with
parameters like number_of_fragments and fragment_size, but they don't
seem to address my need.
Essentially, I want to retrieve an individual fragment for each
occurrence of the query term. (As I said, I would have thought this
was common.) If we use the term "hit" to mean an index record that
matches, then the default seems to be to return up to 5 fragments for
each hit (configurable to a different value with number_of_fragments).
There are two problems with that for my use case.
(i) I want an unlimited number of fragments, i.e. one for every
occurrence of the query term. I thought that setting
number_of_fragments to zero would signify "unlimited", but it actually
means that the entire field is returned as the fragment.
(ii) By default, if there are several occurrences within a fragment,
they are all highlighted. I want just one highlighted and the next
fragment to highlight the subsequent one.
Lastly, it would be nice if the highlight occurred midway in the
fragment, yet the way fragments are returned is such that the
highlight is sometimes at the beginning or the end (even if there is
other enclosing text in the field). Is this configurable?
Any pointers would be greatly appreciated. Hopefully I've made myself
clear as to what I'm trying to do. Am I barking up the wrong tree?
Should I be looking at a different engine for this kind of thing?
I haven't verified it, but have you tried specifying a small fragment_size value?
As for number_of_fragments, you can specify a really high value.
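For reference, here is a rough sketch of what that suggestion could look like as a search body, built as a plain Java string. The field name "content", the "match" query, and the values 40 and 1000 are only assumptions for illustration; fragment_size and number_of_fragments are the highlight parameters discussed in this thread.

```java
class HighlightRequestSketch {
    // Builds a search body with a small fragment_size and a large
    // number_of_fragments. No JSON escaping is done here, so field and
    // term are assumed to be plain strings without quotes or backslashes.
    static String body(String field, String term, int fragmentSize, int maxFragments) {
        return String.format(
            "{\n"
            + "  \"query\": { \"match\": { \"%s\": \"%s\" } },\n"
            + "  \"highlight\": {\n"
            + "    \"fields\": {\n"
            + "      \"%s\": { \"fragment_size\": %d, \"number_of_fragments\": %d }\n"
            + "    }\n"
            + "  }\n"
            + "}",
            field, term, field, fragmentSize, maxFragments);
    }

    public static void main(String[] args) {
        // Hypothetical field name and values, chosen only for the example.
        System.out.println(body("content", "blonde", 40, 1000));
    }
}
```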
Thanks Shay. Yes, I had tried a small fragment size. It does what I
expected it would do. Essentially, I'm more likely to get an
individual fragment per occurrence. But given that occurrences can
happen arbitrarily close together, it's not exact. More to the point,
that works against my other issue, in that the highlight is much more
likely to occur right at the beginning or end of the fragment, leaving
no left or no right context. And showing the context is the whole
point in my use case.
Very briefly, what I need to do is tabulate each occurrence of the
query term with the previous context in the column to the left and the
subsequent context in the column to the right. So I'll be changing the
highlight wrapper tags to table-related tags. Additional table cells
in a row will break the layout, so it has to be just one occurrence
per row. In other words, if I was searching for the term "blonde" in
the Wikipedia entry for the Dylan album "Blonde on Blonde", I need to
get separate fragments like:
    and Highway 61 Revisited. Blonde on Blonde is often ranked by critics
    Highway 61 Revisited. Blonde on Blonde is often ranked by critics as
I can then tabulate that. One option is to just get one big fragment
with highlights and process it by hand. But engines are faster at that
sort of thing and it seemed likely that it was a common use case.
Perhaps I'm wrong.
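To make the "process it by hand" option concrete, here is a minimal sketch of that post-processing step. It assumes the whole field comes back as one highlighted string (number_of_fragments set to zero, as described earlier) with the default <em>...</em> tags around each occurrence. The class name, the Row type, and the character-based context width are all made up for illustration, and HTML-encoded content is not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConcordanceSketch {
    // One table row: left context, the single highlighted occurrence, right context.
    static final class Row {
        final String left, match, right;
        Row(String left, String match, String right) {
            this.left = left;
            this.match = match;
            this.right = right;
        }
    }

    // Assumes the default <em> highlight tags; adjust if pre_tags/post_tags differ.
    private static final Pattern HIT = Pattern.compile("<em>(.*?)</em>", Pattern.DOTALL);

    // Returns one row per highlighted occurrence, with up to `width`
    // characters of plain-text context on each side.
    static List<Row> rows(String highlighted, int width) {
        StringBuilder plain = new StringBuilder();
        List<int[]> spans = new ArrayList<>();
        Matcher m = HIT.matcher(highlighted);
        int last = 0;
        while (m.find()) {
            // Copy the untagged text, then record where the match lands in it.
            plain.append(highlighted, last, m.start());
            int start = plain.length();
            plain.append(m.group(1));
            spans.add(new int[] {start, plain.length()});
            last = m.end();
        }
        plain.append(highlighted, last, highlighted.length());

        List<Row> rows = new ArrayList<>();
        for (int[] s : spans) {
            String left = plain.substring(Math.max(0, s[0] - width), s[0]);
            String right = plain.substring(s[1], Math.min(plain.length(), s[1] + width));
            rows.add(new Row(left, plain.substring(s[0], s[1]), right));
        }
        return rows;
    }

    public static void main(String[] args) {
        String field = "and Highway 61 Revisited. <em>Blonde</em> on <em>Blonde</em> is often ranked by critics";
        for (Row r : rows(field, 30)) {
            System.out.println(r.left + " | " + r.match + " | " + r.right);
        }
    }
}
```

Because the tags are stripped before the context is cut, neighbouring occurrences show up as ordinary text in the left and right columns, so each row carries exactly one highlighted cell. A word-based window would match the examples above more closely; the character window is just the simplest thing to sketch.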
OK, I did a bit more digging and found a post detailing how someone
did this kind of concordance view at the Lucene level:
(The output near the bottom gives the idea most clearly.) So my
question has shifted. Is it feasible to patch the underlying Lucene
implementation in Elasticsearch to supplement the built-in
functionality? Or does Elasticsearch depend on a very specific
packaged implementation?
This is way late, but I've implemented this in LUCENE-5317/LUCENE-5318, which are both available on GitHub and now Maven Central, lumped together under lucene-5317.
There's a traditional ConcordanceSearcher (LUCENE-5317), but also a WindowVisitor API (LUCENE-5318).
The following snippet would return a List<TermIDF> results capturing 1 word before your target and 0 words after your target. The visitor would visit the first 10000 windows.
    // `directory` and `analyzer` are assumed to be the already-opened Lucene
    // Directory and the Analyzer used for FIELD.
    IndexReader reader = DirectoryReader.open(directory);
    IndexSearcher indexSearcher = new IndexSearcher(reader);
    IDFIndexCalc idfCalc = new IDFIndexCalc(reader);

    // 1 token before and 0 tokens after the target; visit at most 10000 windows.
    CooccurVisitor visitor = new CooccurVisitor(
        FIELD, 1, 0, new WGrammer(1, 1, false), idfCalc, 10000, false);
    visitor.setMinTermFreq(0);

    ConcordanceArrayWindowSearcher searcher = new ConcordanceArrayWindowSearcher();
    SpanQuery q = new SpanTermQuery(new Term(FIELD, "d"));
    searcher.search(indexSearcher, FIELD, q, null, analyzer, visitor,
        new IndexIdDocIdBuilder());
    List<TermIDF> results = ((CooccurVisitor) visitor).getResults();
If there's any interest in integrating this into Elastic, let me know.