I would like to use a full-text query on annotated text. A similar question was asked last year, but didn't receive any responses.
Like that poster, I want to be able to do the following.
If I have the annotated text
Text1: "[Thomas Jefferson](_president_) was born in [Virginia](_place_)"
Text2: "[Thomas Jefferson](_writer_) was born in [Virginia](_place_)."
I'd like to be able to execute the query
{
  "match_phrase": {
    "annotatedField": "_president_ was born in Virginia"
  }
}
and have it match Text1 but not Text2.
Is there any way to do this with annotated text? If not, is there a way to do it with payloads?
Thanks!
The problem you are facing is that the match_phrase query clause feeds your input through an analyzer, which will probably strip the underscores off your "_president_" and fail to match. Your search string has to be presented in a way where the terms are not tokenized, which will involve more JSON. See the Span or Interval query clauses.
In Text1 the “ispresident” token (your _president_ annotation) is anchored to the same position as the first token it annotates, so in this case “Thomas” and not “Jefferson”. That leaves a gap of one position between the annotation and “was”. Adding a small slop factor to the query will help allow for this gap.
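The anchoring described above can be pictured with a small simulation. This is only a sketch in plain Python, not Elasticsearch's actual analysis chain: the function name `token_positions` is made up for illustration, and it assumes simple word splitting, lowercasing of the text, and annotation values indexed verbatim.

```python
import re

# Matches the [covered text](annotation) markup used in the question.
ANNOTATION = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")


def token_positions(annotated):
    """Return (position, token) pairs for an annotated string.

    Simplified model: words are lowercased, and each annotation value is
    placed at the position of the FIRST word it covers.
    """
    tokens = []
    pos = 0
    cursor = 0
    for m in ANNOTATION.finditer(annotated):
        # Plain words that precede this annotation.
        for word in re.findall(r"\w+", annotated[cursor:m.start()]):
            tokens.append((pos, word.lower()))
            pos += 1
        # The annotation shares the position of the first covered word.
        tokens.append((pos, m.group(2)))
        for word in re.findall(r"\w+", m.group(1)):
            tokens.append((pos, word.lower()))
            pos += 1
        cursor = m.end()
    # Plain words after the last annotation.
    for word in re.findall(r"\w+", annotated[cursor:]):
        tokens.append((pos, word.lower()))
        pos += 1
    return tokens


text1 = "[Thomas Jefferson](_president_) was born in [Virginia](_place_)"
for position, token in token_positions(text1):
    print(position, token)
```

In this model `_president_` and `thomas` share position 0, `jefferson` is at 1 and `was` is at 2, so a phrase that expects `was` immediately after `_president_` is off by one position.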
Unfortunately a slop wouldn't work well for my purposes. My annotated text is up to 5 words long. A slop that large would cause many unwanted matches.
I can think of two possible approaches around this issue. I would appreciate some feedback on the feasibility of these.
Tokenize the annotated text as a single token. If my understanding is right, if "Thomas Jefferson" were a keyword token, then it would occupy the same position as ispresident and my query would match. Is there any way to mark certain phrases as keywords during the tokenization process? In my case, all instances of "Thomas Jefferson" would be keywords and I know all possible keywords at the outset. If I could turn them into single tokens at index/query time, it seems like that should solve my issue.
Create ispresident annotations of differing lengths. This idea is pretty hacky, but since I have a finite number of phrases that could be annotated with ispresident, I could create "ispresident0", "ispresident0 ispresident1", and "ispresident0 ispresident1 ispresident2" annotations to match texts of different lengths. I could then expand my query on that field to turn ispresident into all of those alternatives. How are annotations tokenized? Will multi-word annotations take up multiple token positions?
Thanks!
The single-token approach would preclude users from searching for "Thomas Jefferson" or just "Jefferson" and matching that text.
Unlike synonyms, annotations are not an index-wide policy definition attached to an analyzer. They are overlays on selected pieces of text so that not all Thomas Jeffersons have to be presidents.
You can use the _analyze API to see how annotations are tokenized. There's an example in this blog.
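As a rough illustration of such a check, the sketch below builds an _analyze request that asks Elasticsearch to analyze a sample string with the analyzer of a given field. The index name `my-index` and the helper `analyze_request` are hypothetical; `annotatedField` is the field name from the question. The sketch only prints the request and does not contact a cluster.

```python
import json


def analyze_request(index, field, text):
    """Build the path and body for an _analyze call that uses the
    analyzer configured on `field` in `index`."""
    path = f"/{index}/_analyze"
    body = {"field": field, "text": text}
    return path, body


# "my-index" is a placeholder index name chosen for this example.
path, body = analyze_request(
    "my-index",
    "annotatedField",
    "[Thomas Jefferson](_president_) was born in [Virginia](_place_)",
)
print("GET", path)
print(json.dumps(body, indent=2))
```

Each token in the response carries a `position` value, so the annotation and the first word it covers should show up with the same position.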
Nope. Annotations are always anchored to the same position as the first token they annotate. Practically speaking, we had to pick the annotation's position as either the first or last token and we chose the former. We could have positioned an instance of the annotation over both or maybe every token in the covered text, but that would have messed with term frequency (TF) scoring and any searches to find presidents near mentions of other presidents.
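Given that an annotation sits at the position of the first token it covers, one option that may limit the side effects of slop is to nest span queries: an inner span_near with slop 0 for the exact words after the annotated text, and an outer span_near whose slop only has to cover the remaining words of the annotated text. The sketch below builds such a query body. The helper name `annotated_phrase_query` is made up for illustration, and it assumes that annotation values are indexed verbatim and that the other terms are supplied in their indexed form (for example lowercased), because span_term clauses are not analyzed.

```python
import json


def annotated_phrase_query(field, annotation, following_terms, max_annotated_words):
    """Build a nested span_near query body.

    The inner clause requires `following_terms` as an exact ordered phrase.
    The outer clause lets up to (max_annotated_words - 1) positions sit
    between the annotation token and that phrase.
    """
    following_phrase = {
        "span_near": {
            "clauses": [{"span_term": {field: term}} for term in following_terms],
            "slop": 0,
            "in_order": True,
        }
    }
    return {
        "span_near": {
            "clauses": [{"span_term": {field: annotation}}, following_phrase],
            # Only the gap after the annotation is allowed to stretch.
            "slop": max_annotated_words - 1,
            "in_order": True,
        }
    }


query = annotated_phrase_query(
    "annotatedField", "_president_", ["was", "born", "in", "virginia"], 5
)
print(json.dumps({"query": query}, indent=2))
```

This keeps "was born in virginia" strict while allowing for annotated text of up to five words. It would still match a document in which a shorter annotated text is followed by a few unrelated words before the phrase, so it narrows the unwanted matches rather than ruling them out.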