Annotated Text Plugin: full-text queries for annotated_text

I would like to use a full-text query on annotated text. A similar question was asked last year, but didn't receive any responses.
Like that poster, I want to be able to do the following:
if I have the annotated text

Text1: "[Thomas Jefferson](_president_) was born in [Virginia](_place_)"
Text2: "[Thomas Jefferson](_writer_) was born in [Virginia](_place_)."  

I'd like to be able to execute the query

    "match_phrase": {
        "annotatedField": "_president_ was born in Virginia"

and have it match Text1 but not Text2.
Is there any way to do this with annotated text? If not, is there a way to do it with payloads?

The problem you are facing is that the match_phrase query clause feeds your input through an analyzer, which will probably strip the underscores off your "_president_" term and fail to match. Your search string has to be presented in a way where the annotation terms are not tokenized, which will involve more JSON - see the span or intervals query clauses.
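As a sketch, a span-based version of the query above might look like the following (field name and values are taken from the example in this thread; span_term clauses are not analyzed, so the plain-word terms must match the indexed, lowercased forms, while the "_president_" annotation token keeps its underscores):

    "span_near": {
        "clauses": [
            { "span_term": { "annotatedField": "_president_" } },
            { "span_term": { "annotatedField": "was" } },
            { "span_term": { "annotatedField": "born" } },
            { "span_term": { "annotatedField": "in" } },
            { "span_term": { "annotatedField": "virginia" } }
        ],
        "slop": 0,
        "in_order": true
    }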

Thanks for the reply! I just tried adding the annotations without the underscores, and it worked. Thanks for pointing me in the right direction!

It looks like match_phrase doesn't work well when the annotated text contains spaces.
If I have the following:

Text1: "when [Thomas Jefferson](ispresident) was born in [Virginia](isplace)"
Text2: "when [Jefferson](ispresident) was born in [Virginia](isplace)." 

then this matches Text2, but not Text1:

    "match_phrase": {
        "annotatedField": "when ispresident was born in Virginia"

I would expect it to match both. Any insights?

In Text1 the "ispresident" token is anchored to the same position as the first token it annotates, so in this case "Thomas" and not "Jefferson". Adding a small slop factor to the query will allow for this gap.
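For the two texts above, a match_phrase with a slop of 1 should be enough to absorb the extra "Jefferson" token (a sketch using the field name from this thread):

    "match_phrase": {
        "annotatedField": {
            "query": "when ispresident was born in Virginia",
            "slop": 1
        }
    }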

Unfortunately, slop wouldn't work well for my purposes. The phrases I annotate are up to five words long, and a slop that large would cause many unwanted matches.
I can think of two possible approaches around this issue. I would appreciate some feedback on the feasibility of these.

  1. Tokenize the annotated text as a single token. If my understanding is right, if "Thomas Jefferson" were a single keyword token, then it would occupy the same position as ispresident and my query would match. Is there any way to mark certain phrases as keywords during the tokenization process? In my case, all instances of "Thomas Jefferson" would be keywords, and I know all possible keywords at the outset. If I could turn them into single tokens at index and query time, that should solve my issue.

  2. Create ispresident annotations of differing lengths. This idea is pretty hacky, but since I have a finite number of phrases that could be annotated with ispresident, I could create "ispresident0", "ispresident0 ispresident1", and "ispresident0 ispresident1 ispresident2" annotations to match texts of different lengths. I could then expand my query on that field to turn ispresident into each of those alternatives. How are annotations tokenized? Will multi-word annotations take up multiple token positions?

> Tokenize the annotated text as a single token

That approach would preclude users searching for "Thomas Jefferson" or just "Jefferson" and matching that text.
Unlike synonyms, annotations are not an index-wide policy definition attached to an analyzer. They are overlays on selected pieces of text, so that not all Thomas Jeffersons have to be presidents.
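For reference, annotated_text is a field type provided by the mapper-annotated-text plugin rather than an analyzer setting; a minimal mapping sketch (index and field names are placeholders) would be:

    PUT my-index
    {
        "mappings": {
            "properties": {
                "annotatedField": { "type": "annotated_text" }
            }
        }
    }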

You can use the _analyze API to see that. There's an example in this blog.
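Assuming an index with an annotated_text field, a request along these lines shows where each token lands (index and field names are placeholders); the annotation token should appear at the same position as the first token of the text it covers:

    GET my-index/_analyze
    {
        "field": "annotatedField",
        "text": "when [Thomas Jefferson](ispresident) was born in [Virginia](isplace)"
    }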

Nope. They're always anchored to the same position as the first token they annotate. Practically speaking, we had to pick the annotation's position as either the first or the last token, and we chose the former. We could have positioned an instance of the annotation over both, or maybe over every token in the covered text, but that would have messed with term-frequency (TF) scoring and any searches looking for presidents near mentions of other presidents.
