To understand the analysis process


(terrasacer) #1

Hi all.

I'm trying to understand the analysis process, especially at query time.

For example, we have a field configured as not_analyzed. The value of this field is Albert Einstein.

If I search for "Albert" with the match query, the document does not match, but if I use the match_phrase_prefix query, the document is returned.
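For reference, the setup described here might look roughly like the request bodies below, written as Python dicts. This is only a sketch: the field name `name` is an assumption, and the mapping uses the legacy `string` / `not_analyzed` syntax that the question implies.

```python
import json

# Hypothetical field name "name"; legacy string mapping with analysis disabled.
mapping = {
    "properties": {
        "name": {"type": "string", "index": "not_analyzed"}
    }
}

# The document from the question.
document = {"name": "Albert Einstein"}

# Reported in this thread as returning no hits.
match_query = {"query": {"match": {"name": "Albert"}}}

# Reported in this thread as returning the document.
match_phrase_prefix_query = {"query": {"match_phrase_prefix": {"name": "Albert"}}}

# Print each body as JSON, ready to paste into a REST client.
for body in (mapping, document, match_query, match_phrase_prefix_query):
    print(json.dumps(body))
```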

Why?


(Patrick Kik) #2

What I understand from https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html#_match_phrase_prefix is that your query acts as a prefix.

You could test that by querying for "Einstein". Then both the match query and the match_phrase_prefix query would return nothing.

Could you post your findings please?


(terrasacer) #3

Hi @PatrickKik,

First, sorry for the late reply.

I'm trying to understand this: does not_analyzed mean that the text is stored without being transformed or converted?

If that is true, how does the match_phrase_prefix query return results?

Does it analyze the text again at query time?


(terrasacer) #4

Does anyone have an idea how the match_phrase_prefix query returns results in the example above?


(Doug Turnbull) #5

According to here, match phrase prefix does a prefix query on the last term in the query. With not_analyzed, the whole string is taken as a term. Therefore the "last term" is [Albert Einstein]. Albert is a prefix of this term, therefore it matches.

Most other search queries are exact term matches. For a document to match, a query term must be exactly equal to one of the document's terms after analysis has run on both.
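The difference can be sketched in plain Python. This is a simplified model, not how Elasticsearch is implemented: `keyword_analyze`, `term_match` and `phrase_prefix_match` are hypothetical helpers, and the model assumes the query string goes through the same (no-op) analysis as the field.

```python
def keyword_analyze(text):
    """Model of a not_analyzed field: the whole string becomes one term."""
    return [text]


def term_match(query_terms, indexed_terms):
    """Exact matching: some query term must equal an indexed term."""
    return any(q in indexed_terms for q in query_terms)


def phrase_prefix_match(query_terms, indexed_terms):
    """All query terms but the last must match exactly and in order;
    the last query term only needs to be a prefix of the next indexed term."""
    n = len(query_terms)
    if n == 0:
        return False
    for start in range(len(indexed_terms) - n + 1):
        window = indexed_terms[start:start + n]
        if window[:-1] == query_terms[:-1] and window[-1].startswith(query_terms[-1]):
            return True
    return False


indexed = keyword_analyze("Albert Einstein")  # one term: "Albert Einstein"

for search in ("Albert", "Einstein", "Albert Einstein"):
    query = keyword_analyze(search)
    print(search, term_match(query, indexed), phrase_prefix_match(query, indexed))
```

In this model, "Albert" fails the exact match but passes the prefix match, while "Einstein" fails both, which is the behaviour discussed earlier in the thread.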

This blog post might be a good primer for you.


(terrasacer) #6

Hi @softwaredoug,

Thank you for this informative answer and blog post.

Therefore the "last term" is [Albert Einstein]. Albert is a prefix of this term, therefore it matches.

My understanding is that the string [Albert Einstein] is transformed into two substrings, ["Albert", "Einstein"], by the match_phrase_prefix query at search time.

Is that correct?


(Doug Turnbull) #7

The search string is not_analyzed as well. The search engine takes the whole token [Albert Einstein] and treats it as a word. There's no extra "substring" involved. [Albert] is a prefix of the larger term [Albert Einstein].
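A rough way to picture this, assuming the search string is kept as one token and compared as a prefix of the single indexed term (`matches_prefix` is a hypothetical helper, not an Elasticsearch API):

```python
# The single term stored for the not_analyzed field.
INDEXED_TERM = "Albert Einstein"


def matches_prefix(search_string, indexed_term=INDEXED_TERM):
    """Treat the whole search string as one token and test it as a prefix."""
    return indexed_term.startswith(search_string)


# No splitting and no lowercasing happens to either side in this model.
for search in ("Albert", "Albert Ein", "Einstein", "albert"):
    print(repr(search), matches_prefix(search))
```

If the string had been split into ["Albert", "Einstein"], a search for "Einstein" would be expected to match. In this model it does not, because only the start of the whole term is compared.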

