Hi,
I have documents which have fields in a few different languages,
meaning that in one document field A might be English, field B German
and in another document field A Finnish and field B English. The
documents have other fields which should be searchable too (no
language analysis). I would like to use language analyzers so the
field per language approach should be the best option.
In short, I'm not experienced enough with Elasticsearch, and my
problems are that:
- I do not understand why tokens from language analyzed fields are
  not included in _all, and
- why is there a strange conflict when using query_string (Multi
  Field) to query the language analyzed fields together with _all (see
  query 3 below)?
Using the following example mapping and record (full sh/json at
https://gist.github.com/2029161#file_setup.sh)...
Properties:
  title:
    type: string
    index: analyzed
    boost: 2.0
  title_eng:
    type: string
    index: analyzed
    analyzer: english
  title_fin:
    type: string
    index: analyzed
    analyzer: finnish
  title_ger:
    type: string
    index: analyzed
    analyzer: german
Record:
  title: Topics on Vagueness
  title_eng: Topics on Vagueness
Tokens produced by english analyzer: topic, vagu
Then doing some searches (full sh/json:
https://gist.github.com/2029161#file_queries.sh)...
Query 1: No results, I guess language specific tokens can't be found
in _all...
text:
  _all: topic
Query 2: So now I have to query the language analyzed fields
separately, record is found as expected
query_string:
  fields:
    - title_eng
    - title_fin
    - title_ger
    - _all
  default_operator: AND
  query: topic
Query 3: not found, what's going on?!
query_string:
  fields:
    - title_eng
    - title_fin
    - title_ger
    - _all
  default_operator: AND
  query: topics on vagueness
Query 4: works as expected
dis_max:
  queries:
    - text:
        _all:
          query: topics on vagueness
          operator: and
    - text:
        title_eng:
          query: topics on vagueness
          operator: and
    - text:
        title_fin:
          query: topics on vagueness
          operator: and
    - text:
        title_ger:
          query: topics on vagueness
          operator: and
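
Since query 4 repeats the same clause for every field, the request body could be generated instead of written by hand. This is only a sketch in Python: build_dis_max is a hypothetical helper name of mine, and the field list is simply the one from the mapping above.

```python
import json


def build_dis_max(query, fields, operator="and"):
    """Return a dis_max body with one `text` clause per field.

    Hypothetical helper: it only builds the JSON structure used in
    query 4 and does not talk to Elasticsearch.
    """
    return {
        "dis_max": {
            "queries": [
                {"text": {field: {"query": query, "operator": operator}}}
                for field in fields
            ]
        }
    }


# Field names taken from the example mapping.
FIELDS = ["_all", "title_eng", "title_fin", "title_ger"]

body = build_dis_max("topics on vagueness", FIELDS)
print(json.dumps({"query": body}, indent=2))
```

Adding another language field would then only mean adding one name to FIELDS.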
Now, I guess there is no way I can or even should attempt to find
stemmed versions in the _all field? (Finding at least "topic" would be
nice)
In that case, given "one search box", the query options are either
query_string or dis_max. But why doesn't query #3 work with the
title_ger field? If I remove the title_ger field from the query, or
even replace it with a field that doesn't exist (foobar), the query
works. If I query "topics vagueness" without "on" it works. Unlike the
English and Finnish analyzers, the German analyzer produces an "on"
token, which seems to "break" the query, but there is no data in that
field! I don't get it.
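
To make my guess concrete, here is a toy Python model of what I suspect might be going on. It is not Elasticsearch or Lucene code. It assumes that query_string with default_operator AND builds one required clause per query term, that each clause ORs together only the fields whose analyzer kept the term, and that a term dropped by every field produces no clause at all. The stopword sets, the stem table and the index contents below are made up to mirror the example record, and I have not confirmed that _all really drops "on".

```python
# Toy model only: made-up analyzers, not Elasticsearch or Lucene code.

# Assumption: which fields treat "on" as a stopword.
STOPWORDS = {
    "title_eng": {"on"},
    "title_fin": {"on"},
    "title_ger": set(),   # the German analyzer keeps "on"
    "_all": {"on"},       # assumption about _all's default analyzer
}

# Assumption: only the English analyzer stems, as in the example tokens.
STEMS = {"title_eng": {"topics": "topic", "vagueness": "vagu"}}

# Tokens indexed for the example record; title_fin and title_ger are empty.
INDEX = {
    "title_eng": {"topic", "vagu"},
    "_all": {"topics", "vagueness"},
}


def analyze(field, term):
    """Return the token a field's toy analyzer emits, or None if dropped."""
    term = term.lower()
    if term in STOPWORDS[field]:
        return None
    return STEMS.get(field, {}).get(term, term)


def matches(query, fields, index):
    """AND across query terms, OR across the fields that kept the term."""
    for term in query.split():
        clause = [(f, analyze(f, term)) for f in fields]
        clause = [(f, t) for f, t in clause if t is not None]
        if not clause:
            continue  # every field dropped the term: no clause at all
        if not any(t in index.get(f, set()) for f, t in clause):
            return False  # a required clause that nothing can satisfy
    return True


ALL_FIELDS = ["title_eng", "title_fin", "title_ger", "_all"]
NO_GERMAN = ["title_eng", "title_fin", "_all"]

print(matches("topics on vagueness", ALL_FIELDS, INDEX))  # False
print(matches("topics on vagueness", NO_GERMAN, INDEX))   # True
print(matches("topics vagueness", ALL_FIELDS, INDEX))     # True
```

In this model the "on" clause contains only title_ger:on, and because it is required while title_ger is empty, the whole query fails. That would match what I see, but it is only a guess.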
So is query #4 (dis_max) the way to go or will I run into problems
later?
Wow, that turned out to be long, thanks for reading!