How to query with multiple languages (field per language approach)


(Teemu Nuutinen) #1

Hi,

I have documents whose fields can be in different languages: in one
document field A might be English and field B German, while in another
document field A is Finnish and field B English. The documents also
have other fields that should be searchable without language analysis.
I would like to use language analyzers, so the field-per-language
approach seems like the best option.

In short, I'm not experienced enough with Elasticsearch, and there are
two things I don't understand:

  1. why tokens from the language-analyzed fields are not included in
    _all, and
  2. why there is a strange conflict when using query_string (multi
    field) to query the language-analyzed fields together with _all (see
    query 3 below).

Using the following example mapping and record (full sh/json at
https://gist.github.com/2029161#file_setup.sh)...

Properties:

  title:
    type: string
    index: analyzed
    boost: 2.0
  title_eng:
    type: string
    index: analyzed
    analyzer: english
  title_fin:
    type: string
    index: analyzed
    analyzer: finnish
  title_ger:
    type: string
    index: analyzed
    analyzer: german

Record:

  title: Topics on Vagueness
  title_eng: Topics on Vagueness

Tokens produced by the english analyzer: topic, vagu
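
For readers without access to the gist, here is a minimal sketch of the mapping and record above as JSON request bodies, built in Python. The type name "record" is a hypothetical placeholder (the post names neither the index nor the type), and the surrounding "mappings" wrapper is an assumption about how the gist creates the index.

```python
import json

# Hypothetical type name; the post does not say what the gist uses.
DOC_TYPE = "record"


def language_field(analyzer):
    """Mapping entry for an analyzed string field with a language analyzer."""
    return {"type": "string", "index": "analyzed", "analyzer": analyzer}


# Mapping from the "Properties" listing above.
mapping = {
    "mappings": {
        DOC_TYPE: {
            "properties": {
                "title": {"type": "string", "index": "analyzed", "boost": 2.0},
                "title_eng": language_field("english"),
                "title_fin": language_field("finnish"),
                "title_ger": language_field("german"),
            }
        }
    }
}

# The example record: only the English language field is populated.
record = {"title": "Topics on Vagueness", "title_eng": "Topics on Vagueness"}

print(json.dumps(mapping, indent=2))
print(json.dumps(record, indent=2))
```

Note that title_fin and title_ger are mapped but carry no data in this record, which matters for query 3 below.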

Then I ran some searches (full sh/json at
https://gist.github.com/2029161#file_queries.sh)...

  • Query 1: No results. I guess language-specific tokens can't be
    found in _all...

    text:
      _all: topic

  • Query 2: So now I have to query the language-analyzed fields
    separately. The record is found as expected.

    query_string:
      fields:
        - title_eng
        - title_fin
        - title_ger
        - _all
      default_operator: AND
      query: topic

  • Query 3: not found, what's going on?!

    query_string:
      fields:
        - title_eng
        - title_fin
        - title_ger
        - _all
      default_operator: AND
      query: topics on vagueness

  • Query 4: works as expected

    dis_max:
      queries:
        - text:
            _all:
              query: topics on vagueness
              operator: and
        - text:
            title_eng:
              query: topics on vagueness
              operator: and
        - text:
            title_fin:
              query: topics on vagueness
              operator: and
        - text:
            title_ger:
              query: topics on vagueness
              operator: and
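
Since queries 2 to 4 differ only in the search text and in how the fields are combined, the request bodies can be generated from one field list. The following is a rough sketch in Python, not the gist's actual code; the helper names query_string_body and dis_max_body are made up for illustration.

```python
import json

LANGUAGE_FIELDS = ["title_eng", "title_fin", "title_ger"]


def query_string_body(text, fields):
    """Queries 2 and 3: one query_string over several fields, terms ANDed."""
    return {
        "query": {
            "query_string": {
                "fields": list(fields),
                "default_operator": "AND",
                "query": text,
            }
        }
    }


def dis_max_body(text, fields):
    """Query 4: one `text` query per field, combined with dis_max."""
    clauses = [
        {"text": {field: {"query": text, "operator": "and"}}}
        for field in fields
    ]
    return {"query": {"dis_max": {"queries": clauses}}}


query_3 = query_string_body("topics on vagueness", LANGUAGE_FIELDS + ["_all"])
query_4 = dis_max_body("topics on vagueness", ["_all"] + LANGUAGE_FIELDS)

print(json.dumps(query_3, indent=2))
print(json.dumps(query_4, indent=2))
```

The structural difference is visible in the output: query 3 hands the whole field list to a single query, while query 4 analyzes and matches the full text against each field on its own.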

Now, I guess there is no way I can, or even should, attempt to find
stemmed versions in the _all field? (Finding at least "topic" would be
nice.)

In that case, given "one search box", the query options are either
query_string or dis_max. But why doesn't query #3 work with the
title_ger field? If I remove title_ger from the query, or even replace
it with a field that doesn't exist (foobar), the query works. If I
query "topics vagueness" without "on", it works. Unlike the English and
Finnish analyzers, the German analyzer produces an "on" token, which
seems to "break" the query, but there is no data in that field! I don't
get it.
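
One possible reading of these observations, offered as a guess and not confirmed anywhere in this post: if the multi-field query_string builds one required clause per query term, and each clause is satisfied when any field contains that term, then a term that survives analysis only in title_ger becomes a clause that only title_ger can satisfy. The toy Python model below tests that idea. Everything in it is an assumption: the hand-written stopword and stem tables stand in for the real analyzers, the default analyzer is assumed to drop "on", and _all is assumed to be analyzed with that default analyzer.

```python
# Toy model only: hand-written tables stand in for the real analyzers.
DEFAULT_STOPWORDS = {"on"}  # assumption: the default analyzer drops "on"
STOPWORDS = {
    "title_eng": {"on"},
    "title_fin": {"on"},  # per the post, finnish produces no "on" token
    "title_ger": set(),   # per the post, german keeps "on"
}
# Stems reported in the post for english; other stemmers are not modelled.
STEMS = {"title_eng": {"topics": "topic", "vagueness": "vagu"}}


def analyze(field, text):
    """Lowercase, split on whitespace, drop stopwords, apply known stems."""
    stopwords = STOPWORDS.get(field, DEFAULT_STOPWORDS)
    stems = STEMS.get(field, {})
    return [stems.get(w, w) for w in text.lower().split() if w not in stopwords]


def index_document(doc):
    """Analyze each field on its own; _all re-analyzes the raw text of all fields."""
    indexed = {field: set(analyze(field, value)) for field, value in doc.items()}
    indexed["_all"] = set(analyze("_all", " ".join(doc.values())))
    return indexed


def per_term_match(indexed, fields, text):
    """Assumed query_string behaviour: each term must match in at least one field."""
    for word in text.split():
        clauses = [(f, tok) for f in fields for tok in analyze(f, word)]
        if not clauses:
            continue  # every analyzer dropped the term, so it adds no clause
        if not any(tok in indexed.get(f, set()) for f, tok in clauses):
            return False
    return True


def per_field_match(indexed, fields, text):
    """dis_max of text queries: some single field must contain all its tokens."""
    for f in fields:
        tokens = analyze(f, text)
        if tokens and all(tok in indexed.get(f, set()) for tok in tokens):
            return True
    return False


indexed = index_document(
    {"title": "Topics on Vagueness", "title_eng": "Topics on Vagueness"}
)
FIELDS = ["title_eng", "title_fin", "title_ger", "_all"]

print(per_field_match(indexed, ["_all"], "topic"))              # query 1: False
print(per_term_match(indexed, FIELDS, "topic"))                 # query 2: True
print(per_term_match(indexed, FIELDS, "topics on vagueness"))   # query 3: False
print(per_field_match(indexed, FIELDS, "topics on vagueness"))  # query 4: True
```

Under these assumptions the model reproduces every result described above, including the cases where title_ger is removed or swapped for a nonexistent field, and it also suggests why query 1 fails: _all would hold the unstemmed tokens "topics" and "vagueness", so the stem "topic" is never there. That agreement does not prove this is what Elasticsearch actually does.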

So is query #4 (dis_max) the way to go or will I run into problems
later?

Wow, that turned out to be long, thanks for reading!

