Adding support for multi-language partial-matching queries

Hi all,

I'm trying to add support for partial-matching search on certain fields that may contain text in multiple languages. Specifically, I currently lack support for Japanese, but IIUC the same applies to Cyrillic and Chinese.
The current analyzers rely mainly on whitespace to split text into tokens, so I guess that's why partial matching doesn't work for languages written in those scripts.
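
To illustrate the guess above, here is a minimal sketch in plain Python (not the search engine's actual analyzer code) comparing whitespace splitting with overlapping character bigrams. The function names and the sample strings are made up for illustration; the bigram approach is just one common way to get partial matches on text without spaces.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Split on whitespace only, roughly what a whitespace-based analyzer does."""
    return text.split()


def char_bigrams(text: str) -> list[str]:
    """Emit overlapping two-character tokens, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # Too short for a bigram: return the single character, if any.
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


english = "partial matching search"
japanese = "東京都に住んでいます"  # contains no whitespace at all

# English splits into separate words that can be matched individually.
print(whitespace_tokens(english))

# Japanese comes back as one long token, so a query for "東京" finds nothing.
print(whitespace_tokens(japanese))

# Bigrams produce short overlapping tokens that a partial query can hit.
print(char_bigrams(japanese))
```

With whitespace splitting the whole Japanese sentence is a single token, while the bigram version contains "東京" as its own token, which is why a bigram-style analyzer can serve partial matches at the cost of a larger index.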

If I understand correctly - I'll have to implement something like what's mentioned here:

I have a few questions about it:

  1. Is this really what I need in order to allow partial-matching queries for Japanese, for instance? (I mean, if it does more than I need, maybe I can do less and save on storage and index/query performance.)
  2. Is it possible to apply a specific analyzer (of a specific sub-field) only to indexed documents in a specific language? Say Japanese accounts for ~5% of my document traffic: do I have to have a dedicated sub-field (with dedicated analyzers) that takes up additional storage for all the non-Japanese documents as well, or can that storage be saved somehow?

Many thanks in advance!
Shachar

In case it might help anyone else stumbling upon this:
I've decided to perform the selective insert into a new field within our application, i.e. the application will determine whether the field's language is one of the CJK languages and, if so, duplicate the value into a new field whose analyzers match that language.
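
A rough sketch of that application-side step could look like the following. Note the assumptions: it uses a simple Unicode-range check instead of a real language detector, purely to stay dependency-free, and the field names `title` and `title_cjk`, the 0.3 threshold, and the helper names are all hypothetical.

```python
# Unicode blocks treated as "CJK" for this sketch (an assumption, not exhaustive).
CJK_RANGES = (
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk_char(ch: str) -> bool:
    """Return True if the character falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def looks_cjk(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: at least `threshold` of the letters are CJK characters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    cjk_count = sum(1 for c in letters if is_cjk_char(c))
    return cjk_count / len(letters) >= threshold


def build_document(title: str) -> dict:
    """Build the document to index; add the extra field only for CJK text."""
    doc = {"title": title}
    if looks_cjk(title):
        # Hypothetical field that would be mapped with a CJK-aware analyzer.
        doc["title_cjk"] = title
    return doc


print(build_document("partial matching search"))
print(build_document("東京都に住んでいます"))
```

Since the extra field is only set on documents that look like CJK text, the other documents simply omit it, which is the storage saving I was after in question 2.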

I will soon update this thread with the technical details (the script I used to update the mapping, and maybe a snippet of the application's language-detection logic, which uses the langdetect Python library).
