When you create a language-specific engine, App Search applies language-specific analyzers to the index that contains the engine's documents.
As an example, I created an engine called japanese-test with data from your article. Its documents are stored in an index called .ent-search-engine-documents-japanese-test. The index has the following analysis settings:
{
  "filter" : {
    "front_ngram" : {
      "type" : "edge_ngram",
      "min_gram" : "1",
      "max_gram" : "12"
    },
    "bigram_joiner" : {
      "max_shingle_size" : "2",
      "token_separator" : "",
      "output_unigrams" : "false",
      "type" : "shingle"
    },
    "bigram_max_size" : {
      "type" : "length",
      "max" : "16",
      "min" : "0"
    },
    "bigram_joiner_unigrams" : {
      "max_shingle_size" : "2",
      "token_separator" : "",
      "output_unigrams" : "true",
      "type" : "shingle"
    },
    "delimiter" : {
      "split_on_numerics" : "true",
      "generate_word_parts" : "true",
      "preserve_original" : "false",
      "catenate_words" : "true",
      "generate_number_parts" : "true",
      "catenate_all" : "true",
      "split_on_case_change" : "true",
      "type" : "word_delimiter_graph",
      "catenate_numbers" : "true",
      "stem_english_possessive" : "true"
    },
    "ja-stop-words-filter" : {
      "type" : "stop",
      "stopwords" : "_english_"
    },
    "ja-stem-filter" : {
      "name" : "light_english",
      "type" : "stemmer"
    }
  },
  "analyzer" : {
    "i_prefix" : {
      "filter" : [
        "cjk_width",
        "lowercase",
        "asciifolding",
        "front_ngram"
      ],
      "tokenizer" : "standard"
    },
    "iq_text_delimiter" : {
      "filter" : [
        "delimiter",
        "cjk_width",
        "lowercase",
        "asciifolding",
        "ja-stop-words-filter",
        "ja-stem-filter",
        "cjk_bigram"
      ],
      "tokenizer" : "whitespace"
    },
    "q_prefix" : {
      "filter" : [
        "cjk_width",
        "lowercase",
        "asciifolding"
      ],
      "tokenizer" : "standard"
    },
    "iq_text_base" : {
      "filter" : [
        "cjk_width",
        "lowercase",
        "asciifolding",
        "ja-stop-words-filter"
      ],
      "tokenizer" : "standard"
    },
    "iq_text_stem" : {
      "filter" : [
        "cjk_width",
        "lowercase",
        "asciifolding",
        "ja-stop-words-filter",
        "ja-stem-filter",
        "cjk_bigram"
      ],
      "tokenizer" : "standard"
    },
    "i_text_bigram" : {
      "filter" : [
        "cjk_width",
        "lowercase",
        "asciifolding",
        "ja-stem-filter",
        "bigram_joiner",
        "bigram_max_size"
      ],
      "tokenizer" : "standard"
    },
    "q_text_bigram" : {
      "filter" : [
        "cjk_width",
        "lowercase",
        "asciifolding",
        "ja-stem-filter",
        "bigram_joiner_unigrams",
        "bigram_max_size"
      ],
      "tokenizer" : "standard"
    }
  }
}
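To see how one of these analyzers handles Japanese text, you can run it through the _analyze API (a quick sketch, assuming you have direct access to the Elasticsearch index behind the engine; the sample text is only an illustration):

GET .ent-search-engine-documents-japanese-test/_analyze
{
  "analyzer" : "iq_text_stem",
  "text" : "日本の記事"
}

With the standard tokenizer and the cjk_bigram filter, the text is split into overlapping CJK bigrams (roughly 日本, 本の, の記, 記事) rather than dictionary-based words, so a two-character query like 日本 is matched as a single bigram token.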
Here are the mappings for my_field:
{
  "type" : "text",
  "fields" : {
    "date" : {
      "type" : "date",
      "format" : "strict_date_time||strict_date",
      "ignore_malformed" : true
    },
    "delimiter" : {
      "type" : "text",
      "index_options" : "freqs",
      "analyzer" : "iq_text_delimiter"
    },
    "enum" : {
      "type" : "keyword",
      "ignore_above" : 2048
    },
    "float" : {
      "type" : "double",
      "ignore_malformed" : true
    },
    "joined" : {
      "type" : "text",
      "index_options" : "freqs",
      "analyzer" : "i_text_bigram",
      "search_analyzer" : "q_text_bigram"
    },
    "location" : {
      "type" : "geo_point",
      "ignore_malformed" : true,
      "ignore_z_value" : false
    },
    "prefix" : {
      "type" : "text",
      "index_options" : "docs",
      "analyzer" : "i_prefix",
      "search_analyzer" : "q_prefix"
    },
    "stem" : {
      "type" : "text",
      "analyzer" : "iq_text_stem"
    }
  },
  "index_options" : "freqs",
  "analyzer" : "iq_text_base"
}
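If you want to verify this on your own deployment, you can pull the same information directly from the underlying index, replacing the index name with the one created for your engine (this assumes you can reach the Elasticsearch cluster behind App Search, for example through the Kibana Dev Tools console):

GET .ent-search-engine-documents-japanese-test/_settings
GET .ent-search-engine-documents-japanese-test/_mapping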
You can already see that the analysis settings in App Search are not the same as in your article. Moreover, the queries used in the article are not simple search queries: they aggregate the search results and sort them by number of occurrences. That is why, when you search for 日本, it comes up on top: there are 6 documents with that exact text. In App Search, when you use the query tester to search for a substring, it does not do aggregations.
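I don't have the exact query from the article in front of me, but an aggregation of that kind would look roughly like this against the engine's index (the aggregation name, field, and size values here are only placeholders):

GET .ent-search-engine-documents-japanese-test/_search
{
  "size" : 0,
  "query" : {
    "match" : { "my_field" : "日本" }
  },
  "aggs" : {
    "top_values" : {
      "terms" : {
        "field" : "my_field.enum",
        "size" : 10
      }
    }
  }
}

A terms aggregation sorts its buckets by document count by default, which gives the sorted-by-number-of-occurrences view from the article, whereas the query tester only runs the search part and returns ranked documents.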
I hope this helps.