Different search behavior between ascii code and multibyte character

YuWatanabe · December 26, 2016, 9:41am

Elastic version: Elasticsearch 5.1.1

I am testing match query results for ASCII text and multibyte characters to decide which analyzer to use for our upcoming deployment. I would like to know whether the behavior below is the default. Both cases use the whitespace analyzer.

When plain ASCII text is indexed, a search returns the documents whose terms in the inverted index match the search terms. So if I index,

"This is a pen"

This will be analyzed as below.

{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "pen",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 3
    }
  ]
}
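As a rough illustration of the output above, whitespace tokenization can be modeled in a few lines of Python. This is only a sketch of the idea, not Lucene's implementation, and the function name `whitespace_tokens` is made up for this example. It splits on runs of whitespace, keeps the original case, and records character offsets and positions.

```python
import re

def whitespace_tokens(text):
    """Toy model of a whitespace tokenizer (hypothetical helper, not an Elasticsearch API)."""
    tokens = []
    # \S+ matches each maximal run of non-whitespace characters.
    for position, m in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": m.group(),        # case is preserved, no lowercasing
            "start_offset": m.start(),
            "end_offset": m.end(),
            "type": "word",
            "position": position,
        })
    return tokens

for tok in whitespace_tokens("This is a pen"):
    print(tok)
```

Because no lowercasing is applied in this model, a search term has to match the indexed token exactly, including case.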

I can query as

GET sample2/sample/_search
{
  "query" : {
    "match" : {
      "text" : "This"
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "sample2",
        "_type": "sample",
        "_id": "AVk6XNwSbyuCzhs6qfpG",
        "_score": 0.2876821,
        "_source": {
          "text": "This is a pen"
        }
      }
    ]
  }
}

But when I index Japanese multibyte characters,

関西国際空港に着陸しました。,

the default behavior appears to be a partial match.

GET sample/sample/_search
{
  "query" : {
    "match" : {
      "text" : "関西"
    }
  }
}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.98382175,
    "hits": [
      {
        "_index": "sample",
        "_type": "sample",
        "_id": "AVk6WrlobyuCzhs6qfpE",
        "_score": 0.98382175,
        "_source": {
          "text": "関西国際空港に着陸しました。"
        }
      }
    ]
  }
}

Is this the default behavior for multibyte characters, specifically for Japanese?
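One way to reason about this is to check what pure whitespace tokenization would do with the Japanese sentence. The sketch below is a simplified model, assuming a match query hits when at least one analyzed query term equals an indexed term. The helper names `whitespace_terms` and `match_any` are hypothetical. Since the sentence contains no whitespace, it stays a single term, so a hit for 関西 suggests that some other analyzer produced the indexed terms.

```python
def whitespace_terms(text):
    # str.split() with no arguments splits on any run of whitespace.
    return text.split()

def match_any(query, document):
    """Toy match query: True if any query term equals an indexed term."""
    indexed = set(whitespace_terms(document))
    return any(term in indexed for term in whitespace_terms(query))

print(match_any("This", "This is a pen"))            # True
print(match_any("関西", "関西国際空港に着陸しました。"))  # False: the sentence is one term
```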

YuWatanabe · December 26, 2016, 10:02am

This partial match also occurs when I use the kuromoji_tokenizer in normal mode.

I created the index with the settings below.

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "normal",
            "discard_punctuation": "false"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

Index the document as below.

POST kuromoji_sample/sample
{
  "text" : "関西国際空港に着陸しました"
}

The match query returns a partial match even though I do not use a wildcard.

GET kuromoji_sample/_search
{
  "query" : {
    "match" : {
      "text" : "関西"
    }
  }
}

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5257321,
    "hits": [
      {
        "_index": "kuromoji_sample",
        "_type": "sample",
        "_id": "AVk6kZBDbyuCzhs6qfpa",
        "_score": 0.5257321,
        "_source": {
          "text": "関西国際空港に着陸しました"
        }
      }
    ]
  }
}

jprante · December 26, 2016, 12:47pm

Please study how analysis works with the _analyze endpoint.

https://www.elastic.co/guide/en/elasticsearch/reference/5.x/indices-analyze.html

If you wonder why Elasticsearch/Lucene tokenizes this way by default, the reason is that the standard tokenizer follows the Unicode segmentation rules http://unicode.org/reports/tr29/ just like the icu_tokenizer of the ICU plugin. For the difference, see https://www.elastic.co/guide/en/elasticsearch/guide/current/icu-tokenizer.html

For example, using icu_tokenizer (which should be preferred for East Asian languages, where correct tokenization is not easy) you can use

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
             "tokenizer" : {
                 "my_icu" : {
                     "type" : "icu_tokenizer"
                 }
             },
            "analyzer": {
               "my_analyzer": {
                  "type": "custom",
                  "tokenizer" : "my_icu"
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "analyzer": "my_analyzer"
            }
         }
      }
   }
}

PUT /test/docs/1
{
    "text" : "関西国際空港に着陸しました"
}

GET /test/_analyze
{
    "analyzer" : "my_analyzer",
    "text" : "関西国際空港に着陸しました"
}

The result shows the following ideographic tokens

{
   "tokens": [
      {
         "token": "関西",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<IDEOGRAPHIC>",
         "position": 0
      },
      {
         "token": "国際",
         "start_offset": 2,
         "end_offset": 4,
         "type": "<IDEOGRAPHIC>",
         "position": 1
      },
      {
         "token": "空港",
         "start_offset": 4,
         "end_offset": 6,
         "type": "<IDEOGRAPHIC>",
         "position": 2
      },
      {
         "token": "に",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<IDEOGRAPHIC>",
         "position": 3
      },
      {
         "token": "着陸",
         "start_offset": 7,
         "end_offset": 9,
         "type": "<IDEOGRAPHIC>",
         "position": 4
      },
      {
         "token": "しま",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<IDEOGRAPHIC>",
         "position": 5
      },
      {
         "token": "した",
         "start_offset": 11,
         "end_offset": 13,
         "type": "<IDEOGRAPHIC>",
         "position": 6
      }
   ]
}

関西 is not a partial match. It is "Kansai", the name of a region in Japan, and it is indexed as a token of its own, so searching for it and matching it makes perfect sense.
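To make the point concrete, here is a toy inverted index built from the tokens in the _analyze output above. It is only a sketch: a real Lucene index stores far more than a term-to-document map, and `lookup` is a hypothetical helper that takes an already analyzed term. A lookup succeeds only when the whole term was indexed, so it is not a substring match.

```python
from collections import defaultdict

# Tokens copied from the _analyze output above, indexed as document 1.
icu_tokens = ["関西", "国際", "空港", "に", "着陸", "しま", "した"]

# Map each term to the set of document ids that contain it.
inverted_index = defaultdict(set)
for term in icu_tokens:
    inverted_index[term].add(1)

def lookup(term):
    """Return the ids of documents whose token list contains exactly this term."""
    return inverted_index.get(term, set())

print(lookup("関西"))  # {1}: the query term equals an indexed term
print(lookup("関"))    # set(): a fragment of a token is not an indexed term
```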

YuWatanabe · December 27, 2016, 8:37am

@jprante

Thanks for the reply.

After reading your reply, I found that my field setting was wrong. I was not specifying an analyzer for my field, so it ended up using the default analyzer (the standard analyzer).

At index time, if no analyzer has been specified, it looks for an analyzer in the index settings called default. Failing that, it defaults to using the standard analyzer.

After specifying my_analyzer as the field-level analyzer, the analyzer was applied correctly.
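The index-time fallback order quoted above can be sketched as a small function. This is a simplified model of that one sentence from the documentation, not Elasticsearch code. The function name `resolve_index_analyzer` and the dictionary shapes are assumptions made for illustration.

```python
def resolve_index_analyzer(field_mapping, index_analyzers):
    """Pick the index-time analyzer name following the quoted fallback order.

    field_mapping: dict for one field, e.g. {"type": "text", "analyzer": "my_analyzer"}
    index_analyzers: dict of analyzers defined in the index settings
    """
    # 1. An analyzer set on the field itself wins.
    if "analyzer" in field_mapping:
        return field_mapping["analyzer"]
    # 2. Otherwise use an analyzer named "default" from the index settings, if present.
    if "default" in index_analyzers:
        return "default"
    # 3. Otherwise fall back to the standard analyzer.
    return "standard"

settings = {"my_analyzer": {"type": "custom", "tokenizer": "kuromoji_user_dict"}}

# Defining my_analyzer in the settings alone does not apply it to the field.
print(resolve_index_analyzer({"type": "text"}, settings))                             # standard
print(resolve_index_analyzer({"type": "text", "analyzer": "my_analyzer"}, settings))  # my_analyzer
```

This matches what happened in the kuromoji_sample index: the custom analyzer existed in the settings, but the field never referenced it.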

system · January 24, 2017, 8:38am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
