Different search behavior between ASCII and multibyte characters


(Yu Watanabe) #1

Elasticsearch version: 5.1.1

I am comparing match query results for ASCII text and multibyte text to decide which analyzer to use for our upcoming deployment. I would like to know whether the behavior below is the default. Both cases use the whitespace analyzer.

When plain ASCII text is indexed, a search returns the documents whose terms in the inverted index match the search terms. So if I index

"This is a pen"

it is analyzed as follows.

{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "is",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "a",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "pen",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 3
    }
  ]
}
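The token list above can be reproduced with a small sketch. This is only an illustration of how whitespace tokenization assigns offsets and positions, not Lucene's implementation; the function name whitespace_analyze is hypothetical.

```python
import re

def whitespace_analyze(text):
    """Mimic the shape of an _analyze response for whitespace tokenization.

    Offsets here are Python string indexes (code points). For ASCII input
    they coincide with the offsets Elasticsearch reports.
    """
    tokens = []
    for position, m in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": m.group(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "type": "word",
            "position": position,
        })
    return {"tokens": tokens}

result = whitespace_analyze("This is a pen")
for t in result["tokens"]:
    print(t["token"], t["start_offset"], t["end_offset"], t["position"])
```

Note that the tokens keep their case: a whitespace-only analysis chain does not lowercase, so "This" and "this" are different terms.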

I can query it as follows.

GET sample2/sample/_search
{
  "query" : {
    "match" : {
      "text" : "This"
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "sample2",
        "_type": "sample",
        "_id": "AVk6XNwSbyuCzhs6qfpG",
        "_score": 0.2876821,
        "_source": {
          "text": "This is a pen"
        }
      }
    ]
  }
}

But when I index Japanese multibyte text such as

関西国際空港に着陸しました。

the default behavior appears to be a partial match.

GET sample/sample/_search
{
  "query" : {
    "match" : {
      "text" : "関西"
    }
  }
}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.98382175,
    "hits": [
      {
        "_index": "sample",
        "_type": "sample",
        "_id": "AVk6WrlobyuCzhs6qfpE",
        "_score": 0.98382175,
        "_source": {
          "text": "関西国際空港に着陸しました。"
        }
      }
    ]
  }
}

Is this the default behavior for multibyte characters, and for Japanese in particular?


(Yu Watanabe) #2

This partial match also occurs when I use a custom analyzer based on kuromoji_tokenizer in normal mode.

I created the index with the settings below.

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "normal",
            "discard_punctuation": "false"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

I indexed the document as below.

POST kuromoji_sample/sample
{
  "text" : "関西国際空港に着陸しました"
}

The match query returns a partial match even though I do not use a wildcard.

GET kuromoji_sample/_search
{
  "query" : {
    "match" : {
      "text" : "関西"
    }
  }
}

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5257321,
    "hits": [
      {
        "_index": "kuromoji_sample",
        "_type": "sample",
        "_id": "AVk6kZBDbyuCzhs6qfpa",
        "_score": 0.5257321,
        "_source": {
          "text": "関西国際空港に着陸しました"
        }
      }
    ]
  }
}

(Jörg Prante) #3

Please study how analysis works with the _analyze endpoint.

https://www.elastic.co/guide/en/elasticsearch/reference/5.x/indices-analyze.html

If you wonder why Elasticsearch/Lucene handles this tokenization by default, the reason is that the standard tokenizer follows the Unicode text segmentation rules (http://unicode.org/reports/tr29/), just like the icu_tokenizer of the ICU plugin. For the differences, see https://www.elastic.co/guide/en/elasticsearch/guide/current/icu-tokenizer.html
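To make the effect concrete, here is a rough Python model of why a match query for 関西 can hit the document under the standard analyzer but would not under a whitespace analyzer. It is a simplification under stated assumptions, not Lucene code: it emits every Han ideograph and, as an assumption of this sketch, every hiragana character as its own token, skips lowercasing, and models the match query as a plain OR over the query terms (the default operator). All function names are hypothetical.

```python
def whitespace_tokens(text):
    # Split on whitespace only; unspaced Japanese text stays one token.
    return text.split()

def _is_single_char_script(ch):
    # Assumption of this sketch: CJK Unified Ideographs and Hiragana
    # are emitted one character per token.
    return "\u4e00" <= ch <= "\u9fff" or "\u3040" <= ch <= "\u309f"

def standard_like_tokens(text):
    tokens, buf = [], ""
    for ch in text:
        if _is_single_char_script(ch):
            if buf:
                tokens.append(buf)
                buf = ""
            tokens.append(ch)
        elif ch.isalnum():
            buf += ch          # keep runs of other letters/digits together
        else:
            if buf:            # punctuation and spaces end the current token
                tokens.append(buf)
                buf = ""
    if buf:
        tokens.append(buf)
    return tokens

def match_or(query, doc, tokenize):
    # Simplified match query: a hit if ANY analyzed query term is in the doc.
    doc_terms = set(tokenize(doc))
    return any(term in doc_terms for term in tokenize(query))

doc = "関西国際空港に着陸しました。"
print(whitespace_tokens(doc))                        # one token: the whole sentence
print(standard_like_tokens(doc))                     # one token per character
print(match_or("関西", doc, whitespace_tokens))      # False
print(match_or("関西", doc, standard_like_tokens))   # True
```

In this model the query 関西 becomes the two terms 関 and 西, and either one is enough for a hit. With whitespace tokenization the whole sentence is a single term, so a hit on 関西 suggests the field is not actually being analyzed with the whitespace analyzer.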

For example, with icu_tokenizer (which should be preferred for East Asian languages, where correct tokenization is not easy), you can use

PUT /test
{
   "settings": {
      "index": {
         "analysis": {
             "tokenizer" : {
                 "my_icu" : {
                     "type" : "icu_tokenizer"
                 }
             },
            "analyzer": {
               "my_analyzer": {
                  "type": "custom",
                  "tokenizer" : "my_icu"
               }
            }
         }
      }
   },
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "analyzer": "my_analyzer"
            }
         }
      }
   }
}

PUT /test/docs/1
{
    "text" : "関西国際空港に着陸しました"
}

GET /test/_analyze
{
    "analyzer" : "my_analyzer",
    "text" : "関西国際空港に着陸しました"
}

The result shows the following ideographic tokens

{
   "tokens": [
      {
         "token": "関西",
         "start_offset": 0,
         "end_offset": 2,
         "type": "<IDEOGRAPHIC>",
         "position": 0
      },
      {
         "token": "国際",
         "start_offset": 2,
         "end_offset": 4,
         "type": "<IDEOGRAPHIC>",
         "position": 1
      },
      {
         "token": "空港",
         "start_offset": 4,
         "end_offset": 6,
         "type": "<IDEOGRAPHIC>",
         "position": 2
      },
      {
         "token": "に",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<IDEOGRAPHIC>",
         "position": 3
      },
      {
         "token": "着陸",
         "start_offset": 7,
         "end_offset": 9,
         "type": "<IDEOGRAPHIC>",
         "position": 4
      },
      {
         "token": "しま",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<IDEOGRAPHIC>",
         "position": 5
      },
      {
         "token": "した",
         "start_offset": 11,
         "end_offset": 13,
         "type": "<IDEOGRAPHIC>",
         "position": 6
      }
   ]
}

関西 is not a partial match; it is "Kansai", the name of a region in Japan, so searching for it and matching it makes perfect sense.


(Yu Watanabe) #4

@jprante

Thanks for the reply.

After reading your reply, I found that my field setting was wrong. I had not specified an analyzer for my field, so it ended up using the default analyzer (the standard analyzer).

https://www.elastic.co/guide/en/elasticsearch/reference/5.1/analysis.html

At index time, if no analyzer has been specified, it looks for an analyzer in the index settings called default. Failing that, it defaults to using the standard analyzer.

After I specified my_analyzer as the field-level analyzer, the analyzer was applied correctly.
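For anyone hitting the same problem, the fix can be sketched as follows. This is a hedged reconstruction, not the exact request that was used: it takes the kuromoji_sample settings from post #2 and adds a field-level analyzer in the 5.x mapping style shown in post #3, assuming the type name "sample" and the field name "text". The helper text_fields_without_analyzer is hypothetical and only flags text fields that would fall back to the default analyzer.

```python
import json

# Body for "PUT kuromoji_sample": settings from post #2 plus a mapping
# that attaches my_analyzer to the "text" field (assumed type: "sample").
index_body = {
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "kuromoji_user_dict": {
                        "type": "kuromoji_tokenizer",
                        "mode": "normal",
                        "discard_punctuation": "false",
                    }
                },
                "analyzer": {
                    "my_analyzer": {
                        "type": "custom",
                        "tokenizer": "kuromoji_user_dict",
                    }
                },
            }
        }
    },
    "mappings": {
        "sample": {
            "properties": {
                "text": {"type": "text", "analyzer": "my_analyzer"}
            }
        }
    },
}

def text_fields_without_analyzer(body):
    """List text fields with no explicit analyzer (they use the default)."""
    missing = []
    for type_name, mapping in body.get("mappings", {}).items():
        for field, spec in mapping.get("properties", {}).items():
            if spec.get("type") == "text" and "analyzer" not in spec:
                missing.append(f"{type_name}.{field}")
    return missing

print(json.dumps(index_body, indent=2, ensure_ascii=False))
print(text_fields_without_analyzer(index_body))
```

Defining an analyzer under settings only makes it available; a field uses it only when the mapping names it, or when it is registered under the name default.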

