Unicode characters in documents are not properly loaded in integ tests

arsenmk · May 16, 2019, 12:43am

Hi there,

I have an integ test derived from ESIntegTestCase where I index a document with Chinese characters in some fields, e.g.:

POST /account/account
{
    "account_number" : 6666,
    "balance" : 1515,
    "firstname" : "盛虹",
    "lastname" : "Last",
    "age" : 32,
    "gender" : "M",
    "address" : "4587 Some Corridor",
    "employer" : "Some company",
    "email" : "someone@gmail.com",
    "city" : "Beijing",
    "state" : "CN"
}

And then I just search for this document by account_number:

GET account/_search
{
    "query": {
        "term": {
            "account_number": {
                "value": "6666",
                "boost": 1.0
            }
        }
    }
}

The result is the following (note the firstname field value):

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "account",
        "_type" : "account",
        "_id" : "iyb5vWoBQx5pa0FSu3t7",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 6666,
          "balance" : 1515,
          "firstname" : "τ\u203a\u203aΦÖ╣",
          "lastname" : "Last",
          "age" : 32,
          "gender" : "M",
          "address" : "4587 Some Corridor",
          "employer" : "Some company",
          "email" : "someone@gmail.com",
          "city" : "Beijing",
          "state" : "CN"
        }
      }
    ]
  }
}

The interesting part is that I get this result only on my Windows machine. On Linux the test passes, meaning I get back exactly the value I stored. Even more interesting, when I run OSS Elasticsearch manually on my Windows machine and repeat the steps from the test by hand, the search result contains the correct value.

I am stuck and not sure how to proceed. I tried removing the file.encoding option from jvm.options in my local cluster (on Windows), but that doesn't change anything. Initially I suspected this had something to do with the Windows filesystem encoding, but then why does the manual test work fine? Perhaps my local cluster comes with settings that override something the integ test cluster does not.

Here is the mapping for the account index used in the test:

{
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "account" : {
      "properties" : {
        "gender" : {
          "type" : "text",
          "fielddata" : true
        },
        "address" : {
          "type" : "text",
          "fielddata" : true
        },
        "state" : {
          "type" : "text",
          "fielddata" : true,
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          } 
        }
      }
    }
  }
}

arsenmk · May 16, 2019, 9:13pm

Alright, never mind, I found the problem. I was opening a FileStream to read the documents to index and did not specify the encoding to use, so the result depended on the environment I ran it in. I explicitly specified UTF_8 and the test passes now.
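
For anyone landing here with the same symptom, here is a minimal sketch of the kind of fix described above. It is not the original test code: the class name FixtureEncodingDemo, the helper readAll, and the temporary fixture file are all hypothetical, and ISO-8859-1 only stands in for whatever single-byte default charset a given Windows JVM might use. The sketch writes a UTF-8 document to a file and reads it back twice, once with an explicit UTF-8 charset and once with the stand-in charset, to show that only the explicit UTF-8 read returns the stored value.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

// Hypothetical demo class; not taken from the original test.
final class FixtureEncodingDemo {

    // Reads the whole stream as text with an explicitly chosen charset and
    // closes the stream. Passing the charset here is the point of the fix:
    // new InputStreamReader(in) without it uses the JVM default charset,
    // which historically differed between Windows and Linux.
    static String readAll(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // The firstname value from the thread, written with Unicode escapes so
        // that compiling this file does not itself depend on the source encoding.
        String doc = "{\"firstname\" : \"\u76db\u8679\"}";

        Path fixture = Files.createTempFile("account", ".json");
        try {
            Files.write(fixture, doc.getBytes(StandardCharsets.UTF_8));

            // Explicit UTF-8: the document comes back unchanged.
            String good = readAll(Files.newInputStream(fixture), StandardCharsets.UTF_8);
            System.out.println("UTF-8 round-trips: " + doc.equals(good)); // true

            // Assumed single-byte charset: each UTF-8 byte becomes its own character.
            String bad = readAll(Files.newInputStream(fixture), StandardCharsets.ISO_8859_1);
            System.out.println("ISO-8859-1 round-trips: " + doc.equals(bad)); // false
        } finally {
            Files.deleteIfExists(fixture);
        }
    }
}
```

The same idea applies to any reader that takes an optional charset argument: pass StandardCharsets.UTF_8 explicitly instead of relying on the platform default, so the test behaves the same on every machine.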
