Unicode characters in documents are not properly loaded in integ tests

#1

Hi there,

I have an integ test derived from ESIntegTestCase where I index a document with Chinese characters in some fields, e.g.:

POST /account/account
{
    "account_number" : 6666,
    "balance" : 1515,
    "firstname" : "盛虹",
    "lastname" : "Last",
    "age" : 32,
    "gender" : "M",
    "address" : "4587 Some Corridor",
    "employer" : "Some company",
    "email" : "someone@gmail.com",
    "city" : "Beijing",
    "state" : "CN"
}

And then I just search for this document by account_number:

GET account/_search
{
    "query": {
        "term": {
            "account_number": {
                "value": "6666",
                "boost": 1.0
            }
        }
    }
}

The result is the following (note the firstname field value):

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "account",
        "_type" : "account",
        "_id" : "iyb5vWoBQx5pa0FSu3t7",
        "_score" : 1.0,
        "_source" : {
          "account_number" : 6666,
          "balance" : 1515,
          "firstname" : "τ\u203a\u203aΦÖ╣",
          "lastname" : "Last",
          "age" : 32,
          "gender" : "M",
          "address" : "4587 Some Corridor",
          "employer" : "Some company",
          "email" : "someone@gmail.com",
          "city" : "Beijing",
          "state" : "CN"
        }
      }
    ]
  }
}

The interesting part is that I only get this result on my Windows machine. On Linux the test passes, meaning I get back the same value I stored. Even more interesting: when I run OSS Elasticsearch manually on my Windows machine and perform the test's steps by hand, the search result contains the correct value.

I am stuck and not sure how to proceed. I tried removing the file.encoding option from jvm.options in my local cluster (on Windows), but that doesn't change anything. Initially I suspected this had something to do with Windows filesystem encoding, but then why does the manual test work fine? Perhaps my local cluster comes with settings that override something the integ test cluster does not.

Here is the mapping for the account index used in the test:

{
  "settings" : {
    "number_of_shards" : 1
  },
  "mappings" : {
    "account" : {
      "properties" : {
        "gender" : {
          "type" : "text",
          "fielddata" : true
        },
        "address" : {
          "type" : "text",
          "fielddata" : true
        },
        "state" : {
          "type" : "text",
          "fielddata" : true,
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          } 
        }
      }
    }
  }
}

#2

Alright, never mind, found the problem. I was opening a FileStream to read the documents to index and did not specify the encoding to use, so the platform's default encoding was applied. That's why the result depended on which environment I ran the test in. I explicitly specified UTF_8 and the test passes now.
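
For anyone who lands here with the same symptom, below is a minimal sketch of the fix, assuming the documents are read in Java through an InputStreamReader. The original test code is not shown in this thread, so the class name Utf8FixtureReader, the read helper, and the temp file are all hypothetical. ISO_8859_1 is used only as a stand-in for "some non-UTF-8 platform default"; the actual default on the Windows machine above is not known.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;

// Hypothetical helper, not the code from the original test.
public class Utf8FixtureReader {

    // Reads the whole file, decoding its bytes with the charset passed in.
    // Passing the charset explicitly avoids depending on the platform default.
    static String read(Path file, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws IOException {
        // "\u76db\u8679" is the firstname value from the document above,
        // written as escapes so the source file encoding does not matter.
        String doc = "{\"firstname\" : \"\u76db\u8679\"}";
        Path file = Files.createTempFile("accounts", ".json");
        Files.write(file, doc.getBytes(StandardCharsets.UTF_8));

        // Decoding with UTF-8 gives back exactly what was written.
        System.out.println(read(file, StandardCharsets.UTF_8).equals(doc));

        // Decoding the same bytes with a different charset garbles the name.
        System.out.println(read(file, StandardCharsets.ISO_8859_1).equals(doc));

        Files.delete(file);
    }
}
```

The point of the sketch is that the bytes on disk are the same in both calls; only the charset used for decoding differs. A reader created without a charset argument may pick up whatever default the environment provides, which would explain a test that passes on one OS and fails on another.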