ESRally - data.json.bz2

sanketshinde · September 14, 2017, 2:16pm

I am generating a dummy data file, fakeData.json, with a Python script, docGen.py, to use as the data file for a custom Rally track. When I try to run the race, I get the following error:

[ERROR] Cannot race. ('Could not execute benchmark', UnicodeDecodeError('utf-8', b'BZh91AY&SY\x97\xc7e\xbe\x00\x14/_\x80\x10P\x07\x7f\xf0?\xff\xff\xf0\xbf\xef\xffja\xf6\xf9M\xb8\xc9:\x91\xa2\xa6\xde\xbcUU\x1f6-eR\xda\x9c\xbd\x04o\xeb^n~h\xeby\xfb0<<k\xa3da,\x8a\x93\xa9\x19x\xd6\xca+\x8dY\x05\xbd\xc4|\x91\xb1\x1f\x92?\xf8\xbb\x92)\xc2\x84\x84\xbe;-\xf0', 10, 11, 'invalid start byte'))

I understand that this has something to do with the encoding of the extracted JSON file containing the documents. I tried setting environment variables such as LC_ALL=en_US.UTF-8 and LC_CTYPE=en_US.UTF-8; however, that does not seem to solve the problem.

docGen.py

    from faker import Faker
    import random
    import string
    import json
    import io

    fake = Faker()

    def data(records):
        for i in range(records):
            yield(dict([("id", ''.join(random.choices(string.ascii_uppercase + string.digits, k=32))),
                        ("sessionId", ''.join(random.choices(string.ascii_uppercase + string.digits, k=24)))]))

    d = data(10)

    with io.open('fakeData.json', 'w', encoding='utf-8') as f:
        for record in d:
            f.write(json.dumps(record, ensure_ascii=False))
            f.write('\n')
    print('Done')

Sample: fakeData.json

{"id": "MELU1V867SRTPBVHWKKFIGHEDGJV54DP", "sessionId": "AE5DBUIM0UEDETE78KGAFP2D"}
{"id": "YWHA80Q1J29CXFQX1A2BYSUO3OCOQMJR", "sessionId": "EGF4RH1T3ZG1ZI2V92ZDGTIW"}
{"id": "GVQYOZAL8VCSM5C6UV9QJNLZ9WPC3299", "sessionId": "D458X82TJIFEMO5KUDO2Y8DR"}
{"id": "4NK6QE0E2D4RJBQ61J0D2M5VD5OAAEOC", "sessionId": "6RFR40SIIRLWA1N9FCV2CFH5"}

Any pointers?

danielmitterdorfer · September 14, 2017, 2:53pm

Hi @sanketshinde,

I just tried your data generator locally and it works fine. So the issue must be somewhere else. I can see that the string starts with "BZh" which is the file signature for bz2. Did something go wrong when you compressed the file?
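To illustrate the signature check, here is a minimal Python sketch. The helper name `looks_like_bz2` is made up for this example, and `head` is just the first few bytes of the payload quoted in the error message above. It shows that the payload carries the bz2 signature and that decoding it as UTF-8 fails at the same offset the error reports.

```python
BZ2_MAGIC = b"BZh"  # every bzip2 stream starts with these three bytes


def looks_like_bz2(data: bytes) -> bool:
    """Return True if the bytes start with the bzip2 file signature.

    Hypothetical helper for illustration; not part of Rally.
    """
    return data.startswith(BZ2_MAGIC)


# First bytes of the payload quoted in the error message above.
head = b"BZh91AY&SY\x97\xc7e\xbe"

print(looks_like_bz2(head))  # True

try:
    # Treating compressed bytes as text is what triggers the error.
    head.decode("utf-8")
except UnicodeDecodeError as err:
    print(err.start, err.reason)  # 10 invalid start byte
```

In other words, something is being read as plain UTF-8 text although its content is bz2-compressed data.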

You should compress the file with:

    bzip2 -9 -c fakeData.json > fakeData.json.bz2

In your track you should only reference the file archive as follows:

    {
      "name": "my-index",
      "types": [
        {
          "name": "docs",
          "mapping": "mappings.json",
          "documents": "fakeData.json.bz2",
          "document-count": 10,
          "compressed-bytes": 541,
          "uncompressed-bytes": 840
        }
      ]
    }
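
The size and count fields in the snippet above have to match the actual files. As a rough sketch, assuming the data file fits in memory and contains one JSON document per line, they could be computed with Python's standard `bz2` module instead of the `bzip2` command. The function name `compress_and_describe` is hypothetical and not part of Rally.

```python
import bz2
import os


def compress_and_describe(json_path):
    """Compress json_path to json_path + '.bz2' and return the track fields.

    Hypothetical helper: reads the whole file into memory, so it is only
    suitable for small data files.
    """
    archive_path = json_path + ".bz2"
    with open(json_path, "rb") as src:
        raw = src.read()
    with open(archive_path, "wb") as dst:
        # compresslevel=9 corresponds to the -9 flag of the bzip2 command.
        dst.write(bz2.compress(raw, compresslevel=9))
    return {
        "documents": os.path.basename(archive_path),
        # One document per non-empty line.
        "document-count": sum(1 for line in raw.splitlines() if line.strip()),
        "compressed-bytes": os.path.getsize(archive_path),
        "uncompressed-bytes": len(raw),
    }


# Example usage (assumes fakeData.json exists in the current directory):
# print(compress_and_describe("fakeData.json"))
```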

Daniel

sanketshinde · September 15, 2017, 11:05am

Thanks for the effort.

This was a very convoluted and somewhat misleading error. First, the error message was displayed right after ESRally attempted to extract the compressed data file, which led me to focus on the JSON encoding in Python. On further investigation, the log files showed the error occurring right after Rally attempted to create the index with the name specified in track.json, which made me think it had something to do with index naming; that wasn't the case either. Finally, I checked all the files manually and found that the mapping file contained unexpected content instead of the actual mapping. However, nowhere in the log file is it explicitly mentioned that the problem was in the mapping file.

I simply replaced the contents of the mapping file with the correct mapping, and I could then run Rally successfully.
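
For anyone who runs into the same thing, one way to catch it before starting a race is to check the track's files up front. This is only a sketch under my own assumptions: the mapping file is plain UTF-8 JSON, the documents file is a bz2 archive with one JSON document per line, and `check_track_files` is a made-up helper, not something Rally provides.

```python
import bz2
import json


def check_track_files(mapping_path, documents_path):
    """Return a list of human-readable problems found in the two files.

    Hypothetical pre-flight check; an empty list means both files parsed.
    """
    problems = []

    # The mapping file must be plain text JSON, not a compressed archive.
    try:
        with open(mapping_path, "r", encoding="utf-8") as f:
            json.load(f)
    except (UnicodeDecodeError, json.JSONDecodeError) as err:
        problems.append(f"{mapping_path}: not valid UTF-8 JSON ({err})")

    # The documents file must be bz2-compressed, one JSON document per line.
    try:
        with bz2.open(documents_path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)
    except (OSError, EOFError, UnicodeDecodeError, json.JSONDecodeError) as err:
        problems.append(f"{documents_path}: not a valid bz2 JSON-lines file ({err})")

    return problems


# Example usage (file names taken from the track snippet earlier in the thread):
# for problem in check_track_files("mappings.json", "fakeData.json.bz2"):
#     print(problem)
```

With a check like this, the file name shows up in the message, which is the piece of information I was missing in the log.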

system · October 13, 2017, 11:05am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
