ESRally - data.json.bz2

I am writing a dummy data file, fakeData.json, using a Python script, docGen.py, to serve as the data file for a custom Rally track. When I try to run the race, I get the following error:

[ERROR] Cannot race. ('Could not execute benchmark', UnicodeDecodeError('utf-8', b'BZh91AY&SY\x97\xc7e\xbe\x00\x14/_\x80\x10P\x07\x7f\xf0?\xff\xff\xf0\xbf\xef\xffja\xf6\xf9M\xb8\xc9:\x91\xa2\xa6\xde\xbcUU\x1f6-eR\xda\x9c\xbd\x04o\xeb^n~h\xeby\xfb0<<k\xa3da,\x8a\x93\xa9\x19x\xd6\xca+\x8dY\x05\xbd\xc4|\x91\xb1\x1f\x92?\xf8\xbb\x92)\xc2\x84\x84\xbe;-\xf0', 10, 11, 'invalid start byte'))

I understand that this has something to do with the encoding of the extracted JSON file containing the documents. I tried setting environment variables such as LC_ALL=en_US.UTF-8 and LC_CTYPE=en_US.UTF-8, but that does not seem to solve the problem.

docGen.py

from faker import Faker
import random
import string
import json
import io

fake = Faker()

def data(records):
    # Yield `records` documents with random alphanumeric id and sessionId values
    for i in range(records):
        yield(dict([("id", ''.join(random.choices(string.ascii_uppercase + string.digits, k=32))),
                    ("sessionId", ''.join(random.choices(string.ascii_uppercase + string.digits, k=24)))]))

d = data(10)

# Write one JSON document per line
with io.open('fakeData.json', 'w', encoding='utf-8') as f:
    for record in d:
        f.write(json.dumps(record, ensure_ascii=False))
        f.write('\n')
print('Done')

Sample: fakeData.json

{"id": "MELU1V867SRTPBVHWKKFIGHEDGJV54DP", "sessionId": "AE5DBUIM0UEDETE78KGAFP2D"}
{"id": "YWHA80Q1J29CXFQX1A2BYSUO3OCOQMJR", "sessionId": "EGF4RH1T3ZG1ZI2V92ZDGTIW"}
{"id": "GVQYOZAL8VCSM5C6UV9QJNLZ9WPC3299", "sessionId": "D458X82TJIFEMO5KUDO2Y8DR"}
{"id": "4NK6QE0E2D4RJBQ61J0D2M5VD5OAAEOC", "sessionId": "6RFR40SIIRLWA1N9FCV2CFH5"}

Any pointers?

Hi @sanketshinde,

I just tried your data generator locally and it works fine. So the issue must be somewhere else. I can see that the string starts with "BZh" which is the file signature for bz2. Did something go wrong when you compressed the file?
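
To illustrate the point about the signature, here is a small sketch that decodes the first bytes from the error message as UTF-8. It fails at the same offset reported in the error, which suggests that bz2-compressed bytes were being read as if they were plain text. The variable names are only for illustration.

```python
# First 11 bytes copied from the error message in the question
header = b"BZh91AY&SY\x97"

try:
    header.decode("utf-8")
    error = None
except UnicodeDecodeError as e:
    error = e

# error.start is the offset of the offending byte, error.reason says why it failed
print(error.start, error.end, error.reason)  # prints: 10 11 invalid start byte
```

The byte 0x97 at offset 10 is a UTF-8 continuation byte, so it cannot start a character.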

You should compress the file with:

bzip2 -9 -c fakeData.json > fakeData.json.bz2
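
As a quick sanity check on the resulting archive, something like the following could be used. It is a minimal sketch and not part of Rally; the function name `check_bz2_json_lines` is hypothetical, and it assumes the data file holds one JSON document per line.

```python
import bz2
import json


def check_bz2_json_lines(path):
    """Return the number of JSON documents in a bz2-compressed file.

    Raises ValueError if the file lacks the bz2 signature; decoding or
    JSON parsing errors propagate to the caller.
    """
    with open(path, "rb") as raw:
        if raw.read(3) != b"BZh":
            raise ValueError(f"{path} does not start with the bz2 signature")

    count = 0
    # "rt" mode decompresses and decodes to text in one pass
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError on bad input
                count += 1
    return count
```

Calling `check_bz2_json_lines("fakeData.json.bz2")` should then return the number of documents in the file.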

In your track you should only reference the file archive as follows:

    {
      "name": "my-index",
      "types": [
        {
          "name": "docs",
          "mapping": "mappings.json",
          "documents": "fakeData.json.bz2",
          "document-count": 10,
          "compressed-bytes": 541,
          "uncompressed-bytes": 840
        }
      ]
    }
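
The values for `document-count`, `compressed-bytes` and `uncompressed-bytes` should match your own files. A sketch for computing them follows; the helper name `track_stats` is hypothetical, and it assumes one document per line in the uncompressed file.

```python
import os


def track_stats(json_path, bz2_path):
    """Compute the size-related fields for a track's document entry."""
    with open(json_path, "rb") as f:
        # Count non-empty lines; each one is assumed to be a document
        document_count = sum(1 for line in f if line.strip())
    return {
        "document-count": document_count,
        "compressed-bytes": os.path.getsize(bz2_path),
        "uncompressed-bytes": os.path.getsize(json_path),
    }
```

For example, `track_stats("fakeData.json", "fakeData.json.bz2")`. With the generator from the question and `\n` line endings, each line is 84 bytes, which is consistent with 840 uncompressed bytes for 10 documents.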

Daniel


Thanks for the effort.

This was a convoluted and somewhat misleading error. First, the error message appeared right after ESRally attempted to extract the compressed data file, which led me to focus on JSON encoding in Python. On further investigation, the log files showed the error occurring right after Rally attempted to create the index with the name specified in track.json, which made me think it had something to do with index naming; that wasn't the case either. Finally, I checked all the files manually and found that the mapping file contained unexpected content instead of the mapping. However, nowhere in the log file does it explicitly say that the problem was in the mapping file.

I simply replaced the contents of the mapping file with the correct mapping, and I could then run Rally successfully.
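
A check along these lines might have caught the problem earlier. This is a minimal sketch and not a Rally feature; `check_json_file` is a hypothetical helper, and it only assumes that the mapping file must be a UTF-8 encoded JSON document. The bytes in the error message start with `BZh`, so it seems plausible (though not confirmed in this thread) that compressed data had ended up in the mapping file.

```python
import json


def check_json_file(path):
    """Return None if `path` holds a UTF-8 JSON document, else a short description of the problem."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(b"BZh"):
        return "starts with the bz2 signature; looks like a compressed archive"
    try:
        json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError as e:
        return f"not valid UTF-8: {e}"
    except json.JSONDecodeError as e:
        return f"not valid JSON: {e}"
    return None
```

Running `check_json_file("mappings.json")` before starting the race would report a problem with the file itself rather than an error further downstream.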
