JSON Parse error

I get a parse error when trying to index a JSON file. I am running this from the Python API for ES and, having stepped deep into the code, the error appears to be related to the format that Tika generates. I have a very simple TXT file (with a few sentences in it) which I ran through Tika to convert to JSON. When I check it at jsonlint it shows as valid JSON. The issue appears to be that Tika uses square brackets at the beginning and end of the file rather than curly braces. So it is valid JSON, but ES chokes when I try to index it. I would attach the file but it is not an allowed file type.

Has anyone had this issue with Tika? It is such a ubiquitous tool that I find it odd that the file will not index easily. Is there another tool that converts as many file types as Tika and produces output that does index? For reference, I am using the sample code provided by Elasticsearch PY (endpoint URL omitted); the error occurs at the es.index call:

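import json
import requests
from elasticsearch import Elasticsearch

es = Elasticsearch('https://[omitted].com:30737')
res = requests.get('https://[omitted].com:30737')  # check that the endpoint responds

# Load the Tika output; json.load returns a list here because the file starts with "["
with open("content.txt.json", "r") as f:
    content = json.load(f)
res = es.index(index="json-test-index", doc_type='test', id=2, body=content)  # fails with a 400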

Elasticsearch needs each document to be a single JSON object, but it sounds like Tika is spitting out a list. You have to convert it.
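
To make the distinction concrete, here is a minimal sketch using only the standard library. It shows that a top-level JSON array parses to a Python list, while a top-level JSON object parses to a dict, and only the latter corresponds to a single document. The helper name `top_level_type` and the sample strings are made up for illustration; they are not the real Tika output.

```python
import json

def top_level_type(json_text):
    """Return the Python type name of the outermost JSON value."""
    return type(json.loads(json_text)).__name__

# Same shape as the Tika output described above: an array wrapping one object.
tika_like = '[{"resourceName": "content1.txt"}]'
# The shape a single document body needs: one object.
single_doc = '{"resourceName": "content1.txt"}'

print(top_level_type(tika_like))   # list
print(top_level_type(single_doc))  # dict
```

Both strings are valid JSON, which is why a validator accepts the file even though it is not a single object.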

Thank you for the idea. At first I thought the json.load call was the culprit, since I think that converts the JSON to a list. I tried a plain file.read() instead and got the same error. I then imported simplejson and tried json.load with that, with the same result. I am not sure what to do, and this is a total show stopper for me. I would greatly appreciate any other ideas! (I am pasting the file content below):

[{"Content-Encoding":"windows-1252","Content-Length":"508","Content-Type":"text/plain; charset\u003dwindows-1252","X-Parsed-By":["org.apache.tika.parser.DefaultParser","org.apache.tika.parser.txt.TXTParser"],"X-TIKA:content":"\u003chtml xmlns\u003d\"http://www.w3.org/1999/xhtml\"\u003e\n\u003chead\u003e\n\u003cmeta name\u003d\"X-Parsed-By\" content\u003d\"org.apache.tika.parser.DefaultParser\" /\u003e\n\u003cmeta name\u003d\"X-Parsed-By\" content\u003d\"org.apache.tika.parser.txt.TXTParser\" /\u003e\n\u003cmeta name\u003d\"tika:file_ext\" content\u003d\"txt\" /\u003e\n\u003cmeta name\u003d\"Content-Encoding\" content\u003d\"windows-1252\" /\u003e\n\u003cmeta name\u003d\"tika_batch_fs:relative_path\" content\u003d\"content1.txt\" /\u003e\n\u003cmeta name\u003d\"resourceName\" content\u003d\"content1.txt\" /\u003e\n\u003cmeta name\u003d\"Content-Length\" content\u003d\"508\" /\u003e\n\u003cmeta name\u003d\"X-TIKA:digest:MD5\" content\u003d\"8ac30237b478064c3f595e6f71189728\" /\u003e\n\u003cmeta name\u003d\"Content-Type\" content\u003d\"text/plain; charset\u003dwindows-1252\" /\u003e\n\u003ctitle\u003e\u003c/title\u003e\n\u003c/head\u003e\n\u003cbody\u003e\u003cp\u003eTry the Elastic Stack 5.0.0-alpha5, and join the Elastic Pioneer Program\r\n\r\nThe 5.0.0-alpha5 release candidates of Elasticsearch, Logstash, Beats, Kibana, and X-Pack are now available to download and test! Learn about the releases here.\r\n\r\nAs you try out the alpha5, be sure to open any issues in the GitHub repos for the projects (for X-Pack, post in the X-Pack category on Discuss). As a thank you for reporting issues, you will receive special gifts and recognition as part of the Elastic Pioneer Program.\u003c/p\u003e\n\u003c/body\u003e\u003c/html\u003e","X-TIKA:digest:MD5":"8ac30237b478064c3f595e6f71189728","X-TIKA:parse_time_millis":"109","resourceName":"content1.txt","tika:file_ext":"txt","tika_batch_fs:relative_path":"content1.txt"}]

Well, the outermost entity is a list. No JSON library in the world is going to say otherwise. Instead of passing the whole parsed entity to ES just pass the first element of the list.
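
A minimal sketch of that suggestion, assuming the file holds a list with one metadata object per parsed document, as in the paste above. The helper names `load_tika_documents` and `index_tika_file` are hypothetical. `es` is expected to be an `Elasticsearch` client, and the `index` call mirrors the one in the original snippet.

```python
import json

def load_tika_documents(path):
    """Load a Tika JSON file and return its contents as a list of dicts."""
    with open(path, "r") as f:
        parsed = json.load(f)
    # Accept either a bare object or a list of objects.
    if isinstance(parsed, dict):
        return [parsed]
    if isinstance(parsed, list) and all(isinstance(d, dict) for d in parsed):
        return parsed
    raise ValueError("expected a JSON object or a list of JSON objects")

def index_tika_file(es, path, index="json-test-index", doc_type="test", first_id=2):
    """Index each object from the file as its own document."""
    results = []
    for offset, doc in enumerate(load_tika_documents(path)):
        # body is a single dict here, not the enclosing list
        results.append(
            es.index(index=index, doc_type=doc_type, id=first_id + offset, body=doc)
        )
    return results
```

For a one-off test with the original script, the equivalent change is passing `body=content[0]` instead of `body=content`. Note that this must happen after `json.load`; deleting the brackets from the raw text by hand and passing the resulting string is not the same thing.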

Do you mean pass in everything inside the square brackets? I tried that and I get the same error. However I think you may mean something different.

Do you know of another file conversion tool other than Tika that ES users use often? That might be easier.

Thank you very much!

One more thing. Even though this is a list, it still tests as valid JSON at jsonlint. Thx.

Do you mean pass in everything inside the square brackets?


I tried that and I get the same error.

Please show the actual error message.

The stack trace is below. I am also attaching two screenshots of where in the code the error occurs (not sure if this provides useful info). Thank you!

PUT /json-test-index/test/2 [status:400 request:0.392s]
Traceback (most recent call last):
  File "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.1.4\helpers\pydev\pydevd.py", line 1531, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.1.4\helpers\pydev\pydevd.py", line 938, in run
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.1.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Text_Pipeline/Elastic/elastic.py", line 35, in <module>
    res = es.index(index="json-test-index", doc_type='test', id=2, body=content)
  File "C:\Users\Paul Starret\AppData\Roaming\Python\Python34\site-packages\elasticsearch\client\utils.py", line 69, in wrapped
    return func(*args, params=params, **kwargs)
  File "C:\Users\Paul Starret\AppData\Roaming\Python\Python34\site-packages\elasticsearch\client\__init__.py", line 279, in index
    _make_path(index, doc_type, id), params=params, body=body)
  File "C:\Users\Paul Starret\AppData\Roaming\Python\Python34\site-packages\elasticsearch\transport.py", line 327, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "C:\Users\Paul Starret\AppData\Roaming\Python\Python34\site-packages\elasticsearch\connection\http_urllib3.py", line 109, in perform_request
    self._raise_error(response.status, raw_data)
  File "C:\Users\Paul Starret\AppData\Roaming\Python\Python34\site-packages\elasticsearch\connection\base.py", line 113, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'mapper_parsing_exception', 'failed to parse')
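
The one-line summary at the end of the trace leaves out the response body the server sent back. As a sketch, assuming the exception exposes the `status_code`, `error`, and `info` attributes that elasticsearch-py's `TransportError` provides, a small helper (the name `describe_es_error` is hypothetical) can print that body, which typically carries a more specific reason than 'failed to parse':

```python
import json

def describe_es_error(exc):
    """Format the details carried by an elasticsearch-py style exception."""
    # getattr with defaults, so exceptions lacking these attributes do not raise here.
    status = getattr(exc, "status_code", None)
    error = getattr(exc, "error", None)
    info = getattr(exc, "info", None)
    return "status={} error={}\n{}".format(
        status, error, json.dumps(info, indent=2, default=str)
    )

# Intended use around the failing call (es and content as in the original script):
# try:
#     es.index(index="json-test-index", doc_type='test', id=2, body=content)
# except Exception as e:
#     print(describe_es_error(e))
```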

Weird. Which version of elasticsearch-py are you using?

I am using Elasticsearch PY version 2.4 with Python 3.4.

Paul Starrett, CEO, Starrett Consulting, Inc.

Licensed Private Investigator and Attorney
Certified Fraud Examiner
EnCase Certified Computer Forensics Examiner (EnCE)
Master of Science in Predictive Analytics (Northwestern U.)