I cannot figure out how to send JSON data via Python to the post data to jobs API. The documentation is not clear on the format the JSON file needs to be in. I've tried many different options, but I keep getting different errors.
Here is the code I use to read a JSON file saved as file_name.json and post its contents:

import os
import json

from elasticsearch import Elasticsearch
from elasticsearch.client.xpack import MlClient

es = elastic_connection()  # helper that returns an Elasticsearch instance
es_ml = MlClient(es)

def post_training_data(directory='Training Data', file_name='file_name.json'):
    with open(os.path.join(directory, file_name), mode='r') as train_file:
        train_data = json.load(train_file)
        es_ml.post_data(job_id=job_id, body=train_data)  # job_id is defined elsewhere

post_training_data()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "..\train_model.py", line 218, in post_training_data
    self.es_ml.post_data(job_id=self.job_id, body=train_data)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\client\utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\client\xpack\ml.py", line 81, in post_data
    body=self._bulk_body(body))
AttributeError: 'MlClient' object has no attribute '_bulk_body'
I'm still not clear on exactly how to get the post data to jobs API to work via the Python Elasticsearch client. The documentation says to send the data like so: "A sequence of one or more JSON documents containing the data to be analyzed. Only whitespace characters are permitted in between the documents."
How is this possible in Python? The json library only lets you serialize multiple JSON docs as a comma-separated list (i.e., a JSON array).
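One way to produce that format (a sketch; the field names below are made up, not from my real data) is to serialize each doc individually with json.dumps and join the results with newlines, rather than dumping the whole list as one JSON array:

```python
import json

# Hypothetical sample docs; real docs would carry your own fields.
docs = [
    {"timestamp": "2019-01-01T00:00:00Z", "value": 1.5},
    {"timestamp": "2019-01-01T00:05:00Z", "value": 2.0},
]

# json.dumps serializes one doc at a time; joining with "\n" yields
# "one or more JSON documents" separated only by whitespace.
body = "\n".join(json.dumps(doc) for doc in docs)
```

Each line of the resulting string is then an independent, parseable JSON document.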
After this fix is applied, the model accepts the documents and shows the correct number processed, but the results do not seem accurate. Very few anomalies are detected, even when sending 5 months of data (about 300,000 JSON docs).
When I sent the JSON docs as one string with no whitespace separating them, I got this error:
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "..\train_model.py", line 289, in post_training_data
    self.es_ml.post_data(job_id=self.job_id, body=myjsons)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\client\utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\client\xpack\ml.py", line 81, in post_data
    body=self.client._bulk_body(body))
  File "..\inc_anamoly\lib\site-packages\elasticsearch\transport.py", line 318, in perform_request
    status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 186, in perform_request
    self._raise_error(response.status, raw_data)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\connection\base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: RequestError(400, 'parse_exception', 'The input JSON data is malformed.')
The _bulk_body method serializes the data to the proper format expected by ML. After applying this fix, passing a list of dicts as the body parameter should work.
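For reference, the serialization _bulk_body performs is roughly equivalent to this sketch (an illustration, not the library's exact code): each list item ends up as JSON on its own line, with a trailing newline.

```python
import json

def bulk_body(body):
    # Strings are passed through as-is; anything else is JSON-serialized.
    # The items are newline-delimited with a trailing newline, which is
    # the whitespace-separated format the post data endpoint expects.
    lines = [item if isinstance(item, str) else json.dumps(item)
             for item in body]
    return "\n".join(lines) + "\n"

payload = bulk_body([{"a": 1}, {"b": 2}])
```

So a plain list of dicts passed as body gets turned into the newline-delimited document sequence before it is sent over the wire.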
As for the results of your job, there could be any number of reasons why you are seeing them. What type of analysis are you trying to do, and what job configuration did you create?
Also, if you haven't already, take a look at our getting started material for creating ML jobs here.