How to send post data to the ML jobs API in Elasticsearch

I cannot figure out how to send JSON data via Python when posting data to an ML job. The documentation is not clear on the format the JSON file needs to be in. I've tried many different options, but I keep getting different errors.

Here is an example of a JSON doc saved as file_name.json:

[{"my_id": "id1", "client": "client1", "submit_date": 1514764857},{"my_id": "id2", "client": "client2", "submit_date": 1514764857}]
import json
import os

from elasticsearch import Elasticsearch
from elasticsearch.client.xpack import MlClient

es = elastic_connection()  # local helper that returns an Elasticsearch instance
es_ml = MlClient(es)

def post_training_data(job_id, directory='Training Data', file_name='file_name.json'):
    with open(os.path.join(directory, file_name), mode='r') as train_file:
        train_data = json.load(train_file)
        es_ml.post_data(job_id=job_id, body=train_data)

post_training_data(job_id)  # job_id is defined elsewhere in the script
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "..\train_model.py", line 218, in post_training_data
    self.es_ml.post_data(job_id=self.job_id, body=train_data)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\client\utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\client\xpack\ml.py", line 81, in post_data
    body=self._bulk_body(body))
AttributeError: 'MlClient' object has no attribute '_bulk_body'

You found and filed a bug: https://github.com/elastic/elasticsearch-py/issues/959

Thanks for being a great community member! :beers:

I'm still not clear on exactly how to get the post data to jobs API working via the Python Elasticsearch client. The documentation says to send the data like so: "A sequence of one or more JSON documents containing the data to be analyzed. Only whitespace characters are permitted in between the documents."

How is this possible in Python? The json library only serializes multiple JSON docs in a comma-separated list format.
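For reference, one way to produce that whitespace-separated format yourself is to serialize each document individually and join them with newlines (newline-delimited JSON). This is just a sketch of the format, not the client's exact internals:

```python
import json

docs = [
    {"my_id": "id1", "client": "client1", "submit_date": 1514764857},
    {"my_id": "id2", "client": "client2", "submit_date": 1514764857},
]

# Serialize each document on its own line (NDJSON): the documents are
# then separated only by whitespace, as the ML post-data API expects.
ndjson = "\n".join(json.dumps(doc) for doc in docs) + "\n"
```

Each line of `ndjson` is a complete JSON object, so the payload as a whole is not valid JSON (there is no enclosing array), which is exactly what the "only whitespace in between" wording describes.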

After this fix is applied, the model accepts the documents and reports the correct number processed, but the results do not seem accurate. Very few anomalies are detected, even when sending five months of data (about 300,000 JSON docs).

When I sent the JSON docs as one string with no whitespace separating them, I got this error:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "..\train_model.py", line 289, in post_training_data
    self.es_ml.post_data(job_id=self.job_id, body=myjsons)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\client\utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\client\xpack\ml.py", line 81, in post_data
    body=self.client._bulk_body(body))
  File "..\inc_anamoly\lib\site-packages\elasticsearch\transport.py", line 318, in perform_request
    status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\connection\http_urllib3.py", line 186, in perform_request
    self._raise_error(response.status, raw_data)
  File "..\inc_anamoly\lib\site-packages\elasticsearch\connection\base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: RequestError(400, 'parse_exception', 'The input JSON data is malformed.')

Hi,

The _bulk_body method will serialize the data to the proper format expected by ML. After applying this fix, passing a list of dicts as the body parameter should work:

In [56]: data
Out[56]:
[{'client': 'client1', 'my_id': 'id1', 'submit_date': 1514764857},
 {'client': 'client2', 'my_id': 'id2', 'submit_date': 1514764857}]

In [57]: es.xpack.ml.open_job('test')
Out[57]: {'opened': True}

In [58]: es.xpack.ml.post_data('test', body=data)
Out[58]:
{'bucket_count': 0,
 'earliest_record_timestamp': 1514764857,
 'empty_bucket_count': 0,
 'input_bytes': 120,
 'input_field_count': 4,
 'input_record_count': 2,
 'invalid_date_count': 0,
 'job_id': 'test',
 'last_data_time': 1558635579074,
 'latest_record_timestamp': 1514764857,
 'missing_field_count': 2,
 'out_of_order_timestamp_count': 0,
 'processed_field_count': 0,
 'processed_record_count': 2,
 'sparse_bucket_count': 0}

As for the results of your job, there could be any number of reasons why you are seeing so few anomalies. What type of analysis are you trying to do, and what job configuration did you create?
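To illustrate the kind of configuration that matters here, a minimal anomaly detection job for data shaped like the sample docs might look something like the following. This is a hypothetical sketch; the bucket span, detector function, and field names are assumptions chosen to match the example data, not your actual job:

```python
# Hypothetical minimal ML job configuration: count-based anomaly
# detection, bucketed hourly, split by the "client" field.
job_config = {
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [
            # Flag clients whose event counts deviate from their baseline
            {"function": "count", "by_field_name": "client"},
        ],
    },
    "data_description": {
        "time_field": "submit_date",
        "time_format": "epoch",  # submit_date is in epoch seconds
    },
}
```

Choices like the bucket span and detector function strongly affect how many anomalies are surfaced, which is why the job configuration is the first thing to check when results look off.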

Also, if you haven't already, take a look at our getting started material for creating ML jobs here.
