Can’t merge a non object mapping with an object mapping error in machine learning (beta) module

Hi,

I'm trying out the new machine learning module in X-Pack. I'm trying to identify rare response codes in HTTP access logs over time. My logs are stored in Elasticsearch as below:

{
  "_index": "logstash-2017.05.18",
  "_type": "Accesslog",
  "_id": "AVxvVfFGdMmRr-0X-J5P",
  "_version": 1,
  "_score": null,
  "_source": {
    "request": "/web/Q123/images/buttons/asdf.gif",
    "server": "91",
    "auth": "-",
    "ident": "-",
    "verb": "GET",
    "type": "Accesslog",
    "path": "/path/to/log",
    "@timestamp": "2017-05-18T10:20:00.000Z",
    "response": "304",
    "clientip": "1.1.1.1",
    "@version": "1",
    "host": "ip-10-10-10-10",
    "httpversion": "1.1",
    "timestamp": "18/May/2017:10:20:00 +0530"
  },
  "fields": {
    "@timestamp": [
      1495102800000
    ]
  }
}

I added a detector where I selected the function as 'rare' and the 'by_field_name' as 'response'. But when I save the job I get the following error:

Save failed: [illegal_argument_exception] Can't merge a non object mapping [response] with an object mapping [response]

Please help.

Hi Gautam,

The error is due to a mapping clash in the index where the job results are stored. In this case response is a keyword field, but that clashes with an existing mapping for response. By default the results from all jobs are stored in a single shared index named .ml-anomalies-shared, and that index already contains a mapping for response created by a different job.

The solution is to use a dedicated index for the job, so that the results are saved in a new index without any existing mappings. In the UI, check the 'Use Dedicated Index' box when you create the job; if using the API, specify a different index with the results_index_name setting.
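For example, via the job creation API it might look something like this (a minimal sketch; the job name, bucket span, and results index name are placeholders, and in the 5.4 X-Pack APIs the endpoint lives under _xpack/ml):

PUT _xpack/ml/anomaly_detectors/rare-response-codes
{
  "analysis_config": {
    "bucket_span": "5m",
    "detectors": [
      {
        "function": "rare",
        "by_field_name": "response"
      }
    ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "results_index_name": "rare-response-codes"
}

With results_index_name set, the results are written to an index named .ml-anomalies-custom-rare-response-codes instead of the shared index.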


Hi,

That worked!!! Thanks.

But now I'm having trouble while running the datafeed. I got an error, shown in the attached screenshot.

Request your assistance.

It looks like your job configuration also uses the field loglevel? And this is mapped as type text?

If so, and if you have such a field, it would be better to switch to a keyword field that stores the same information.

Another thing you could do is change the job not to use aggregations, and set "_source": true in the datafeed section of your job config. Then the datafeed will scroll through the input data and extract the required fields from the _source field instead of getting them via an aggregation. (If this is hard to follow, please post the entire job config in JSON form, which you can get by clicking the twisty arrow to the left of the job name in your screenshot and then clicking the "JSON" tab. Then I can be more precise about exactly what to change.)
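For illustration, here's a minimal sketch of what that datafeed section could look like (the job ID, datafeed ID, and index pattern are placeholders; everything else would stay as in your existing config):

"datafeed_config": {
  "datafeed_id": "datafeed-my-job",
  "job_id": "my-job",
  "indexes": [
    "my-logs-*"
  ],
  "_source": true,
  "query": {
    "match_all": {}
  },
  "scroll_size": 1000
}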

Hi,

Thanks for the reply.

In the machine learning module there is no option to use the keyword field; it does not show up in the drop-down options, so I am unable to use it.

Please find below the job config in JSON form as requested:

{
  "job_id": "test-advanced",
  "job_type": "anomaly_detector",
  "description": "",
  "create_time": 1500371132668,
  "finished_time": 1500371261941,
  "analysis_config": {
    "bucket_span": "5m",
    "categorization_field_name": "message",
    "detectors": [
      {
        "detector_description": "count by mlcategory partitionfield=type",
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "type",
        "detector_rules": []
      }
    ],
    "influencers": [
      "server",
      "type",
      "message",
      "loglevel"
    ]
  },
  "data_description": {
    "time_field": "@timestamp",
    "time_format": "epoch_ms"
  },
  "model_snapshot_retention_days": 1,
  "results_index_name": "custom-test-advanced",
  "data_counts": {
    "job_id": "test-advanced",
    "processed_record_count": 0,
    "processed_field_count": 0,
    "input_bytes": 0,
    "input_field_count": 0,
    "invalid_date_count": 0,
    "missing_field_count": 0,
    "out_of_order_timestamp_count": 0,
    "empty_bucket_count": 0,
    "sparse_bucket_count": 0,
    "bucket_count": 0,
    "input_record_count": 0
  },
  "model_size_stats": {
    "job_id": "test-advanced",
    "result_type": "model_size_stats",
    "model_bytes": 0,
    "total_by_field_count": 0,
    "total_over_field_count": 0,
    "total_partition_field_count": 0,
    "bucket_allocation_failures_count": 0,
    "memory_status": "ok",
    "log_time": 1500371261000,
    "timestamp": -300000
  },
  "datafeed_config": {
    "datafeed_id": "datafeed-test-advanced",
    "job_id": "test-advanced",
    "query_delay": "60s",
    "frequency": "150s",
    "indexes": [
      "prod_log-*"
    ],
    "types": [
      "Messagelog",
      "SystemOut",
      "SystemError",
      "Errorlog"
    ],
    "query": {
      "match_all": {
        "boost": 1
      }
    },
    "scroll_size": 1000,
    "chunking_config": {
      "mode": "auto"
    },
    "state": "stopped"
  },
  "state": "closed"
}

In 5.4 it's true that the dropdown doesn't show the .keyword sub-field for multi fields. However, assuming loglevel is a multi field, you can manually change it to loglevel.keyword in the input box.

We strongly recommend that by_field_name, over_field_name, partition_field_name and influencers refer to fields of type keyword (or possibly long/integer/short/byte in cases where the numbers represent a limited set of options, such as HTTP status codes). And we strongly recommend that categorization_field_name refers to a field of type text.
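For reference, a multi field of the kind assumed above typically looks like this in the mapping (a sketch based on the common Logstash default template; your actual mapping may differ):

"loglevel": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

With a mapping like this, loglevel is analyzed free text, while loglevel.keyword stores the exact original value, which is what by/over/partition fields and influencers need.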

In 5.5 we've made it easier to make the right choices. In 5.5 the dropdown will only show the .keyword versions of multi fields. But it's still possible to manually edit the field names in cases where the dropdown didn't show the correct choice for some reason.

One more thing. Assuming message is a complete log message, it's a bad idea to have it listed in the influencers, because presumably almost every message is different. The point of categorization is to reduce the practically infinite number of possible messages to a manageable number of groupings of similar messages. On the other hand, if your message field contains just a single word chosen from some limited set of options then it's not a good candidate for categorization. (If it's hard to understand what I'm talking about here, I can be more specific if you post a few examples of what's in your message field.)

Hi,

Please find below the sample messages generated:

[12/20/16 10:41:50:295 GMT+05:30] 00000883 SystemErr     R com.common.NonFatalException: Invalid Account Number
	at com.bank.subframework.integration.core.impl.MessageAdaptorImpl.sendToHost(Unknown Source)
	at com.bank.subframework.integration.core.impl.MessageAdaptorImpl.transceive(Unknown Source)
	at com.bank.subframework.integration.core.impl.OrchestratorImpl.process(Unknown Source)
	at com.bank.subframework.integration.core.impl.OrchestratorImpl.transceive(Unknown Source)
	at com.bank.system.common.hostaccess.GenericHostAccess.getResponseFromHost(Unknown Source)

ERROR | 2016-12-21 00:08:14,523 | WebContainer : 51 | 270 | UY | Default_User | NON FATAL error occured in CreditCardBillPay method
Stack
com.bank.system.common.systemNonFatalException: Host Not Available
	at com.bank.system.creditCard.hostinterface.maskedonnect.processRequest
	at com.bank.system.creditCard.hostinterface.maskedonnect.CreditCardBillPay
	at indus.web.creditCard.CCBillPay.cmd.InetCCBillPayService.ccbillPayECS
	at indus.web.creditCard.CCBillPay.cmd.InetCCBillPayService.processRequest

Additionally, I tried entering the .keyword versions in the fields you mentioned. The JSON config is given below:

{
  "job_id": "test2-advanced",
  "description": "",
  "analysis_config": {
    "bucket_span": "10m",
    "influencers": [
      "type.keyword",
      "server.keyword",
      "loglevel.keyword"
    ],
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "type.keyword"
      }
    ],
    "categorization_field_name": "message.keyword"
  },
  "data_description": {
    "time_field": "@timestamp"
  },
  "results_index_name": "test2-advanced",
  "datafeed_config": {
    "query": {
      "match_all": {}
    },
    "query_delay": "60s",
    "frequency": "300s",
    "scroll_size": 1000,
    "indexes": [
      "prod_log-*"
    ],
    "types": [
      "Messagelog",
      "SystemOut",
      "SystemError",
      "Errorlog"
    ]
  }
}

Now the job is not saving, and I'm getting the following error:

Save failed: [status_exception] A field has a different mapping type to an existing field with the same name. Use the 'results_index_name' setting to assign the job to another index

Thanks.

I think you should change "categorization_field_name": "message.keyword" to "categorization_field_name": "message", because your message field looks like it's best handled as a text field. message looks like a good candidate for categorization.

The reason you're getting that clash error again is that the results index must have been previously used for another job (possibly with the same name) that resulted in different mappings. You could either change "results_index_name": "test2-advanced" to a name you've never used before, or, if the index .ml-anomalies-custom-test2-advanced is completely empty, delete this index and then try saving the job again.
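If you go the deletion route, you can check whether the index is empty and then remove it with the standard index APIs (make sure no other job is still writing to it first):

GET .ml-anomalies-custom-test2-advanced/_count

DELETE .ml-anomalies-custom-test2-advanced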

Hi,

It's finally working!!! Although I had to keep "categorization_field_name": "message.keyword"; just message did not work.

The job is running and the data is being processed; however, about 2 hours after I started the job (it's performing a real-time search) I got an error: Datafeed is encountering errors extracting data: all shards failed. The job is still running and data is still being processed. Is this something to be concerned about?

Glad you finally got something working.

The error Datafeed is encountering errors extracting data: all shards failed can happen if a scroll ID becomes invalid, which could happen because a node involved in the scroll left the cluster. This is an area we've improved in version 5.5, where we retry immediately in this case. But as long as you only see this error infrequently, the analysis is probably working well enough to give good results even in version 5.4 (the ML beta).

I just remembered why message on its own did not work. It's related to another improvement we've made in 5.5. In 5.4 we would get our data from either doc values or _source, never a mixture. And of course message won't be in doc values as it's text. In 5.5 we'll use whichever of doc values and _source is most appropriate for each field. So, while you're on 5.4 go ahead and see if you get interesting results using message.keyword. However, once you upgrade to 5.5 or beyond it would be best to recreate the job using message, because this will enable efficient drilldown from anomalies in the categorization job to view the original documents that are in the anomalous category. (This drilldown from categories to original documents is another new feature of 5.5.)
