Histogram based on partial matching in Kibana: field not_analyzed + _all disabled

Hey folks, I would like to use Kibana to create a histogram of all the different messages that contain a certain substring.

For example, the data contains messages like "Your event failed Cause 1", "Your event failed Cause 2", etc., where the integers are error codes. I would like to search for the substring "Your event failed Cause" and have Kibana output a histogram of the messages containing that substring, which will basically tell me how many events failed because of error code 1, how many because of error code 2, etc.

I would like to do this under the following constraints:

  1. the _all field is disabled (to save storage space; based on my tests, enabling the _all field would increase my storage requirements by 35%), and
  2. the message field is not_analyzed.

This is what I've tried so far:

  1. my ideal case: disable the _all field and set the message field as not_analyzed => searching for a substring does not return any results.
  2. disable the _all field but set the message field as analyzed => searching for the substring returns the expected results, but I cannot plot the histogram, because it counts the individual tokens in each message rather than the entire message string.

Is there a better way to do this?

My setup is as follows:

  • a Python script indexes logs into Elasticsearch 2.2.0 using the following mapping:
    index_mapping = {
        options.doc_type: {
            '_all': {
                'enabled': False
            },
            'properties': {
                'relative_time': {'type': 'string'},
                'real_time': {'type': 'date'},
                'channel': {'type': 'string', 'index': 'not_analyzed'},
                'message': {'type': 'string', 'index': 'not_analyzed'},
                'path': {'type': 'string'},
            },
        }
    }
  • I then use Kibana 4.4.1 to visualise the data.

Thanks :slight_smile:

Hmm, it sounds to me like you should be using saved filters and the search bar at the top to get your desired effect. For just a count, this should be a cakewalk for Kibana.

Let me know if I need to elaborate.

Peace,
Khalah

Hi Khalah,

I also thought that I should easily be able to do it with saved searches and the search bar, but I cannot do partial matches when the _all field is disabled and the message field is not_analyzed (e.g. a search like 'message:"Your event failed Cause 1"' works and returns the message, whereas a search like 'message:"Your event failed Cause"' doesn't return anything).

I need to be able to search for the substring rather than the full string, and ideally I want to keep the _all field disabled and the message field not_analyzed.

Hope that's a bit clearer, please let me know if you have any ideas. Thanks :slight_smile:

Hmm, it seems to me like your message field is set not to be indexed. Can you confirm this please? That's the only way the field wouldn't be searchable.

The message field is certainly indexed, I have checked and double-checked.

Here is a little experiment I ran today: using a data set of 500 MB, I indexed all events into 4 different indices in ES, using different mappings (see screenshot attached).

If you have any other ideas of what might be wrong, please let me know :slight_smile:

After some more investigation, I learnt that if the 'message' field is not_analyzed, I can't query it using partial matches (i.e. searching for substrings) like I wanted to, so there's no way around that issue.
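
To illustrate why this happens, here is a toy model in plain Python. It is not Elasticsearch itself: it assumes an analyzer that simply lowercases and splits on non-alphanumeric characters, which only approximates the real one, and the function names are made up for this sketch.

```python
import re
from collections import Counter


def analyzed_terms(text):
    # Crude stand-in for an analyzer: lowercase, split on non-alphanumerics.
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def phrase_matches(query, text, analyzed):
    if not analyzed:
        # A not_analyzed field is indexed as one term: the whole string.
        return query == text
    q, d = analyzed_terms(query), analyzed_terms(text)
    # Phrase match: the query tokens must appear consecutively in the text.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


def bucket_counts(messages, analyzed):
    # Roughly what a terms aggregation counts: documents per indexed term.
    counts = Counter()
    for m in messages:
        counts.update(set(analyzed_terms(m)) if analyzed else [m])
    return counts


messages = [
    "Your event failed Cause 1",
    "Your event failed Cause 1",
    "Your event failed Cause 2",
]
print(phrase_matches("Your event failed Cause", messages[0], analyzed=True))
print(phrase_matches("Your event failed Cause", messages[0], analyzed=False))
print(bucket_counts(messages, analyzed=False))
print(bucket_counts(messages, analyzed=True))
```

In this model the analyzed field matches the partial phrase but produces one bucket per token, while the not_analyzed field produces one bucket per full message but only matches the exact string.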

It looks like I need the 'message' field to be analyzed in order to do partial matches, but not_analyzed in order to create a meaningful histogram of unique messages. The answer to this dilemma is multi-fields: I ended up indexing the 'message' field as both analyzed and not_analyzed, while keeping the '_all' field disabled, like so:

index_mapping = {
    options.doc_type: {
        '_all': {
            'enabled': False
        },
        'properties': {
            'real_time': {'type': 'date'},
            'channel': {'type': 'string', 'fields': {'raw': {'type': 'string', 'index': 'not_analyzed'}}},
            'message': {'type': 'string', 'fields': {'raw': {'type': 'string', 'index': 'not_analyzed'}}},
            'path': {'type': 'string', 'index': 'not_analyzed'},
        },
    }
}

I can use the 'message' field to do partial matches (because it's analyzed) and the 'message.raw' field to create meaningful histograms (because it's not_analyzed). Storage requirements increase by 25% relative to TEST 1 (see image in post above), so this is a better solution than enabling the '_all' field.
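
For anyone who wants the same result outside Kibana, here is a minimal sketch of the equivalent request body: a phrase query on 'message' combined with a terms aggregation on 'message.raw'. The function name, the aggregation name 'messages' and the bucket size are my own choices for illustration, not something from the setup above.

```python
import json


def build_failure_histogram_request(substring, size=20):
    """Search the analyzed field, bucket on the not_analyzed sub-field."""
    return {
        # No hits needed, only the aggregation buckets.
        "size": 0,
        # Phrase match works because 'message' is analyzed.
        "query": {"match_phrase": {"message": substring}},
        # One bucket per full message string, because 'message.raw' is not_analyzed.
        "aggs": {
            "messages": {
                "terms": {"field": "message.raw", "size": size}
            }
        },
    }


body = build_failure_histogram_request("Your event failed Cause")
print(json.dumps(body, indent=2))
```

With the elasticsearch-py client this dictionary could be passed as the body argument of a search call against whichever index holds the logs. In Kibana the same thing corresponds to searching 'message:"Your event failed Cause"' and plotting a terms aggregation on 'message.raw'.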

Now my question becomes: why can't we do partial matches on not_analyzed fields? It seems like important functionality to have.
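
One possibility I have not tested against this setup: term-level queries such as the prefix query operate on the indexed term itself, so they should be able to match the start of a not_analyzed string. A sketch of such a request body follows; the function name and aggregation name are hypothetical, and it only covers the case where the substring sits at the beginning of the message, as it does in the example messages above.

```python
import json


def build_prefix_histogram_request(prefix, field="message", size=20):
    """Prefix query and terms aggregation on the same not_analyzed field."""
    return {
        "size": 0,
        # The prefix is compared against the whole stored term, case-sensitively.
        "query": {"prefix": {field: prefix}},
        "aggs": {
            "messages": {
                "terms": {"field": field, "size": size}
            }
        },
    }


body = build_prefix_histogram_request("Your event failed Cause")
print(json.dumps(body, indent=2))
```

A substring in the middle of the message would need a wildcard query with a leading wildcard instead, which tends to be expensive on large indices, so the multi-field mapping may still be the more practical choice.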