Histogram based on partial matching in Kibana: field not_analyzed + _all disabled


#1

Hey folks, I would like to use Kibana to create a histogram of all the different messages that contain a certain substring.

For example, the data contains messages like "Your event failed Cause 1", "Your event failed Cause 2", etc., where the integers are error codes. I would like to search for the substring "Your event failed Cause" and have Kibana output a histogram of the messages containing that substring, which will basically tell me how many events failed because of error code 1, how many because of error code 2, etc.

I would like to do this under the circumstances where

  1. the _all field is disabled (to save storage space; based on my tests, enabling the _all field would increase my storage requirements by 35%), and
  2. the message field is not_analyzed.

This is what I've tried so far:

  1. my ideal case: disable the _all field and set the message field as not_analyzed => searching for a substring does not return any results.
  2. disable the _all field but set the message field as analyzed => searching for the substring returns the expected results; however, I cannot plot the histogram, because it plots the count of the individual tokens in each message rather than of the entire message string.

Is there a better way to do this?

My setup is as follows:

  • a Python script indexes logs into Elasticsearch 2.2.0 using the following mapping:
    index_mapping = {
        options.doc_type : {
            '_all' : {
                'enabled' : False
            },
            'properties' : {
                'relative_time' : {'type' : 'string'},
                'real_time' : {'type' : 'date'},
                'channel' : {'type' : 'string', 'index' : 'not_analyzed'},
                'message' : {'type' : 'string', 'index' : 'not_analyzed'},
                'path' : {'type' : 'string'},
            },
        }
    }
  • I then use Kibana 4.4.1 to visualise the data.

Thanks :slight_smile:


(Khalah Jones Golden) #2

Hmm, it sounds to me like you should be using saved filters and the search bar at the top to get your desired effect. Just for a count, this should be a cakewalk for Kibana.

Let me know if I need to elaborate.

Peace,
Khalah


#3

Hi Khalah,

I also thought that I should easily be able to do it with saved searches and the search bar, but I cannot do partial matches when the _all field is disabled and the message field is not_analyzed (e.g. a search like 'message:"Your event failed Cause 1"' works and returns the message, whereas a search like 'message:"Your event failed Cause"' doesn't return anything).

I need to be able to search for the substring rather than the full string, and ideally I want to keep the _all field disabled and the message field not_analyzed.

Hope that's a bit clearer, please let me know if you have any ideas. Thanks :slight_smile:


(Khalah Jones Golden) #4

Hmm, it seems to me like your message field is mapped so that it is not indexed at all. Can you confirm this please? That's the only way the field wouldn't be searchable.


#5

The message field is certainly indexed, I have checked and double-checked.

Here is a little experiment I ran today: using a data set of 500 MB, I indexed all events into 4 different indices in ES, each with a different mapping (see screenshot attached).

If you have any other ideas of what might be wrong, please let me know :slight_smile:


#6

After some more investigation, I learnt that if the 'message' field is not_analyzed, I can't query it using partial matches (i.e. searching for substrings) like I wanted to, so there's no way around that issue.

It looks like in order to be able to do partial matches I need the 'message' field to be analyzed, but in order to create a meaningful histogram of unique messages, I need the 'message' field to be not_analyzed. The answer to this dilemma is multi-fields: I ended up indexing the 'message' field as both analyzed and not_analyzed, while keeping the '_all' field disabled, like so:
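
To illustrate why the two mappings behave so differently, here is a rough pure-Python sketch. It does not talk to Elasticsearch at all: `analyzed_terms` is only an approximation of the default analyzer (lowercase, split on non-alphanumeric characters), and all function names are made up for this example.

```python
import re
from collections import Counter

messages = [
    "Your event failed Cause 1",
    "Your event failed Cause 2",
    "Your event failed Cause 1",
]

def analyzed_terms(text):
    # Rough stand-in for an analyzed string field (assumption):
    # lowercase the text and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def not_analyzed_terms(text):
    # A not_analyzed field keeps the whole value as one single term.
    return [text]

def term_counts(docs, to_terms):
    """Count documents per indexed term, roughly what a terms histogram shows."""
    counts = Counter()
    for doc in docs:
        counts.update(set(to_terms(doc)))
    return counts

def matches_phrase(doc, phrase, to_terms):
    """True if the phrase's terms occur consecutively among the doc's terms."""
    doc_terms = to_terms(doc)
    wanted = to_terms(phrase)
    n = len(wanted)
    return any(doc_terms[i:i + n] == wanted
               for i in range(len(doc_terms) - n + 1))

# Histogram buckets: whole messages vs. individual tokens.
print(term_counts(messages, not_analyzed_terms))
print(term_counts(messages, analyzed_terms))

# Substring search: only the analyzed variant can match a partial phrase.
print(matches_phrase(messages[0], "Your event failed Cause", not_analyzed_terms))
print(matches_phrase(messages[0], "Your event failed Cause", analyzed_terms))
```

In this model the not_analyzed variant gives one bucket per full message but only matches when the query equals the entire stored string, while the analyzed variant matches the partial phrase but buckets by token.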

index_mapping = {
    options.doc_type : {
        '_all' : {
            'enabled' : False
        },
        'properties' : {
            'real_time' : {'type' : 'date'},
            'channel' : {'type' : 'string', 'fields' : { 'raw': { 'type': 'string', 'index' : 'not_analyzed'} } },
            'message' : {'type' : 'string', 'fields' : { 'raw': { 'type': 'string', 'index' : 'not_analyzed'} } },
            'path' : {'type' : 'string', 'index' : 'not_analyzed'},
        },
    }
}
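
Since the multi-field definition is repeated for 'channel' and 'message', it could be factored into a small helper. This is only a sketch: `string_with_raw` and `build_mapping` are hypothetical names, and the `doc_type` argument stands in for `options.doc_type` from my script.

```python
import json

def string_with_raw():
    # Analyzed string field plus a not_analyzed 'raw' sub-field (ES 2.x syntax).
    return {
        'type': 'string',
        'fields': {'raw': {'type': 'string', 'index': 'not_analyzed'}},
    }

def build_mapping(doc_type):
    # doc_type plays the role of options.doc_type in the mapping above.
    return {
        doc_type: {
            '_all': {'enabled': False},
            'properties': {
                'real_time': {'type': 'date'},
                'channel': string_with_raw(),
                'message': string_with_raw(),
                'path': {'type': 'string', 'index': 'not_analyzed'},
            },
        }
    }

# 'logs' is a made-up document type, used here only for illustration.
print(json.dumps(build_mapping('logs'), indent=2))
```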

I can use the 'message' field to do partial matches (because it's analyzed) and the 'message.raw' field to create meaningful histograms (because it's not_analyzed). The storage requirements increase by 25% compared with TEST 1 (see image in post above), so it is a better solution than enabling the '_all' field.

Now my question becomes: why can't we do partial matches on not_analyzed fields? It seems like important functionality to have.
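
For anyone who wants the same thing outside Kibana, the combination might look roughly like the search body below: a phrase query on the analyzed 'message' field and a terms aggregation on 'message.raw'. Treat it as a sketch only; the function name, the aggregation name `by_message` and the bucket limit of 50 are my own assumptions.

```python
import json

def failed_event_histogram_body(substring, max_buckets=50):
    """Build a search body: phrase match on 'message', buckets on 'message.raw'."""
    return {
        'size': 0,  # return only the aggregation, no individual hits
        'query': {
            'match_phrase': {'message': substring}
        },
        'aggs': {
            'by_message': {
                'terms': {'field': 'message.raw', 'size': max_buckets}
            }
        },
    }

body = failed_event_histogram_body('Your event failed Cause')
print(json.dumps(body, indent=2))
```

In Kibana the equivalent is to search for message:"Your event failed Cause" in the search bar and build the visualisation with a Terms aggregation on message.raw.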

