Histogram based on partial matching in Kibana: field not_analyzed + _all disabled

alex_tz · March 23, 2016, 10:55am

Hey folks, I would like to use Kibana to create a histogram of all the different messages that contain a certain substring.

For example, the data contains messages like "Your event failed Cause 1", "Your event failed Cause 2", etc., where the integers are error codes. I would like to search for the substring "Your event failed Cause" and have Kibana output a histogram of the messages containing that substring, which will basically tell me how many events failed because of error code 1, how many because of error code 2, etc.

I would like to do this under the circumstances where

the _all field is disabled (to save storage space; based on my tests, enabling the _all field would increase my storage requirements by 35%), and
the message field is not_analyzed.

This is what I've tried so far:

my ideal case: disable the _all field and set the message field as not_analyzed => searching for a substring does not return any results.
disable the _all field but set the message field as analyzed => searching for the substring returns the expected results, but I cannot plot the histogram, because it plots the count of the individual tokens in each message rather than the count of each entire message string.
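
To illustrate the two outcomes above, here is a minimal pure-Python sketch of what ends up in the index in each case. It does not call Elasticsearch; `analyze` is a hypothetical, simplified stand-in for the standard analyzer (lowercase, split on non-alphanumeric characters), and the three sample messages are made up from the example in the question.

```python
import re
from collections import Counter

# Sample messages modelled on the example in the question (illustrative data).
messages = [
    "Your event failed Cause 1",
    "Your event failed Cause 1",
    "Your event failed Cause 2",
]

def analyze(text):
    """Simplified stand-in for an analyzer: lowercase and split into tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_terms(docs, analyzed):
    """Return a Counter mapping each indexed term to its document count."""
    counts = Counter()
    for doc in docs:
        # not_analyzed: the whole string is stored as a single term.
        terms = analyze(doc) if analyzed else [doc]
        counts.update(set(terms))
    return counts

not_analyzed = index_terms(messages, analyzed=False)
analyzed = index_terms(messages, analyzed=True)

# Case 1: the substring is not itself a term, so an exact term lookup misses.
print("Your event failed Cause" in not_analyzed)

# Case 1 still gives the histogram we want: one bucket per full message.
print(not_analyzed.most_common())

# Case 2: buckets are built from individual tokens, not whole messages.
print(analyzed.most_common())
```

In the not_analyzed case the buckets are whole messages but the substring is not a term that can be looked up; in the analyzed case the tokens are searchable but the buckets become "your", "event", "failed" and so on.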

Is there a better way to do this?

My setup is as follows:

A Python script indexes logs into Elasticsearch 2.2.0 using the following mapping:
index_mapping = {
    options.doc_type : {
        '_all' : {
            'enabled' : False
        },
        'properties' : {
            'relative_time' : {'type' : 'string'},
            'real_time' : {'type' : 'date'},
            'channel' : {'type' : 'string', 'index' : 'not_analyzed'},
            'message' : {'type' : 'string', 'index' : 'not_analyzed'},
            'path' : {'type' : 'string'},
        },
    }
}
I then use Kibana 4.4.1 to visualise the data.

Thanks

Khalah_Jones_Golden · March 23, 2016, 3:46pm

Hmm, it sounds to me like you should be using saved filters and the search bar at the top to get your desired effect. For just a count, this should be a cakewalk for Kibana.

Let me know if I need to elaborate.

Peace,
Khalah

alex_tz · March 23, 2016, 4:11pm

Hi Khalah,

I also thought that I should easily be able to do it with saved searches and the search bar, but I cannot do partial matches when the _all field is disabled and the message field is not_analyzed (e.g. a search like 'message:"Your event failed Cause 1"' works and returns the message, whereas a search like 'message:"Your event failed Cause"' doesn't return anything).

I need to be able to search for the substring rather than the full string, and ideally I want to keep the _all field disabled and the message field not_analyzed.

Hope that's a bit clearer. Please let me know if you have any ideas. Thanks

Khalah_Jones_Golden · March 23, 2016, 5:44pm

Hmm, it seems to me like your message field is mapped so that it is not indexed at all. Can you confirm this please? That's the only way the field wouldn't be searchable.

alex_tz · March 24, 2016, 12:20pm

The message field is certainly indexed; I have checked and double-checked.

Here is a little experiment I ran today: using a data set of 500MB, I indexed all events into 4 different indices in ES, using different mappings (see screenshot attached).

If you have any other ideas of what might be wrong, please let me know.

alex_tz · March 29, 2016, 3:36pm

After some more investigation, I learnt that if the 'message' field is not_analyzed, I can't query it using partial matches (i.e. searching for substrings) like I wanted to, so there's no way around that issue.

It looks like in order to be able to do partial matches I need the 'message' field to be analyzed, but in order to create a meaningful histogram of unique messages, I need the 'message' field to be not_analyzed. The answer to this dilemma is multi-fields: I ended up indexing the 'message' field as both analyzed and not_analyzed, while keeping the '_all' field disabled, like so:

index_mapping = {
    options.doc_type : {
        '_all' : {
            'enabled' : False
        },
        'properties' : {
            'real_time' : {'type' : 'date'},
            'channel' : {'type' : 'string', 'fields' : { 'raw': { 'type': 'string', 'index' : 'not_analyzed'} } },
            'message' : {'type' : 'string', 'fields' : { 'raw': { 'type': 'string', 'index' : 'not_analyzed'} } },
            'path' : {'type' : 'string', 'index' : 'not_analyzed'},
        },
    }
}

I can use the 'message' field to do partial matches (because it's analyzed) and the 'message.raw' field to create meaningful histograms (because it's not_analyzed). Compared with TEST 1 (see image in post above), the storage requirements increase by 25%, so it is a better solution than enabling the '_all' field, which cost 35% in my tests.
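
For anyone who wants to reproduce this outside Kibana, here is a rough sketch of a request body that combines the two fields: a phrase match on the analyzed 'message' field and a terms aggregation on 'message.raw'. The helper name `build_histogram_query`, the aggregation name `messages`, and the `top_n` parameter are my own illustrative choices, not part of the setup above; the body only builds and prints a dictionary, and sending it to a cluster is left out.

```python
import json

def build_histogram_query(substring, top_n=10):
    """Build a search body: filter on the analyzed field, bucket on the raw one.

    Hypothetical helper for illustration; field names follow the mapping above.
    """
    return {
        # We only want the aggregation buckets, not the matching documents.
        "size": 0,
        # Phrase match works because 'message' is analyzed into tokens.
        "query": {"match_phrase": {"message": substring}},
        "aggs": {
            "messages": {
                # One bucket per distinct full message string.
                "terms": {"field": "message.raw", "size": top_n}
            }
        },
    }

body = build_histogram_query("Your event failed Cause")
print(json.dumps(body, indent=2))
```

In Kibana itself the equivalent is roughly a search for message:"Your event failed Cause" in the search bar plus a terms aggregation on message.raw in the visualisation.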

Now my question becomes: why can't we do partial matches on not_analyzed fields? It seems like important functionality to have.
