Issue with split filter plugin since 6.x

Hi all,

I am experiencing a very strange issue with the parsing of one of our logs since version 6.x.
The issue is the same for 6.1.0, 6.1.2 and 6.2.0; other 6.x versions have not been tested.

We have a logfile with following format:

2017-12-03 00:01:56,936 - activeVehicle=0      MsgCount/Hour=655.3     QueueSum=27       SDB_STORE_TKM=0    SDB_STORE_TKM_1=0    SDB_STORE_TKM_2=0    SDB_STORE_TKM_3=0    SDB_STORE_TKM_4=0    SDB_STORE_TKM_5=0    SDB_STORE_TKM_0=0    TAP_DEBUG=0    TAP_VIC=0    TAP_VIC_1=0    TAP_VIC_2=0    TAP_VIC_3=0    TAP_VIC_4=0    VIC_HANDLER=0    VIC_HANDLER_1=2    VIC_HANDLER_2=2    VIC_HANDLER_3=16   VIC_HANDLER_4=6    VIC_HANDLER_5=0    VIC_HANDLER_6=0    VIC_HANDLER_7=0    VIC_HANDLER_8=0    VIC_HANDLER_9=0    EMAIL_SUBMIT=0    AMESTLOG=0    COM_GWY_MTSUBMT=0    MSG_OUT=0    PROC_SMS=0    QERROR=0    SMS_OUT=0    TTP_IN=0    TTP_IN_1=1    TTP_IN_2=0    TTP_IN_3=0    TTP_IN_4=0    USERACTIONLOG=0    SDB_PROC_TKM=0    SMARTCARD_IN=0    RD_TAP=0    TAP_MESSAGE=0    MSG_HANDLER=0    TAP_WORKFLOW=0    DATAHUB_HANDLER=0    txs_amestlog=0    txs_com_gwy=0    txs_datahub_query=0    txs_db_ccs=11   txs_db_cdr=0    txs_db_fastservices=0    txs_db_locate=0    txs_db_map=0    txs_db_message=2    txs_db_messaging=0    txs_db_normalservices=218  txs_db_protocol=0    txs_db_rd_autoorders=15   txs_db_rd_services=7    txs_db_route=0    txs_db_slowservices=257  txs_db_smartcard=102  txs_db_ttp=3    txs_db_ttp_out=0    txs_db_vic=9    txs_dtco_auth=0    txs_email_gw=0    txs_errque=0    txs_revlocate=0    txs_sessionservices=12   txs_sms=0    txs_smsc_http=0    txs_tap_debug=0    txs_tap_dispatcher=2    txs_tap_messaging=0    txs_tap_vicadapter=4    txs_tap_workflowadapter=0    txs_ttp_legacy=0    txs_ttp_logistics=0    txs_ttp_rd_response_app=0    txs_ttp_remotedownload=9    txs_ttp_test=0    txs_useractionlog=0    

This logfile contains a list of key-value pairs, separated by whitespace. My parsing splits each key-value pair into its own event and outputs them.
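
For example, the intended transformation for the first two pairs of the key-value section looks like this (a sketch; field names as produced by the filter below):

	QueueSum=27        ->  event with TuxPerfLogKey = "QueueSum",      TuxPerfLogValue = 27
	SDB_STORE_TKM=0    ->  event with TuxPerfLogKey = "SDB_STORE_TKM", TuxPerfLogValue = 0
	...                    (82 events per input line in total)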

Here is the relevant part of my filter:

filter
{

	########################################
	# Step 1 (event arrives in Logstash for the 1st time)
	########################################

	# check for type and that the tag "splitted" is NOT in the tags -> Step 1
	if ([logType] == 'performancelog' or [type] == 'performancelog')  and "splitted" not in [tags]
	{
		# extract logtime and the keyValuePart
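		# e.g. for the sample line above this yields logTime = "2017-12-03 00:01:56"
		# and keyValuePart = "QueueSum=27 ... txs_useractionlog=0" (everything after MsgCount/Hour)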
		grok 
		{
			id => "performancelog:grok:1"
			match => [ 'message','^(?<logTime>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d*\s- activeVehicle=\d+\s+MsgCount\/Hour=\d+\.\d+\s*(?<keyValuePart>.*)']
		}
		
		mutate
		{
			id => "performancelog:mutate:1"
			# remove whitespaces, add delimiter ";" between fields, even if there is no whitespace
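			# e.g. "QueueSum=27       SDB_STORE_TKM=0" -> "QueueSum=27;SDB_STORE_TKM=0;"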
			gsub => [ "keyValuePart", "(\s*\w+=\d+)\s*", "\1;"]
			
			# drop the message field -> we don't need it anymore
			remove_field => ['message']
		}
		
		# parse the date to @timestamp
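		# e.g. logTime = "2017-12-03 00:01:56" (Europe/Berlin) -> @timestamp = 2017-12-02T23:01:56.000Z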
		date 
		{
			id => "performancelog:date:logTime"
			match => ['logTime', 'YYYY-MM-dd HH:mm:ss']
			timezone => "Europe/Berlin"
			remove_field => ['logTime']
		}
		
		# split the key value field to multiple events
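		# e.g. "QueueSum=27;SDB_STORE_TKM=0;" becomes two events, one with
		# keyValuePart = "QueueSum=27" and one with keyValuePart = "SDB_STORE_TKM=0"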
		split
		{
			id => "performancelog:split:1"
			field => "keyValuePart"
			terminator => ";"
			add_tag => ['splitted']
		}
	}

	########################################
	# Step 2 (processing the splitted events) 
	########################################

	# check for type and that the tag "splitted" is in the tags -> Step 2
	if ([logType] == 'performancelog' or [type] == 'performancelog') and "splitted" in [tags]
	{
		# each keyValuePart now contains a single key=value pair -> parse it
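		# e.g. keyValuePart = "txs_db_ccs=11" -> TuxPerfLogKey = "txs_db_ccs", TuxPerfLogValue = "11"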
		grok 
		{
			id => "performancelog:grok:2"
			match => [ 'keyValuePart','^(?<TuxPerfLogKey>[^=]+)=(?<TuxPerfLogValue>\d+)']
		}
		
		mutate
		{
			id => "performancelog:mutate:2"
			# convert to integer
			convert => { "TuxPerfLogValue" => "integer" }
			
			# remove fields and tags which are not needed any longer
			remove_field => ['keyValuePart']
			remove_tag => ['splitted']
		}
		
	}
}
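
For reference, after Step 2 each split event should end up roughly like this (a sketch for the QueueSum pair of the sample line; the logTime 00:01:56 Europe/Berlin becomes 23:01:56 UTC of the previous day, and host/path metadata is omitted):

	{
		"@timestamp": "2017-12-02T23:01:56.000Z",
		"logType": "performancelog",
		"TuxPerfLogKey": "QueueSum",
		"TuxPerfLogValue": 27
	}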

Normally, a single input line produces 82 events. I have created a test set of 20k lines.
With Logstash 5.2.0 I get 1,640,000 JSON lines of output (using the file output during testing), which is the expected 20,000 x 82.
With Logstash 6.1.0 I get 9,530,040 lines of output.

The Logstash 6.x pipeline statistics (node stats API) report the following per-filter numbers:

{
	"id": "performancelog:date:logTime",
	"events": {
		"duration_in_millis": 10277,
		"in": 116220,
		"out": 116220
	},
	"matches": 116220,
	"name": "date"
},
{
	"id": "performancelog:split:1",
	"events": {
		"duration_in_millis": 185167,
		"in": 116220,
		"out": 9530040
	},
	"name": "split"
},
{	
	"id": "performancelog:grok:2",
	"events": {
		"duration_in_millis": 520538,
		"in": 9530040,
		"out": 9530040
	},
	"matches": 9530040,
	"failures": 0,
	"patterns_per_field": {
		"keyValuePart": 1
	},
	"name": "grok"
},
{
	"id": "performancelog:mutate:2",
	"events": {
		"duration_in_millis": 139577,
		"in": 9530040,
		"out": 9530040
	},
	"name": "mutate"
},
{
	"id": "performancelog:grok:1",	
	"events": {
		"duration_in_millis": 11598,
		"in": 116220,
		"out": 116220
	},
	"matches": 116220,
	"failures": 0,
	"patterns_per_field": {
		"message": 1
	},
	"name": "grok"
},
{
	"id": "performancelog:mutate:1",
	"events": {
		"duration_in_millis": 9072,
		"in": 116220,
		"out": 116220
	},
	"name": "mutate"
},
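
Checking those numbers (a quick sanity calculation):

	20,000 lines x 82 pairs/line = 1,640,000 events   (what 5.2.0 produces)
	9,530,040 / 82               = 116,220            (the "in" count of the split filter)
	116,220 / 20,000             = 5.811              (unexplained duplication factor)

So the split output is exactly 82 times its input; the surplus events already exist when they reach grok:1, which likewise reports 116,220 events in.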

Also very strange: if I ship only the first 2k lines, everything is fine, but when shipping the first 5k lines, the problem occurs.
When shipping lines 2k-4k or 4k-6k, everything is also fine. I don't get it and need your help.

I will strip my config down so that only the problematic logfile is configured, and I will try to create a well-reproducible test case so that Elastic can fix it, if it is a bug and not a configuration issue.
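
A minimal sketch of what I have in mind (the stdin input and the output path are just placeholders for testing; the filter block is the one shown above):

	input
	{
		stdin { type => "performancelog" }
	}

	filter
	{
		# ... the filter block shown above, unchanged ...
	}

	output
	{
		file
		{
			# placeholder path, only used to count the emitted lines afterwards
			path => "/tmp/performancelog-split-test.json"
		}
	}

Counting the lines of that file (e.g. with wc -l) after feeding in the 20k test lines should then show whether this pipeline alone produces the expected 1,640,000 or the faulty 9,530,040.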

I'll keep you updated.
