Aggregate similar documents

I have documents like the following examples:

"message":"Syslog connection established; fd='15', server='AF_INET(111.103.111.65:1514)', local='AF_INET(0.0.0.0:0)'"
"message":"Syslog connection established; fd='14', server='AF_INET(222.103.111.65:1514)', local='AF_INET(0.0.0.0:0)'"
"message":"Syslog connection established; fd='15', server='AF_INET(333.228.333.64:1514)', local='AF_INET(0.0.0.0:0)'"
"message":"Syslog connection established; fd='14', server='AF_INET(444.444.333.64:1514)', local='AF_INET(0.0.0.0:0)'"
"message":"[0x00001111] [WEBCONSOLE] >> Error:  CONSOLE ERROR ERROR TypeError: null is not an object"
"message":"[0x00002222] [WEBCONSOLE] >> Error:  CONSOLE ERROR ERROR TypeError: null is not an object" 

I want to aggregate similar documents, so that I get counts like this:

4 "message":"Syslog connection established...."
2 "message":"...CONSOLE ERROR ERROR TypeError...."

The intention is to do frequency analysis of syslog messages.
I want to identify the most repeated messages in the dataset.
The problem is that I can't use a terms aggregation on the message keyword field, because lines that differ only in an IP address (or another variable value) are not aggregated together.
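To illustrate the problem above, here is a minimal Python sketch of one possible workaround: mask the variable parts of each message before counting, so that lines differing only in an IP address, a hex ID, or a number collapse into the same template. The function names (`to_template`, `count_templates`) and the masking patterns are assumptions derived from the sample lines only. This is not an Elasticsearch feature and has not been tried on the real dataset.

```python
import re
from collections import Counter

# Hypothetical masking rules, guessed from the sample messages above.
# Order matters: IP addresses are masked before plain numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),  # IPv4, optional :port
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),                   # hex IDs such as 0x00001111
    (re.compile(r"\b\d+\b"), "<NUM>"),                              # remaining standalone numbers
]

def to_template(message: str) -> str:
    """Replace variable tokens with placeholders so similar lines become identical."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message

def count_templates(messages):
    """Count how often each masked template occurs."""
    return Counter(to_template(m) for m in messages)
```

If the same masking were applied at ingest time and stored in a separate keyword field, an ordinary terms aggregation on that field should then give the counts I am after, but that part is an untested assumption.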

I did it with a filter aggregation, but every filter has to be written manually, which is time consuming, and I feel I am missing many important messages. For example:

"message":"*Syslog connection established*"

The dataset is huge: 600k documents/sec.

I am looking for some kind of function that would group messages in which about 70% of the words are the same.
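To make the "70% similar words" idea concrete, here is a rough Python sketch under my own assumptions: a "word" is a run of letters or underscores (so numbers and IP addresses are ignored), similarity is the Jaccard ratio of the two word sets, and each message joins the first existing group whose representative it matches at the threshold. The names `word_set`, `jaccard`, and `group_similar` are hypothetical. It compares each message against every group, so it is only meant for a small sample, not for 600k documents/sec.

```python
import re

# Assumption: a "word" is a run of letters/underscores; digits are ignored.
WORD = re.compile(r"[A-Za-z_]+")

def word_set(message):
    """Return the set of distinct words in a message."""
    return frozenset(WORD.findall(message))

def jaccard(a, b):
    """Share of words two sets have in common (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def group_similar(messages, threshold=0.7):
    """Greedily group messages whose word sets are at least `threshold` similar.

    Returns (representative_message, count) pairs, most frequent first.
    """
    groups = []  # each entry: [representative_words, representative_message, count]
    for message in messages:
        words = word_set(message)
        for group in groups:
            if jaccard(words, group[0]) >= threshold:
                group[2] += 1
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([words, message, 1])
    return sorted(((g[1], g[2]) for g in groups), key=lambda pair: -pair[1])
```

Because the grouping is greedy, the result can depend on message order, and the first message seen becomes the representative of its group.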

Thank you.
