I have documents like the following examples:
"message":"Syslog connection established; fd='15', server='AF_INET(111.103.111.65:1514)', local='AF_INET(0.0.0.0:0)'"
"message":"Syslog connection established; fd='14', server='AF_INET(222.103.111.65:1514)', local='AF_INET(0.0.0.0:0)'"
"message":"Syslog connection established; fd='15', server='AF_INET(333.228.333.64:1514)', local='AF_INET(0.0.0.0:0)'"
"message":"Syslog connection established; fd='14', server='AF_INET(444.444.333.64:1514)', local='AF_INET(0.0.0.0:0)'"
"message":"[0x00001111] [WEBCONSOLE] >> Error: CONSOLE ERROR ERROR TypeError: null is not an object"
"message":"[0x00002222] [WEBCONSOLE] >> Error: CONSOLE ERROR ERROR TypeError: null is not an object"
I want to aggregate similar documents and get counts like this:
4 "message":"Syslog connection established...."
2 "message":"...CONSOLE ERROR ERROR TypeError...."
The intention is to do frequency analysis of syslog messages.
I want to identify the most frequently repeated messages in the dataset.
The problem is that I can't use a terms aggregation on the message keyword field, because lines that differ only in the IP address are not aggregated together.
I worked around this with filter aggregations, but each filter has to be written by hand, which is time consuming, and I suspect I am missing many important messages. For example:
"message":"*Syslog connection established*"
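One way to avoid writing such wildcard filters by hand might be to mask the variable parts of each message (IP addresses, hex IDs, numbers) before counting, so that lines differing only in those parts collapse into one key. The following is a minimal Python sketch of that idea, not a definitive implementation: the masking patterns, the placeholder names and the helpers `to_template` and `count_templates` are all assumptions made for illustration and would need tuning against real data.

```python
import re
from collections import Counter

# Hypothetical masking rules (assumptions): applied in order, most specific first.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),  # IPv4, optional :port
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),                        # hex identifiers
    (re.compile(r"\b\d+\b"), "<NUM>"),                               # remaining numbers
]


def to_template(message: str) -> str:
    """Replace variable-looking parts of a message with fixed placeholders."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message


def count_templates(messages):
    """Count how often each masked template occurs."""
    return Counter(to_template(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "Syslog connection established; fd='15', server='AF_INET(111.103.111.65:1514)', local='AF_INET(0.0.0.0:0)'",
        "Syslog connection established; fd='14', server='AF_INET(222.103.111.65:1514)', local='AF_INET(0.0.0.0:0)'",
        "[0x00001111] [WEBCONSOLE] >> Error: CONSOLE ERROR ERROR TypeError: null is not an object",
    ]
    for template, count in count_templates(sample).most_common():
        print(count, template)
```

If the masked template were stored as its own keyword field at ingest time, an ordinary terms aggregation on that field could then do the counting, which should scale better than comparing messages at query time.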
The dataset is huge: 600k documents/sec.
I am looking for some kind of function that can
group messages that share about 70% of their words.
Thank you.
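To make the "70% similar words" idea concrete, here is a rough Python sketch that groups messages greedily by Jaccard similarity of their word sets. It is only an illustration under stated assumptions: the tokenizer keeps alphabetic words of two or more letters and ignores numbers, each group is represented by the first message that created it, and the names `words`, `jaccard` and `group_similar` are hypothetical. It compares every message against every existing group, so it is only realistic on a sample, not on the full stream.

```python
import re

# Assumption: only alphabetic words (2+ letters) count; numbers and IDs are ignored.
WORD = re.compile(r"[A-Za-z_]{2,}")


def words(message):
    """Return the set of lower-cased words in a message."""
    return set(WORD.findall(message.lower()))


def jaccard(a, b):
    """Share of words common to both sets, relative to all words in either set."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(messages, threshold=0.7):
    """Greedily assign each message to the first group whose representative is similar enough."""
    groups = []  # each entry: [representative word set, example message, count]
    for msg in messages:
        w = words(msg)
        for group in groups:
            if jaccard(w, group[0]) >= threshold:
                group[2] += 1
                break
        else:
            # No existing group is similar enough, so this message starts a new one.
            groups.append([w, msg, 1])
    # Most frequent groups first, as (example message, count) pairs.
    return sorted(((g[1], g[2]) for g in groups), key=lambda pair: -pair[1])
```

Calling `group_similar(sample_messages)` on a list of message strings returns one example message per group together with its count, ordered from most to least frequent.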