Here's an example. If I use aggregations to search for the top 10 most
frequent messages:
POST _search
{
  "query": {
    "match": {
      "loglevel": "error"
    }
  },
  "aggs": {
    "freqent_msgs": {
      "terms": {
        "field": "message.raw",
        "size": 10
      }
    }
  }
}
I end up with a list that exhibits two undesirable characteristics. The top
3 entries are the same type of message but refer to different instances. The
remaining entries cover a few different message types, but each of them
contains a changing counter. Is there a way to overlook these differences so
that the result would be closer to the 4 message types?
"aggregations": {
  "freqent_msgs": {
    "buckets": [
      {
        "key": "Getting disk size of instance-0000bcbb: [Errno 2] No such file or directory: '/var/lib/nova/instances/9b173949-c34d-401e-a214-8e3d8ddefd46/disk'",
        "doc_count": 22599
      },
      {
        "key": "Getting disk size of instance-0000bd08: [Errno 2] No such file or directory: '/var/lib/nova/instances/a4e2c7b5-093a-494f-bdef-5b6997e7c3bb/disk'",
        "doc_count": 13447
      },
      {
        "key": "Getting disk size of instance-0000bd09: [Errno 2] No such file or directory: '/var/lib/nova/instances/ca680c42-f7c8-49ea-b46e-8864051c860c/disk'",
        "doc_count": 13447
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 60 seconds",
        "doc_count": 32
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 32 seconds",
        "doc_count": 15
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds",
        "doc_count": 12
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds",
        "doc_count": 10
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 8 seconds",
        "doc_count": 9
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 110] ETIMEDOUT. Sleeping 16 seconds",
        "doc_count": 7
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds",
        "doc_count": 7
      }
    ]
  }
}
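To illustrate the kind of grouping I'm after, here is a rough client-side sketch in Python. It is not an Elasticsearch feature: the `normalize` and `merge_buckets` helpers and the masking patterns are hypothetical, chosen only to match the messages shown above. It masks the variable parts of each bucket key and sums the counts per resulting template.

```python
import re
from collections import Counter

# Hypothetical masking rules, written for the sample messages above.
# Each rule replaces a variable fragment with a fixed placeholder.
_RULES = [
    # UUID-style directory names, e.g. 9b173949-c34d-401e-a214-8e3d8ddefd46
    (re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"), "<uuid>"),
    # Instance names, e.g. instance-0000bcbb
    (re.compile(r"instance-[0-9a-f]+"), "instance-<id>"),
    # Retry interval, e.g. "Sleeping 60 seconds"
    (re.compile(r"Sleeping \d+ seconds"), "Sleeping <n> seconds"),
]


def normalize(message):
    """Return the message with its variable fragments masked."""
    for pattern, placeholder in _RULES:
        message = pattern.sub(placeholder, message)
    return message


def merge_buckets(buckets, top_n=10):
    """Sum doc_count per normalized key and return the top_n (key, count) pairs."""
    totals = Counter()
    for bucket in buckets:
        totals[normalize(bucket["key"])] += bucket["doc_count"]
    return totals.most_common(top_n)
```

Passing the "buckets" list from the response above to `merge_buckets` collapses the 10 entries into the 4 message types. The obvious limitation is that this only merges buckets that were already returned, so a message type whose individual variants never reach the top 10 would still be missed, which is why I'm hoping for a way to do this inside the aggregation itself.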
Thanks,
John
On Monday, April 7, 2014 4:26:59 PM UTC-7, John Stanford wrote:
Hi,
I have a bunch of text events indexed as a message field, and in many
cases, they are similar but not exactly the same. Is there a way to return
the top n most frequently occurring similar phrases, and if so, how would I
control the definition of similar?
Thanks,
John