Most frequently occurring phrases?

Hi,

I have a bunch of text events indexed as a message field, and in many
cases, they are similar but not exactly the same. Is there a way to return
the top n most frequently occurring similar phrases, and if so, how would I
control the definition of similar?

Thanks,
John

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/861cd17c-3897-4fd0-8a66-847f7cabdb8a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Here's an example. If I use aggregations to search for the top 10 most
frequent messages:

POST _search
{
  "query": {
    "match": {
      "loglevel": "error"
    }
  },
  "aggs": {
    "freqent_msgs": {
      "terms": {
        "field": "message.raw",
        "size": 10
      }
    }
  }
}

I end up with a list that exhibits two undesirable characteristics. The top
3 entries are the same type of message, but refer to different instances. The
remaining messages are a few different types, but each of them contains a
counter that varies. Is there a way to overlook these differences so the
result would be closer to the 4 message types?

"aggregations": {
  "freqent_msgs": {
    "buckets": [
      {
        "key": "Getting disk size of instance-0000bcbb: [Errno 2] No such file or directory: '/var/lib/nova/instances/9b173949-c34d-401e-a214-8e3d8ddefd46/disk'",
        "doc_count": 22599
      },
      {
        "key": "Getting disk size of instance-0000bd08: [Errno 2] No such file or directory: '/var/lib/nova/instances/a4e2c7b5-093a-494f-bdef-5b6997e7c3bb/disk'",
        "doc_count": 13447
      },
      {
        "key": "Getting disk size of instance-0000bd09: [Errno 2] No such file or directory: '/var/lib/nova/instances/ca680c42-f7c8-49ea-b46e-8864051c860c/disk'",
        "doc_count": 13447
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 60 seconds",
        "doc_count": 32
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 113] EHOSTUNREACH. Sleeping 32 seconds",
        "doc_count": 15
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds",
        "doc_count": 12
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds",
        "doc_count": 10
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 8 seconds",
        "doc_count": 9
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 110] ETIMEDOUT. Sleeping 16 seconds",
        "doc_count": 7
      },
      {
        "key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 1 seconds",
        "doc_count": 7
      }
    ]
  }
}

Thanks,
John

On Monday, April 7, 2014 4:26:59 PM UTC-7, John Stanford wrote:

Hi,

I have a bunch of text events indexed as a message field, and in many
cases, they are similar but not exactly the same. Is there a way to return
the top n most frequently occurring similar phrases, and if so, how would I
control the definition of similar?

Thanks,
John

Hey,

As these two sample messages are very different in nature, it is hard to use
something like scripting to cut those messages off after a certain length
as a workaround. I would go with some sort of preprocessing (maybe using
Logstash), where you give each message a certain type/identifier and facet
on that one.

--Alex
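
For illustration, the preprocessing idea could be sketched in Python roughly as follows, assuming a step that runs before indexing (Logstash filters could do the equivalent). The masking patterns, the placeholder tokens, and the message_type helper are hypothetical, written against the sample output earlier in the thread; none of them are part of Elasticsearch or Logstash.

```python
import re

# Hypothetical masking rules, ordered from most to least specific.
# Each one replaces a variable part of the message with a placeholder.
MASKS = [
    # UUIDs such as 9b173949-c34d-401e-a214-8e3d8ddefd46
    (re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}"), "<uuid>"),
    # Instance names such as instance-0000bcbb
    (re.compile(r"instance-[0-9a-f]+"), "instance-<id>"),
    # Retry counters such as "Sleeping 60 seconds"
    (re.compile(r"Sleeping \d+ seconds"), "Sleeping <n> seconds"),
]


def message_type(message: str) -> str:
    """Return the message with its variable parts masked.

    Messages that differ only in the masked parts get the same type.
    """
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message
```

The returned string could be stored in its own non-analyzed field (the field name is up to you) and used in the terms aggregation instead of message.raw. The list of masks is where the definition of "similar" is controlled: with the three rules above, the errno and error name are left alone, so the ten sample buckets would collapse into four types.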

On Wed, Apr 9, 2014 at 7:34 PM, John Stanford jxstanford@gmail.com wrote:

Here's an example. If I use aggregations to search for the top 10 most
frequent messages:

Hi Alex,

Yeah, I'm doing that with some other message types, but was hoping to limit that to select messages with metrics in them. I may look into some post-processing strategies, and will keep searching for a reasonable solution within Elasticsearch.

Thanks,

John
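
As a rough sketch of what such post-processing might look like, the buckets returned by the terms aggregation could be merged on the client after a normalizing function is applied to each key. The merge_buckets and drop_sleep_counter names are hypothetical, and the normalizer shown hides only the retry counter; it is an assumption about what should count as "the same message", not a recommendation.

```python
import re
from collections import Counter


def merge_buckets(buckets, normalize):
    """Sum doc_count over buckets whose normalized keys are equal."""
    totals = Counter()
    for bucket in buckets:
        totals[normalize(bucket["key"])] += bucket["doc_count"]
    # most_common() returns (key, count) pairs, largest count first
    return totals.most_common()


def drop_sleep_counter(key):
    """Hypothetical normalizer: mask only the retry counter."""
    return re.sub(r"Sleeping \d+ seconds", "Sleeping <n> seconds", key)


# Three of the buckets from the response earlier in the thread.
buckets = [
    {"key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 2 seconds", "doc_count": 12},
    {"key": "Unable to connect to AMQP server: [Errno 111] ECONNREFUSED. Sleeping 4 seconds", "doc_count": 10},
    {"key": "Unable to connect to AMQP server: [Errno 110] ETIMEDOUT. Sleeping 16 seconds", "doc_count": 7},
]
merged = merge_buckets(buckets, drop_sleep_counter)
```

One caveat with this approach: the terms aggregation only returns the top "size" buckets, so the merged counts cover only the buckets that came back. Requesting a larger size before merging would reduce that effect.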

On Apr 10, 2014, at 11:05 PM, Alexander Reelsen alr@spinscale.de wrote:

Hey,

As these two sample messages are very different in nature, it is hard to use something like scripting to cut those messages off after a certain length as a workaround. I would go with some sort of preprocessing (maybe using Logstash), where you give each message a certain type/identifier and facet on that one.

--Alex
