Searching email


(Mark Fletcher) #1

Hi,

We're using ES to index email, specifically mailing list messages. We'd
like search to work similar to Gmail in that we'd like to match on either
the subject or body of the email, and if it matches on the subject, we only
want to display one result for that match (say the first message in that
thread). In our naive implementation, we have an ES index for subjects and
another for message bodies. But that gets us two sets of results, not
combined. Is there a better way to structure the data, or a query that
we're missing so that we get one set of combined results?

Thanks,
Mark

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d44932e2-dc6a-4574-a458-c02edbe9a13e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Tihomir Lichev) #2

Wouldn't it be better to create a single document for each mail, with
"subject" and "body" fields (and whatever else you need from the mail)?
That way you can search on any or all of the fields, and you can also
define a boost for each field. For instance, a mail that matches on the
subject will score higher than one that matches only on the body, and you
will get a single set of results.
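
A minimal sketch of the single-document idea, assuming each mail is indexed with "subject" and "body" fields. The function name and the boost value of 2.0 are illustrative assumptions, not something from this thread; the sketch only builds the request body and does not talk to a cluster.

```python
def build_email_query(text, subject_boost=2.0):
    """Build a search body that matches `text` in either field.

    Assumes documents with "subject" and "body" fields. The default
    boost is arbitrary and would need tuning against real data.
    """
    return {
        "query": {
            "multi_match": {
                "query": text,
                # "subject^2.0" weights subject matches above body matches.
                "fields": [f"subject^{subject_boost}", "body"],
            }
        }
    }


if __name__ == "__main__":
    import json
    print(json.dumps(build_email_query("release notes"), indent=2))
```

Because both fields live in one document, a single request returns one ranked list instead of two separate result sets.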



(Mark Fletcher) #3

Thanks for your response. If I do as you suggested, a subject match will
return all the messages in that thread (because they all match). I want the
search results to only contain one result if there's a thread match.

I suppose I could just grab all the results and then 'collapse' the thread
matches, but I was hoping to be able to do something better.

Thanks,
Mark
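
The client-side "collapse" mentioned above could look roughly like this. It is a sketch that assumes each hit has already been reduced to a dict carrying a `thread_id` key (a hypothetical shape) and that the hits arrive in relevance order.

```python
def collapse_threads(hits):
    """Keep only the best-ranked hit per thread, preserving rank order.

    `hits` is a list of dicts in relevance order; each dict is assumed
    to carry a "thread_id" key.
    """
    seen = set()
    collapsed = []
    for hit in hits:
        thread_id = hit["thread_id"]
        if thread_id in seen:
            # A better-ranked message from this thread was already kept.
            continue
        seen.add(thread_id)
        collapsed.append(hit)
    return collapsed
```

Note that this keeps the highest-ranked message of each thread, not necessarily the first one, and that collapsing after the fact makes page sizes uneven: a page of ten raw hits may shrink to far fewer threads.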



(Tihomir Lichev) #4

So how can you distinguish the first email in a thread?
Do you have some additional parameter?



(Mark Fletcher) #5

Each thread has a unique integer id (so, every message in a given thread
has a particular thread id). And each email has a unique integer id as
well.



(Tihomir Lichev) #6

So you should be able to use an aggregation to get the first email from
each thread.
Something like:

{
  "aggs": {
    "threads": {
      "terms": {
        "field": "thread_id"
      },
      "aggs": {
        "first_email": {
          "min": {
            "field": "email_id"
          }
        }
      }
    }
  }
}
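
To show how the result of such an aggregation might be consumed, here is a small sketch that reads the buckets out of a response. The `sample_response` below is hand-written for illustration and is not real output; the helper name is also an assumption.

```python
def first_email_per_thread(response):
    """Map each thread id to the smallest email id among matching messages."""
    buckets = response["aggregations"]["threads"]["buckets"]
    # A min aggregation reports its value as a float, so convert back to int.
    return {b["key"]: int(b["first_email"]["value"]) for b in buckets}


# Hand-written example in the shape of a terms + min aggregation response.
sample_response = {
    "aggregations": {
        "threads": {
            "buckets": [
                {"key": 7, "doc_count": 3, "first_email": {"value": 101.0}},
                {"key": 9, "doc_count": 1, "first_email": {"value": 240.0}},
            ]
        }
    }
}

if __name__ == "__main__":
    print(first_email_per_thread(sample_response))
```

The aggregation only yields ids, so the matching messages would still have to be fetched in a second step to display them.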



(Mark Fletcher) #7

Thanks again for your response. I don't have much experience with
aggregations, but wouldn't that just give me a set of thread IDs ordered
by how many messages are in each thread? In my results, it's possible to
have a match on a message body be ranked higher than a match on a subject.
Using this aggregation, wouldn't this just end up showing all subject
matches first?

Thanks,
Mark
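
On the ordering concern: terms buckets are sorted by document count by default, but they can be ordered by a single-value metric sub-aggregation instead. The sketch below orders threads by their best-scoring message and uses a `top_hits` sub-aggregation to pull the earliest matching message of each thread. The names `best_score` and `build_thread_search` are assumptions, the exact script syntax for reading `_score` differs between Elasticsearch versions, and none of this has been run against a cluster.

```python
def build_thread_search(text, max_threads=10):
    """Build a request that returns one bucket per thread, best score first."""
    return {
        "size": 0,  # only the aggregation buckets are of interest
        "query": {
            "multi_match": {"query": text, "fields": ["subject", "body"]}
        },
        "aggs": {
            "threads": {
                "terms": {
                    "field": "thread_id",
                    "size": max_threads,
                    # Rank threads by their best match, not by message count.
                    "order": {"best_score": "desc"},
                },
                "aggs": {
                    # Script syntax is version dependent; adjust as needed.
                    "best_score": {"max": {"script": {"source": "_score"}}},
                    # Earliest matching message in the thread.
                    "first_email": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"email_id": {"order": "asc"}}],
                        }
                    },
                },
            }
        },
    }
```

With this shape, a thread whose only match is in a message body can still outrank a thread that matches on the subject, because bucket order follows relevance rather than thread size.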



(Tihomir Lichev) #8

I haven't tested such an aggregation, but I think that, the way I wrote
it, it should give you the oldest email in each thread that matches the
request. I'm not sure how they will be sorted, though.
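
For comparison, later Elasticsearch releases than the ones current at the time of this thread offer field collapsing through a `collapse` search parameter, which keeps one hit per distinct field value while preserving relevance order. A minimal sketch, assuming `thread_id` is a numeric or keyword field; the function name and the subject boost are illustrative.

```python
def build_collapsed_search(text):
    """Build a request that returns at most one hit per thread."""
    return {
        "query": {
            "multi_match": {"query": text, "fields": ["subject^2", "body"]}
        },
        # One hit per distinct thread_id, ranked like a normal search.
        "collapse": {"field": "thread_id"},
    }
```

The hit kept for each thread is the best-ranked one, which is not necessarily the first message of the thread.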


