I am trying to create an Elasticsearch mapping to index an email. I've read the ES documentation on modelling relationships, but I am confused about what to use to represent recipients.
Each email recipient is something simple, consisting of a displayName and an emailAddress.
The ES documentation gives a strong warning to use a parent-child relationship ONLY when it is really needed and all other options are exhausted. In particular, it says to use parent-child when there are a few parents with many children.
Most emails have few recipients (fewer than 50), so my first instinct was to use nested objects. However, once in a while there are those "all hands" emails where the number of recipients can run into the thousands.
So my dilemma is this: my general case seems ideal for nested objects, but my edge case seems ideal for a parent-child relationship. If there are experienced Elasticsearch users out there who have been through this, I would love to know which mapping relationship you used and your reasoning.
To me, it doesn't really make sense to relate things in this manner. Just store each email as a single doc (aka time-based index structures), then analyse.
I am trying to make the data searchable, and I am hoping I can search by recipient name. For example, search for all mail where one of the recipients' names contains "john" and "doe".
Each email is already pre-processed and has its text extracted. The email data that I will be receiving will look like this:
Parent-child is very useful when you have parents with a large number of children and the parent is updated frequently, as it allows just the parent to be reindexed. As you are modelling e-mails, I assume you are not going to update them, and since you want to be able to search on combinations of DisplayName and EmailAddress, I would recommend using a nested structure.
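To make the nested suggestion concrete, here is a rough sketch of the kind of query body it enables, built as a plain Python dict so it can be inspected without a cluster. It assumes the recipients live in a field called Recipients mapped as type nested and that DisplayName is an analyzed text field; the helper name recipient_name_query is made up for illustration.

```python
import json


def recipient_name_query(*terms):
    """Build a search body matching emails where ONE recipient's
    DisplayName contains every given term.

    Assumes a nested field named "Recipients" (an assumption here).
    """
    return {
        "query": {
            "nested": {
                "path": "Recipients",
                "query": {
                    "match": {
                        "Recipients.DisplayName": {
                            # All terms must appear in the same recipient's name.
                            "query": " ".join(terms),
                            "operator": "and",
                        }
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(recipient_name_query("john", "doe"), indent=2))
```

Because the match runs inside a nested query, "john" and "doe" have to belong to the same recipient rather than to two different recipients of the same email.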
Nested objects are great for the general case, but once in a while I get those "all-hands" emails where the recipients are essentially everyone in the company, which runs into the thousands (or even tens of thousands).
Based on my reading, nested objects are limited to 1000 by default, so I am afraid they won't be able to handle this edge case.
The limit you mentioned applies to the number of fields in an index. If you model your documents with an array of recipients as follows, I think you should be able to avoid it:
{
  "Title" : "email1",
  "Subject" : "this is a test email",
  "SenderDisplayName" : "me",
  "SenderEmailAddress" : "me@mycompany.com",
  "Recipients" : [
    {
      "DisplayName" : "Mr. John",
      "EmailAddress" : "mrB@hiscompany.com"
    },
    {
      "DisplayName" : "Mr. Doe",
      "EmailAddress" : "mrC@hiscompany.com"
    }
    ....
  ],
  "Content" : "This is the content of the email"
}
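For a document shaped like the one above, an index definition might look roughly like the following sketch. It is written as a Python helper that only builds the request body, and it assumes Elasticsearch 7 or later (typeless mappings). The field types are guesses on my part (text for names and content, keyword for addresses), and email_index_body is a made-up helper name. As far as I know, recent versions also cap the number of nested objects per document through the index.mapping.nested_objects.limit setting, so the optional argument shows where that could be raised for very large recipient lists; check the documentation for your version before relying on it.

```python
import json


def email_index_body(nested_objects_limit=None):
    """Build a create-index body with Recipients mapped as nested.

    Field types are illustrative assumptions, not taken from the thread.
    """
    body = {
        "mappings": {
            "properties": {
                "Title": {"type": "text"},
                "Subject": {"type": "text"},
                "SenderDisplayName": {"type": "text"},
                "SenderEmailAddress": {"type": "keyword"},
                "Recipients": {
                    # nested keeps each recipient's fields together at query time
                    "type": "nested",
                    "properties": {
                        "DisplayName": {"type": "text"},
                        "EmailAddress": {"type": "keyword"},
                    },
                },
                "Content": {"type": "text"},
            }
        }
    }
    if nested_objects_limit is not None:
        # Assumed setting name; only added when the caller asks for it.
        body["settings"] = {
            "index.mapping.nested_objects.limit": nested_objects_limit
        }
    return body


if __name__ == "__main__":
    print(json.dumps(email_index_body(nested_objects_limit=20000), indent=2))
```

The printed body could then be sent when creating the index; the value 20000 is only an example figure for an "all-hands" mail.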