Starting a new project (on 7.7) and trying to decide if the join framework (parent/child) makes sense as an approach or not. A simplified model of what I'm building is as follows:
A house (the parent type)
House members (children of a house)
House items (children of a house)
House rooms (children of a house)
etc.
Basically there is one "parent" type record containing multiple lists of child-type records. For argument's sake let's say the parent has 10 children arrays, each containing between 10 and 100 records. The number of children is going to greatly outweigh the number of parents, and each record type will have ~15 properties associated with it.
I feel my choices are:
Model a document containing several "nested" mappings, in which case the parent and all of its children are stored together as one gigantic document.
Or model a document defining exactly one of each record type, and index each record individually, filling in only its specific fields. This leads to many more documents than the first approach, but each will be significantly smaller.
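To make the two options concrete, here is a rough sketch of what each mapping could look like, written as Python dicts in the shape the Elasticsearch REST API expects. All field names (address, members, rooms, etc.) are made up for illustration and stand in for the ~15 real properties per type.

```python
# Option 1: one big document per house, children embedded as nested objects.
nested_mapping = {
    "mappings": {
        "properties": {
            "address": {"type": "keyword"},
            "members": {
                "type": "nested",
                "properties": {
                    "name": {"type": "text"},
                    "age": {"type": "integer"},
                },
            },
            "rooms": {
                "type": "nested",
                "properties": {
                    "name": {"type": "keyword"},
                    "area_sqft": {"type": "float"},
                },
            },
        }
    }
}

# Option 2: one document per record, tied together with a join field.
# One parent relation type ("house") maps to several child types.
join_mapping = {
    "mappings": {
        "properties": {
            "house_relations": {
                "type": "join",
                "relations": {"house": ["member", "item", "room"]},
            },
            # All types share one field pool in the index; each document
            # fills in only the fields relevant to its own type.
            "address": {"type": "keyword"},
            "name": {"type": "text"},
            "age": {"type": "integer"},
        }
    }
}
```

Note that with the join approach every record type lives in a single index, so fields from all types share one mapping namespace, which is something to keep in mind if different child types want conflicting types for the same field name.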
I'm looking for any advice or tips on making the decision. The docs seem to suggest the parent/child (join) approach makes sense here since the children will greatly outweigh the parents, but I'm a little alarmed at the amount of documentation warning about performance issues with this approach, e.g. around has_child and has_parent queries.
The ideal way to model this will depend a lot on how frequently different data types are modified/added/deleted as well as how you need to query the data. Trying to replicate relational concepts directly using parent-child and nested documents is in my opinion rarely the best way. Often you instead need to restructure data in Elasticsearch quite significantly.
I have also often seen people provide very artificial examples that hide a lot of the complexity and relevant constraints of the real use case, which makes it hard to give good recommendations. It would therefore be good if you could provide the following information about the data types in the hierarchy:
Size of data
Cardinality
How often it is expected to be updated/added/removed
Thanks for the feedback! In my real-world scenario we are designing a solution that has to fit customers of various sizes. I'll provide answers below based on what I would expect to see from an average customer.
Size of data - I would expect 2 million of the "master" type records, and as far as children go let's say 10 lists of children w/ each list having 15 records. That's roughly 300 million records total. Each record type will have roughly the same number of fields/properties as well, let's say 15.
Cardinality - Is going to be a pretty even split between high and low cardinality fields, i.e. about half of the fields will be fed by user input vs the other half coming from dropdowns.
Changes - Master records won't be inserting/edited all that often, by far the bulk of data manipulation will be inserting/editing the children records.
Relationships - The primary relationship is that all children are related to the parent. There are a handful (let's say 10) of relationships between the children records as well. I was not planning on modeling that information, though; instead I'd just flatten out/repeat the data as needed.
Querying - Will primarily occur via searches looking for the master record by information from one, or many, of its children records. So I would imagine a handful of "has_child" subqueries based upon the criteria the user supplies. We could be looking at performing hundreds of thousands of these queries per day. Additionally, once the user has located the master record in question we'll then have to look up all the children related to it and show them -- I'm thinking we'd use "has_parent" queries here.
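The query pattern described above could look roughly like the following, assuming the join approach with a parent type "house" and child types "member" and "room" (all field names and ids are illustrative). One detail worth noting: for the second step, once the parent id is already known, a parent_id query is typically cheaper than has_parent, since it avoids the parent-side query entirely.

```python
# Step 1: find parent documents whose children match user-supplied criteria.
# Each criterion against a different child type becomes its own has_child
# clause inside a bool query.
find_houses = {
    "query": {
        "bool": {
            "must": [
                {
                    "has_child": {
                        "type": "member",
                        "query": {"match": {"name": "smith"}},
                    }
                },
                {
                    "has_child": {
                        "type": "room",
                        "query": {"range": {"area_sqft": {"gte": 200}}},
                    }
                },
            ]
        }
    }
}

# Step 2: once a specific house is selected, fetch its children. Since the
# parent id is known at this point, parent_id works instead of has_parent
# (one query per child type).
find_members = {
    "query": {"parent_id": {"type": "member", "id": "house-42"}}
}
```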
Hardware - We would provision 3 nodes w/ 4 CPUs and 8 GB RAM each. HDD size will vary based upon customer.
I am leaning towards the join type approach as it seems it will be more straightforward with regards to adding and editing data, since every single record gets its own corresponding Elasticsearch document. If I were to go the other route, embedding all the children data inside the master record, it appears to be much more difficult to add and edit children records, e.g. having to use Painless scripts to manipulate specific parts of the document.