We currently have an ES 5.3 cluster that uses parent-child mapping, as our data follows this model. Parent-child mapping was the best option because it supports cluster-side joins and gives performance benefits from parent and child documents being stored in the same shard.
Migrating from 5.3 to 6.2 involves breaking changes, as multiple mapping types per index are no longer supported from ES v6.0 onwards.
I'm wondering whether it's worth sticking with the parent-child data model in newer versions of Elasticsearch using the join data type.
Our data has a parent-to-child document count ratio of at most 1:400. As of now, we have nearly 1 million parent docs. Both parent and child types are read-write heavy.
From what I know, there are three possible options:
1. Retain parent-child mapping using the new 'join' data type.
- Cluster-side joins
- Performance benefits
- Can still use has_parent, has_child queries
- Flat mapping schema (gets ugly if the parent/child documents have a lot of properties)
- Difficult to identify the doc type just by looking at the document
2. Create separate indexes for each existing mapping type
- Mapping looks much cleaner
- Easy to identify the type of document just by looking at it
- Can make use of routing to put all related documents (for example, all children of the same parent) in the same shard within the child index.
- Each index can have different shard configuration (flexibility)
- Offers some index-level optimization. The parent-child doc count ratio is around 1:400, so we can configure the parent index with fewer shards to keep the overall number of shards low.
- Requires a common field in both indexes to maintain relationship
- Cannot use has_parent, has_child queries anymore. Most of our queries will need to hit both indexes, so each task will take two queries. We can optimize this a little by denormalizing data, but that means data duplication.
- Requires application-side joins
- Child index can get very big compared to parent index
- Multiple queries to perform the join can increase the overall latency
3. Single index but with a custom field to define type of the document
- No need to use join type
- Can make use of custom routing to put related documents in the same shard
- Mapping looks complicated
- Index can get very big as all the documents will reside in the same index
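To make option 1 concrete, here is a minimal sketch of the request bodies involved, written as plain Python dicts so it runs without a cluster. The field name `relation` and the relation names `parent` and `child` are placeholders I made up, not anything from our real mapping. Note that in 6.x the properties still have to sit under a single mapping type name in the index mapping, and child documents must be indexed with routing set to the parent ID.

```python
import json

# Hypothetical name for the join field; pick whatever fits your schema.
JOIN_FIELD = "relation"


def join_mapping_properties(parent="parent", child="child"):
    """Mapping properties declaring one parent -> child relation.

    In ES 6.x this dict goes under mappings.<type_name>.properties.
    """
    return {JOIN_FIELD: {"type": "join", "relations": {parent: child}}}


def parent_doc(fields, parent="parent"):
    """Body of a parent document: regular fields plus the join field."""
    return {**fields, JOIN_FIELD: {"name": parent}}


def child_doc(fields, parent_id, child="child"):
    """Body of a child document; index it with routing=<parent_id>."""
    return {**fields, JOIN_FIELD: {"name": child, "parent": parent_id}}


def has_child_query(child_query, child="child"):
    """Search body returning parents that have at least one matching child."""
    return {"query": {"has_child": {"type": child, "query": child_query}}}


if __name__ == "__main__":
    print(json.dumps(join_mapping_properties(), indent=2))
    print(json.dumps(child_doc({"status": "open"}, parent_id="42"), indent=2))
    print(json.dumps(has_child_query({"term": {"status": "open"}}), indent=2))
```

This also shows the "flat mapping" downside: parent and child fields all live in the same properties block, and the join field is the only thing that tells the two kinds of document apart.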
I'm leaning towards the second option of having multiple indices. The main downside I see is that application-side joins can be expensive. At the same time, the Elasticsearch documentation says has_parent and has_child queries are also expensive.
The other advantage of parent-child documents, being stored in the same shard, can be achieved to some extent by using routing in the individual indices, which gives data locality within each index.
I would like to know if there are any other performance/scaling aspects I need to consider. Also, can someone comment on the latency impact of each of these approaches? Thanks.
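For comparison, the application-side join in option 2 would look roughly like the sketch below. Everything named here is an assumption for illustration: the index names `children` and `parents`, the common field `parent_id`, and the `search(index, body, routing)` callable, which stands in for whatever client call you actually use. It assumes the child index is routed by parent ID, so a routing value limits the first query to one shard.

```python
def child_search_body(child_query, size=100):
    """First query: matching children, returning only the common field."""
    return {"size": size, "_source": ["parent_id"], "query": child_query}


def parents_by_id_body(parent_ids):
    """Second query: fetch the parents referenced by those children."""
    ids = sorted(parent_ids)
    return {"size": len(ids), "query": {"ids": {"values": ids}}}


def parents_with_matching_children(search, child_query, routing=None, size=100):
    """Two round trips instead of one has_child query.

    `search` is a hypothetical callable (index, body, routing) -> response
    dict shaped like an Elasticsearch search response. Only the first
    `size` child hits are considered, so large result sets would need
    paging or an aggregation on parent_id instead.
    """
    child_resp = search("children", child_search_body(child_query, size), routing)
    parent_ids = {hit["_source"]["parent_id"] for hit in child_resp["hits"]["hits"]}
    if not parent_ids:
        return []  # no matching children, skip the second round trip
    parent_resp = search("parents", parents_by_id_body(parent_ids), None)
    return [hit["_source"] for hit in parent_resp["hits"]["hits"]]
```

The latency cost is visible in the shape of the code: the second query cannot start until the first one has returned, so the two round trips add up rather than overlap.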