We have a one-to-many use case where a single parent document can have a very large number of child documents. It is similar to a log store where all the events/messages for a specific user session ID are stored together and need to be searchable based on properties of both the session and the log messages.
Our queries would be free-text searches that could match the session, the log events, or both. Sometimes the user would also filter on a specific property of the session or event together with a free-text term.
We are wondering whether the join field type is the way to go.
Are there any real-life examples of a join field being used in production?
Any guidance is appreciated. Thanks in advance.
Using the join field type requires that all documents related by a join field reside in the same shard. Each shard in Elasticsearch is limited to around 2 billion documents, but the practical limit for performance reasons is often lower than that. The requirement that related data reside in the same shard also makes it difficult to use with log-type data, where you often want to use data streams or other types of time-based indices.
Based on the limitations described above, I would suggest not using the join field for this use case.
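For reference, the same-shard requirement shows up in practice as a routing value on every child document. Below is a minimal sketch, in plain Python dicts, of what a join mapping and the routed documents could look like. The join field name "relation", the relation names "session" and "event", and the shape of the action dicts are all assumptions made for illustration; adapt them to whatever client you use.

```python
# Index body with a join field. The field name "relation" and the
# relation names "session" (parent) and "event" (child) are hypothetical.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "relation": {"type": "join", "relations": {"session": "event"}},
        }
    }
}


def session_action(session_id, source):
    """Describe a parent (session) document; it is routed by its own id."""
    doc = dict(source, relation={"name": "session"})
    return {"_id": session_id, "routing": session_id, "_source": doc}


def event_action(event_id, session_id, source):
    """Describe a child (event) document.

    The routing value must be the parent's id so that the child lands
    on the same shard as its session.
    """
    doc = dict(source, relation={"name": "event", "parent": session_id})
    return {"_id": event_id, "routing": session_id, "_source": doc}
```

Because every event of a session shares one routing value, a session and all of its events live in one shard of one index, which is why this model does not fit time-based indices well.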
Thank you Christian. Given those limitations, can we say the join field type is only used in production where the document count in an index is lower than, say, 1 billion?
Back to our use case, I guess the other option would be to denormalize the data and keep the session details with each log event/message document. But then, an update to the session information is sometimes needed, and it would trigger an update to all related documents, which I understand may have a performance impact on searches as the index grows. Any thoughts on how best to model this scenario?
I have primarily seen it used in search use cases. Can't recall ever having seen it used for time-series data. I would expect the practical limit of related documents to be significantly lower than 1 billion as I suspect you will run into performance problems before that point. That does depend on the type and complexity of the data and the queries though.
Denormalizing is the most common approach, and if updates to the parent (session) are rare it can work well, as it allows the use of time-based indices.
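As a rough illustration of the denormalized model, the sketch below builds an event document that carries a copy of its session fields, plus a request body that could be sent to the update by query API when a session changes. The field layout (a "session" object and an "event" object per document) is an assumption, and the term query assumes session.id is mapped as a keyword.

```python
def denormalized_event(session, event):
    """Return one log document that embeds a copy of the session fields."""
    return {"session": dict(session), "event": dict(event)}


def session_update_body(session_id, changes):
    """Build an update-by-query request body for one session.

    Every event document of the session is rewritten, so this is only
    cheap when session updates are rare.
    """
    return {
        "query": {"term": {"session.id": session_id}},
        "script": {
            "lang": "painless",
            # Merge the changed session fields into each matching document.
            "source": "ctx._source.session.putAll(params.changes)",
            "params": {"changes": changes},
        },
    }


def user_query_body(text):
    """Wrap a user-typed Lucene-syntax query; no rewriting is needed
    because session and event fields live in the same document."""
    return {"query": {"query_string": {"query": text}}}
```

With this layout each hit already contains both the session and the event fields, at the cost of repeating the session data in every event.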
How would a query like the one below work when one of the fields in the query is from a parent document and the other is from a child document?
session.id:abcd AND event.type:open
These types of queries would be entered by our users, who know the syntax, and could be more complex, with parentheses, multiple conditions, etc.
We want the result to include some fields from both the parent and the child documents, presented as part of a single parent document.
Something similar in an SQL query would be:
SELECT session.id, session.user, event.date, event.type FROM session, event WHERE session.id = event.id AND session.id = 'abcd' AND event.type = 'open'
As far as I know, simple query string cannot handle parent-child relationships. Instead, you will need to use query DSL with join query clauses. If your users type in a query box, you will most likely need to rewrite the query before sending it to Elasticsearch with this approach.
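To give an idea of what the rewritten query could look like, here is a minimal sketch that hand-translates session.id:abcd AND event.type:open into a has_child query with inner hits, so that each hit is a session with its matching events attached. It does not parse user input; the relation name "event", the helper name, and the assumption that both fields are mapped as keywords are illustrative only.

```python
def sessions_with_matching_events(session_filter, event_filter, event_fields):
    """Build a search body that returns parent (session) documents.

    session_filter applies to the parent, event_filter to its children.
    inner_hits attaches the matching child documents to each parent hit.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    session_filter,
                    {
                        "has_child": {
                            "type": "event",  # hypothetical child relation name
                            "query": event_filter,
                            "inner_hits": {"_source": event_fields},
                        }
                    },
                ]
            }
        }
    }


# Hand-written equivalent of: session.id:abcd AND event.type:open
body = sessions_with_matching_events(
    {"term": {"session.id": "abcd"}},
    {"term": {"event.type": "open"}},
    ["event.date", "event.type"],
)
```

A general solution would need a parser that decides, for each clause of the user's query, whether it belongs to the parent or the child side, which gets harder once parentheses and mixed conditions are involved.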