Hi Team,
We are trying to use Elasticsearch for an analytics use case where the data volume is quite large. Documents obtained from social media channels are stored in Elasticsearch and queried for both full content and aggregations. For every document we also maintain the author's details. These consist of multiple fields, such as tags for the author, which are updated based on the content the author has posted. These fields can change as new documents arrive, according to our ETL logic. For example, if document ID 1 has Author: Jack with Tags: ["Celeb","Doctor"], processing a new document may add a new tag, giving Tags: ["Celeb","Doctor","Cardiologist"].
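To make the data shape concrete, here is a minimal sketch of a document with embedded author details and the kind of tag merge our ETL performs. The field names (`author`, `name`, `tags`, `content`) and the helper `merge_tags` are hypothetical and only for illustration.

```python
def merge_tags(existing, new_tags):
    """Return the existing tags plus any unseen new tags, preserving order."""
    merged = list(existing)
    for tag in new_tags:
        if tag not in merged:
            merged.append(tag)
    return merged


# Hypothetical document shape with the author details embedded in the post.
doc = {
    "id": 1,
    "content": "text of the social media post",
    "author": {"name": "Jack", "tags": ["Celeb", "Doctor"]},
}

# Processing a later document yields a new tag for the same author.
doc["author"]["tags"] = merge_tags(doc["author"]["tags"], ["Cardiologist"])
```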
We were planning to store these author details in the primary document itself, but since the author fields are not immutable, we would end up updating every document in the index for that author. For example, suppose we have a million documents for Jack in our index, each carrying his tags, and a new document introduces a new tag "xyz" for Jack. All one million documents with Jack as author would then need to be updated with the new tag value. This becomes very problematic as the data volume grows. In the worst case the fields could change with almost every document, if each one introduces a new value for the tag field.
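For reference, this is roughly the request body we would have to send to the `_update_by_query` endpoint each time an author gains a tag. It is only a sketch: the field names `author.name` and `author.tags` are hypothetical, `author.name` is assumed to be mapped as a keyword, and `build_tag_update` is just an illustrative helper that builds the body without sending it.

```python
def build_tag_update(author_name, new_tag):
    """Build an _update_by_query body that appends new_tag to every
    document of the given author (hypothetical field names)."""
    script_source = (
        "if (ctx._source.author.tags.contains(params.tag)) {"
        " ctx.op = 'noop'"  # tag already present: skip rewriting this doc
        " } else {"
        " ctx._source.author.tags.add(params.tag)"
        " }"
    )
    return {
        # Assumes author.name is a keyword field, so term matches exactly.
        "query": {"term": {"author.name": author_name}},
        "script": {
            "lang": "painless",
            "source": script_source,
            "params": {"tag": new_tag},
        },
    }


body = build_tag_update("Jack", "xyz")
```

Even with a single request, every matching document is reindexed internally, so a million documents for Jack still means a million document rewrites per new tag.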
The other approach we looked at was to store author details in a separate index and do application-side joins. The problem with this approach is that the number of authors for a particular tag might be very high, possibly as many as 100k. In that case our UI would have to pass 100k author names to Elasticsearch in a single query, which looks very ugly and is not efficient.
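A rough sketch of the two-step application-side join we have in mind is below. The field names (`tags`, `name`, `author_name`) and both helper functions are hypothetical; the sketch only builds the query bodies and does not talk to a cluster.

```python
def authors_for_tag_query(tag, size=1000):
    """Step 1: find the authors carrying a tag in the (hypothetical)
    authors index. Only the name field is returned."""
    return {
        "size": size,
        "_source": ["name"],
        "query": {"term": {"tags": tag}},
    }


def posts_by_authors_query(author_names):
    """Step 2: fetch posts whose author is one of the names from step 1.
    The terms list grows with the number of matching authors."""
    return {"query": {"terms": {"author_name": list(author_names)}}}


step1 = authors_for_tag_query("Cardiologist")
# Pretend step 1 returned these names; in our case it could be ~100k.
step2 = posts_by_authors_query(["Jack", "Jill"])
```

The second body is the part that worries us: its size is proportional to the number of authors for the tag.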
The last option we could think of was a parent-child relationship: storing each author as a parent and every document by that author as its child. We are not sure about the performance of this approach, as one author might have millions of documents while others have just a few.
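As a sketch of what this could look like, assuming the `join` field type available in Elasticsearch 6.x and later (older versions used a `_parent` mapping instead), the mapping, a child document, and a `has_parent` query might be built as follows. The relation names `author` and `post`, the field name `relation`, and the helper functions are all hypothetical.

```python
# Mapping with a join field: one parent type (author), one child type (post).
mapping = {
    "mappings": {
        "properties": {
            "relation": {"type": "join", "relations": {"author": "post"}},
            "tags": {"type": "keyword"},
        }
    }
}


def child_post(content, author_id):
    """Return (body, routing) for a child post. A child must be indexed
    with routing set to its parent id so both land on the same shard."""
    body = {
        "content": content,
        "relation": {"name": "post", "parent": author_id},
    }
    return body, author_id


def posts_for_author_tag(tag):
    """Find posts whose parent author carries the given tag."""
    return {
        "query": {
            "has_parent": {
                "parent_type": "author",
                "query": {"term": {"tags": tag}},
            }
        }
    }


post_body, routing = child_post("text of the post", "jack")
query = posts_for_author_tag("Cardiologist")
```

With this layout a new tag only touches the single parent document for Jack, but because children are routed by parent id, all of one author's posts sit on one shard, which is where the skewed distribution concerns us.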
Could anyone please help us resolve this problem? Can we go with parent-child? I'm a bit worried about its performance since the data volume is quite high and the data is not evenly distributed among authors: one author might have millions of documents.