How to model schema in Elasticsearch when a particular entity contains both metadata and content

Hi, I am indexing article information into Elasticsearch.
Each article has metadata (article title, authors, viewCount) and the article content itself.
The metadata is small, around 20KB. The article content is larger and can range anywhere from 100KB to 2MB. The metadata is updated frequently.

So I want to know how to model this piece of information.

The Elasticsearch docs say "A single document should contain all of the information that is required to decide whether it matches a search request."

But the team has proposed several approaches, each with counter-arguments.

  1. Index both metadata and content in one document, as the docs say.
    The counter-argument: because the metadata changes frequently and the content is huge, every small metadata change means the entire document is updated again. Updating a doc in ES is IO-intensive and hence a costly operation.

  2. Put metadata in one index and content in another. The problem with this approach is that if we search for a term present in both metadata and content (for example, in both title and content), we get two documents back, one from each index, where ideally there should be one. Aggregations don't suit this scenario either.
    We would then have to run one query per index and do an application-side join.

  3. Maintain a parent-child relationship in the same index, with metadata as the parent and content as the child. But with this, will I be able to search across both parent and child and get back only the parent document if both match the query criteria?
    The ES docs say to use parent-child only if a parent has multiple children, and search queries will also be slower.
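To make the application-side join in option 2 concrete, here is a minimal sketch in Python. It assumes both indices use the same `_id` for a given article and that each hit carries `_id` and `_score` as in a normal search response. The `merge_hits` helper and the choice to sum scores are illustrative assumptions only; scores from two different indices are not directly comparable, so a real implementation would need its own ranking rule.

```python
from typing import Dict, List


def merge_hits(metadata_hits: List[dict], content_hits: List[dict]) -> List[dict]:
    """Collapse hits from two indices into one result per article.

    Assumes the same _id identifies an article in both indices.
    Summing the two scores is a naive placeholder, not a recommendation.
    """
    merged: Dict[str, dict] = {}
    for source, hits in (("metadata", metadata_hits), ("content", content_hits)):
        for hit in hits:
            entry = merged.setdefault(
                hit["_id"],
                {"article_id": hit["_id"], "score": 0.0, "matched_in": []},
            )
            entry["score"] += hit["_score"]
            entry["matched_in"].append(source)
    # Highest combined score first
    return sorted(merged.values(), key=lambda e: e["score"], reverse=True)


# Made-up hits standing in for the "hits" arrays of two search responses
metadata_hits = [{"_id": "a1", "_score": 2.0}, {"_id": "a2", "_score": 1.0}]
content_hits = [{"_id": "a1", "_score": 1.5}, {"_id": "a3", "_score": 3.0}]
results = merge_hits(metadata_hits, content_hits)
```

Even with a helper like this, paging and aggregations would still have to be reconciled in the application, which is the main cost of option 2.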

We have not reached a conclusion.

Please suggest how to model a single entity that contains both metadata and content, where the content is huge.

Thanks.

How frequently are you updating your documents? How large a portion of the data set is updated every day?

@Christian_Dahlqvist There are around 3 million article records currently. At most around 2,000 to 3,000 new records are added every day. A particular record is updated very infrequently, maybe a maximum of 10 times during its lifetime.

But we want to add an article view count to the records in the future. In that case each record will be updated once a day.
Can you please tell me which approach to take in each of the two scenarios?
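For a rough sense of scale in the second scenario, the arithmetic below uses only the figures from this thread (3 million articles, about 20KB of metadata, 100KB to 2MB of content). It is a back-of-the-envelope sketch, assuming decimal units and that every document is rewritten in full once a day; the actual disk I/O depends on analysis, compression and segment merging, so these are source-data volumes, not measured numbers.

```python
ARTICLE_COUNT = 3_000_000
METADATA_BYTES = 20 * 1000          # ~20KB of metadata per article
CONTENT_MIN_BYTES = 100 * 1000      # ~100KB of content (small article)
CONTENT_MAX_BYTES = 2 * 1000 * 1000  # ~2MB of content (large article)


def daily_rewrite_bytes(content_bytes: int) -> int:
    """Source bytes re-sent per day if every article is updated once."""
    return ARTICLE_COUNT * (METADATA_BYTES + content_bytes)


low = daily_rewrite_bytes(CONTENT_MIN_BYTES)
high = daily_rewrite_bytes(CONTENT_MAX_BYTES)

print(f"all articles small: {low / 1e12:.2f} TB per day")   # 0.36 TB
print(f"all articles large: {high / 1e12:.2f} TB per day")  # 6.06 TB
```

By comparison, the current pattern of at most 10 updates per record over its lifetime touches only a tiny fraction of that volume on any given day.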

That does not sound like it qualifies as frequent updates, so I think option 1 should be fine. If you start updating all documents every day and need to include these added stats in querying or relevance calculations, it might at that point be worthwhile to store the view counts in a child document instead, as long as this meets your query needs.
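A minimal sketch of what option 1 could look like, written as Python dicts that mirror the JSON request bodies. It assumes Elasticsearch 7 or later (typeless mappings and the `POST /<index>/_update/<id>` endpoint). The index name `articles` and the field name `content` are hypothetical; `title`, `authors` and `viewCount` come from the question. The field types are guesses, not a recommendation.

```python
import json

# Body for PUT /articles (hypothetical index name)
ARTICLE_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "authors": {"type": "keyword"},
            "viewCount": {"type": "long"},
            "content": {"type": "text"},  # assumed field name for the article body
        }
    }
}


def view_count_update(new_count: int) -> dict:
    """Body for POST /articles/_update/<id>: a partial update of one field."""
    return {"doc": {"viewCount": new_count}}


def search_body(term: str) -> dict:
    """Body for GET /articles/_search: one query over title and content."""
    return {"query": {"multi_match": {"query": term, "fields": ["title", "content"]}}}


print(json.dumps(view_count_update(42)))
print(json.dumps(search_body("elasticsearch")))
```

Note that a partial update only reduces what is sent over the wire. Internally Elasticsearch still fetches the existing document, merges the change and reindexes the whole document, which is why update frequency matters for large documents.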

Thank you so much @Christian_Dahlqvist.

So just to paraphrase, we will keep both the metadata and content in a single document. But fields like view count, which are updated daily (or frequently) and are used for querying or sorting, would be kept in a child document.
Right?

But just a related question. What about the ES docs, which say "Use parent-child relationships sparingly, and only when there are many more children than parents."? Does that not hold true here?

@Christian_Dahlqvist ?

That is the typical use case, but it might be useful here as well. Whether it is worth the hassle and potential performance penalty depends on how feasible it is to update the documents if all the data is stored together.
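If the view counts do end up in a child document, the setup might look roughly like the sketch below, again as Python dicts mirroring the JSON bodies. The join field name `relation`, the relation names `article` and `view_stats`, and the field name `content` are all hypothetical. The query follows the pattern the `has_child` documentation describes for sorting parents by a child field (a `function_score` inside `has_child`, then sorting by `_score`); treat it as an untested sketch, not a tuned query.

```python
# Body for PUT /articles: parent "article", child "view_stats" (names assumed)
JOIN_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "content": {"type": "text"},
            "viewCount": {"type": "long"},
            "relation": {"type": "join", "relations": {"article": "view_stats"}},
        }
    }
}


def parent_doc(title: str, content: str) -> dict:
    """Large, rarely updated article document (the parent)."""
    return {"title": title, "content": content, "relation": {"name": "article"}}


def child_doc(parent_id: str, view_count: int) -> dict:
    """Small, frequently updated stats document (the child).

    It must be indexed with routing set to the parent id so that it
    lands on the same shard as its parent.
    """
    return {"viewCount": view_count, "relation": {"name": "view_stats", "parent": parent_id}}


def search_sorted_by_views(term: str) -> dict:
    """Match on the parent's text fields, score parents by the child's viewCount."""
    return {
        "query": {
            "bool": {
                # filter: must match, but does not contribute to the score
                "filter": [{"multi_match": {"query": term, "fields": ["title", "content"]}}],
                "must": [{
                    "has_child": {
                        "type": "view_stats",
                        "score_mode": "max",
                        "query": {
                            "function_score": {
                                "script_score": {"script": "doc['viewCount'].value"}
                            }
                        },
                    }
                }],
            }
        }
    }
```

With this layout the daily update touches only the small child document. One caveat of this particular query is that articles without a `view_stats` child would not be returned at all.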
