Hi, I am indexing article information into Elasticsearch.
Each article has metadata (article title, authors, viewCount) and the article content itself.
The metadata is small, around 20KB. The content is much larger and can range anywhere from 100KB to 2MB. The metadata is updated frequently.
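To make the shape concrete, here is a rough sketch of one article and of a metadata-only partial update body. The `content` field name, the sample values, and the helper `metadata_update_body` are made up for illustration; only title, authors and viewCount come from our real data. The body is what would be sent to the update API (`POST <index>/_update/<id>` in recent Elasticsearch versions).

```python
# Illustrative article document. The "content" field name and all sample
# values are assumptions, not our real schema.
article = {
    "title": "Modeling large documents",
    "authors": ["A. Author", "B. Writer"],
    "viewCount": 1234,                       # changes frequently
    "content": "full article text " * 1000,  # stands in for the 100KB-2MB body
}

METADATA_FIELDS = ("title", "authors", "viewCount")


def metadata_update_body(changes):
    """Build a partial-update request body that touches only metadata fields."""
    unknown = set(changes) - set(METADATA_FIELDS)
    if unknown:
        raise ValueError(f"not metadata fields: {sorted(unknown)}")
    # The update API accepts the changed fields under a "doc" key.
    return {"doc": dict(changes)}


body = metadata_update_body({"viewCount": article["viewCount"] + 1})
```

Even though the request only carries the changed field, Elasticsearch still reindexes the whole document internally on an update, which is the concern raised below.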
So I want to know how to model this data.
The Elasticsearch docs say "A single document should contain all of the information that is required to decide whether it matches a search request."
But the team has come up with several approaches, each with counter-arguments.
- Index both metadata and content in one document, as the docs say.
  The counter-argument: because the metadata changes frequently and the content is huge, every small metadata change means the entire document is updated again and again. Updating a doc in ES is IO intensive and hence a costly operation.
- Put metadata in one index and content in another index.
  The problem with this approach: if we search for a term that is present in both metadata and content (for example, a term that appears in both the title and the content), we get two documents back, one from each index, where ideally there should be one. Aggregations don't suit this scenario either. We would have to run one query per index and do an application-side join.
- Maintain a parent-child relationship in the same index, with metadata as the parent and content as the child.
  With this approach, will I be able to search across both parent and child but get back only the parent document when either or both match the query criteria? The ES docs say to use parent-child only if the parent has multiple children, and the search queries will also be slower.
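For the parent-child approach, this is roughly what I understand the mapping and query would look like, written as plain Python dicts for the request bodies. It is only a sketch under assumptions: the field names `article_join` and `body`, the relation names `metadata` and `content`, and the helper functions are all hypothetical. It relies on the `join` field type and the `has_child` query.

```python
JOIN_FIELD = "article_join"  # hypothetical name for the join field


def index_mapping():
    """Mapping with one join relation: metadata (parent) -> content (child)."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "authors": {"type": "keyword"},
                "viewCount": {"type": "long"},
                "body": {"type": "text"},  # assumed name for the article text
                JOIN_FIELD: {"type": "join", "relations": {"metadata": "content"}},
            }
        }
    }


def parent_doc(title, authors, view_count):
    """Small, frequently updated metadata document."""
    return {
        "title": title,
        "authors": authors,
        "viewCount": view_count,
        JOIN_FIELD: {"name": "metadata"},
    }


def child_doc(parent_id, body):
    """Large content document; it must be indexed with routing=parent_id
    so that it lands on the same shard as its parent."""
    return {"body": body, JOIN_FIELD: {"name": "content", "parent": parent_id}}


def search_body(term):
    """Match the term in the parent's title OR in its child's body.
    has_child returns the parent, and only parents have a title field,
    so each article should come back once, as its metadata document."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": term}},
                    {"has_child": {"type": "content",
                                   "query": {"match": {"body": term}}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }
```

With this layout a viewCount update would only rewrite the small parent document, but every search pays for the `has_child` join, which is the slowness the docs warn about.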
We have not reached a conclusion.
Please suggest how to model a single entity that contains both metadata and content, where the content is huge.
Thanks.