Is it worse to have very large documents/indices with a ton of data, or split that data into smaller documents/indices using a relational paradigm?

annmarie-switzer · January 2, 2022, 3:23am

Say I have a store and each product I sell has a document in Elasticsearch:

{
    id: "abc",
    name: "mittens",
    price: "$10.00"
},
{
    id: "def",
    name: "hat",
    price: "$13.00"
}

My store's website allows users to log in and review products. However, at this store, the reviews shown are on a user-by-user basis rather than aggregated. Meaning, if Harry gives mittens a 4-star review, and Hermione gives them a 2-star review, I want Harry to see 4 stars and Hermione to see 2.

To achieve this, I need to store each user's individual review somewhere. I'm wondering what the most performant and efficient way to store this data is.

Option 1: I can store it on the product document itself, like so:

{
    id: "abc",
    name: "mittens",
    price: "$10.00",
    ratings: [
        { user: "Harry", stars: 4 },
        { user: "Hermione", stars: 2 },
        ...
    ]
}

This makes retrieving and displaying the data really easy, but I wonder how reasonable this model is if I had users in the thousands.

Option 2: use separate indices in a relational-like model. The product index would have documents with product data only, no ratings. The second index would be just for users' ratings and would have documents like so:

{
    user: "Harry",
    ratings: [
        { id: "abc", stars: 4 },
        ...
    ]
},
{
    user: "Hermione",
    ratings: [
        { id: "abc", stars: 2 },
        { id: "def", stars: 3 },
        ...
    ]
}

This option keeps the individual documents in each index relatively small, but it complicates querying and displaying data. I'm also concerned that using this relational paradigm is "wrong" for something like Elasticsearch.
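To make the "complicates querying" point concrete, here is a minimal sketch of the application-side join that Option 2 implies. It uses plain in-memory dictionaries as stand-ins for the two indices, so no Elasticsearch calls are shown; the function name `products_with_ratings` and the sample data layout are assumptions for illustration only.

```python
# Stand-ins for the two indices in Option 2 (sample data from the post above).
products = {
    "abc": {"id": "abc", "name": "mittens", "price": "$10.00"},
    "def": {"id": "def", "name": "hat", "price": "$13.00"},
}
user_ratings = {
    "Harry": {"user": "Harry", "ratings": [{"id": "abc", "stars": 4}]},
    "Hermione": {
        "user": "Hermione",
        "ratings": [{"id": "abc", "stars": 2}, {"id": "def", "stars": 3}],
    },
}


def products_with_ratings(user, product_ids):
    """Hypothetical helper: merge product data with one user's ratings.

    Two lookups are needed (one per index), and the merge happens in
    application code because the data lives in separate documents.
    """
    rating_doc = user_ratings.get(user, {"ratings": []})
    stars_by_product = {r["id"]: r["stars"] for r in rating_doc["ratings"]}
    return [
        # stars is None when the user has not rated the product
        {**products[pid], "stars": stars_by_product.get(pid)}
        for pid in product_ids
    ]


if __name__ == "__main__":
    for row in products_with_ratings("Harry", ["abc", "def"]):
        print(row)
```

With a real cluster, each dictionary lookup would become a separate request, which is the extra round trip I'm worried about.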

So at a high level, I'm wondering which option is the most performant and scalable. Or if there is perhaps an even better option that I haven't thought of. Thanks!

Tomo_M · January 3, 2022, 3:59pm

It depends on scale, review frequency, and performance requirements.
If the number of users and products is in the low thousands, any of these options may be acceptable.

Both option 1 and option 2 have the same update problem when scaling. To update a document, Elasticsearch internally deletes the old document and indexes a complete new one, so every new or changed rating rewrites the entire document that holds the ratings array.
One possible alternative is similar to option 2, but indexes each review as its own document and uses an aggregation at query time.
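As a rough sketch of the one-document-per-review idea, the following builds the document and request bodies as plain Python dictionaries. The field names (`user`, `product_id`, `stars`), the ID scheme, and the helper names are assumptions, not something from the original posts; the bodies use the standard `bool`/`term`/`terms` queries and the `terms`/`avg` aggregations, and assume `user` and `product_id` are mapped as `keyword` fields.

```python
import json


def review_doc_id(user, product_id):
    """Hypothetical ID scheme: one document per (user, product) pair.

    Re-rating a product then replaces only this small document instead of
    a large product or user document. Assumes ':' never appears in names.
    """
    return f"{user}:{product_id}"


def review_doc(user, product_id, stars):
    """One review as its own small document (field names are assumptions)."""
    return {"user": user, "product_id": product_id, "stars": stars}


def user_ratings_query(user, product_ids):
    """Search body: one user's ratings for the products on the current page."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"user": user}},
                    {"terms": {"product_id": product_ids}},
                ]
            }
        }
    }


def average_stars_query():
    """Search body: average stars per product, if an aggregate is ever needed."""
    return {
        "size": 0,  # only the aggregation result is wanted, not the hits
        "aggs": {
            "by_product": {
                "terms": {"field": "product_id"},
                "aggs": {"avg_stars": {"avg": {"field": "stars"}}},
            }
        },
    }


if __name__ == "__main__":
    print(review_doc_id("Harry", "abc"), json.dumps(review_doc("Harry", "abc", 4)))
    print(json.dumps(user_ratings_query("Harry", ["abc", "def"]), indent=2))
    print(json.dumps(average_stars_query(), indent=2))
```

These bodies would be sent to a separate reviews index with whatever client is in use. The per-user view from the question needs only the filter query; the aggregation is there in case a per-product summary is wanted later.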

