Scalability of Nested Objects & Good Practice

mvkfg · June 17, 2024, 11:31pm

Hello,
I am working in an Elasticsearch environment that I use to monitor and search logs from a game server. This includes a vote-based reputation/punishment system, used for example to kick a cheater from the game.

Example document:

{
  "id": "example-doc-id",
  "user_id": 7656xxxxxxxxxxx74,
  "display_name": "mr_cheater",
  "user_bio": "I am a user with a bad reputation.",
  "votes": {
    "positive": [
      7656xxxxxxxxxxx59,
      7656xxxxxxxxxxx71
    ],
    "negative": [
      7656xxxxxxxxxxx12,
      7656xxxxxxxxxxx63,
      7656xxxxxxxxxxx00,
      7656xxxxxxxxxxx84
    ]
  }
}

To clarify, each vote option holds an array of player IDs used by the game engine. A vote could have more options than just "positive" and "negative", which is why the options are grouped under a single "votes" object instead of being separate top-level fields. My concern is this: a theoretically unlimited number of participants can vote for a given option. An average Joe who plays casually (the majority) may have one or two ratings on their profile, whereas competitive players will have many more.

I am aware that large documents will gradually slow Elasticsearch down, and that the default HTTP request size limit (http.max_content_length) is 100MB, so I'm wondering how many votes a document could have before there is any noticeable effect on query speed. A few hundred? A few thousand? Or would we not see much difference until we're into the realm of megabytes and beyond?

For reference, our cluster has about 12GB of RAM currently, which we are able to scale up. I'd like to get some opinions on whether this is good/bad practice before implementing it in production.

Thank you in advance for your assistance,
Matthias

Christian_Dahlqvist · June 18, 2024, 5:04am

I think it is impossible for anyone to tell with any accuracy, as it will depend on query and update patterns, so I would recommend setting up a test/benchmark to try it out with your data, cluster and query/update patterns. As documents grow, they take more effort to retrieve but also to update.
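As a starting point for such a benchmark, here is a minimal sketch that generates synthetic profile documents with a chosen number of votes and reports the size of the serialized JSON source. The helper names (make_doc, source_size_bytes), the placeholder field values and the random 17-digit player IDs are assumptions made for illustration; only the document layout comes from the example in the question.

```python
import json
import random


def make_doc(n_votes: int, seed: int = 0) -> dict:
    """Build a synthetic profile with n_votes split across two options."""
    rng = random.Random(seed)

    def player_id() -> int:
        # Assumption: player IDs are 17-digit integers, as in the example.
        return rng.randrange(10**16, 10**17)

    half = n_votes // 2
    return {
        "id": f"bench-doc-{n_votes}",
        "user_id": player_id(),
        "display_name": "bench_user",
        "user_bio": "Synthetic benchmark profile.",
        "votes": {
            "positive": [player_id() for _ in range(half)],
            "negative": [player_id() for _ in range(n_votes - half)],
        },
    }


def source_size_bytes(doc: dict) -> int:
    """Size of the compact JSON body that would be sent to Elasticsearch."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, "votes ->", source_size_bytes(make_doc(n)), "bytes")
```

With 17-digit IDs, each vote adds about 18 bytes of JSON (17 digits plus a comma), so under these assumptions a profile would need on the order of 100,000 votes before its source approaches 2MB. This only measures the request body; the actual cost of indexing, updating and querying such documents still has to be measured against the real cluster.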

If you are not querying on user data together with user voting data, it might make sense to store each vote as its own document in a separate index, possibly using a deterministic document ID to guarantee uniqueness. This might be a good approach if you primarily want to aggregate across the votes.
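A minimal sketch of that layout, assuming one document per vote and a document ID built from the target and voter IDs, so that a repeated vote by the same voter overwrites the earlier one instead of creating a duplicate. The index name "reputation-votes", the field names target_id, voter_id and rating, and both helper functions are hypothetical. The dictionaries use the _index/_id/_source action shape accepted by the bulk helper of the official Python client, but no client call is made here.

```python
from typing import Iterator

VOTES_INDEX = "reputation-votes"  # hypothetical index name


def vote_action(target_id: int, voter_id: int, rating: str) -> dict:
    """One bulk action per vote; the _id encodes (target, voter)."""
    return {
        "_index": VOTES_INDEX,
        # Same target and voter always map to the same document ID,
        # so re-indexing replaces the previous vote.
        "_id": f"{target_id}:{voter_id}",
        "_source": {
            "target_id": target_id,
            "voter_id": voter_id,
            "rating": rating,
        },
    }


def actions_from_profile(doc: dict) -> Iterator[dict]:
    """Flatten a profile's votes object into per-vote bulk actions."""
    for rating, voters in doc["votes"].items():
        for voter_id in voters:
            yield vote_action(doc["user_id"], voter_id, rating)


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    profile = {
        "user_id": 1001,
        "votes": {"positive": [2001, 2002], "negative": [2003]},
    }
    for action in actions_from_profile(profile):
        print(action["_id"], action["_source"]["rating"])
```

Because each vote is a small, fixed-size document, a heavily rated player adds more documents rather than one ever-growing document, and counts per rating can be obtained with an aggregation on the rating field. The trade-off is that reading a profile together with its votes then takes two queries.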

mvkfg · June 18, 2024, 1:13pm

Hello,
Thank you for your advice. The system we have in place primarily queries user data and voting data together, as account data is handled by a different system entirely.

However, I will consider placing the reputation votes in an index of their own if our benchmarks show poor performance or if our team needs voter aggregation.

Thanks,
Matthias
