Maxing out update queue `remote_transport_exception`

I have set up ES to handle the data stream for a basic Tinder-like dating app: profiles and their swipes. My mapping looks something like this:

{
name: text,
dob: date,
...,
swipedOnBy: Array<(profile IDs)>,
liked: Array<(profile IDs)>
}

swipedOnBy is a list of all profile IDs that have swiped on a given profile. It is used to efficiently hide people that you have already swiped on:
...must_not: [{ term: {swipedOnBy: profile.id}}]

liked is a list of all profile IDs that a given profile has liked. It is used to boost those profiles in the search results (so people who have liked you show up at the top).
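
Put together, the search has roughly the following shape. This is a simplified sketch in Python that only builds the request body as a dict. The helper name `build_search_body`, the `match_all` stand-in for the other criteria, and the boost value are illustrative assumptions, not the exact production query.

```python
def build_search_body(profile_id, liked_boost=2.0):
    """Build a search body for the profile doing the searching (profile_id)."""
    return {
        "query": {
            "bool": {
                # Stand-in for the real matching criteria, which are elided here.
                # Without a must/filter clause, at least one "should" clause
                # would have to match, which is not what we want.
                "must": [{"match_all": {}}],
                # Hide profiles this user has already swiped on.
                "must_not": [{"term": {"swipedOnBy": profile_id}}],
                # Boost profiles that have liked this user.
                "should": [
                    {"term": {"liked": {"value": profile_id, "boost": liked_boost}}}
                ],
            }
        }
    }


body = build_search_body(42)
```

Both fields are only ever queried with exact `term` lookups on a single ID, which is why they are stored as flat arrays of profile IDs.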

Our issue is that we have to update two profiles for every swipe (a like or a pass on another profile), and our ES server seems to be maxing out its write queue and starts rejecting update requests (we are seeing remote_transport_exception).

Details:

  • We have ~400k profiles

  • swipedOnBy and liked can be as big as 5-7k ids

  • We currently update the profiles using a Painless script to avoid data loss (let's say we pull a profile, append the new ID to swipedOnBy, then push it back, but another instance has already added a different user in the meantime: our write would overwrite that ID and lose it)

  • Currently running AWS.DATA.HIGHCPU.M5 with 30GB of RAM @ 7 threads with 200 item capacity

  • The update request has a retry count of 10 to avoid version issues (open to better ways to handle them, as this might be a key part of the issue, but without it we get a ton of version conflict failures)

  • Update script looks something like this:

    // Updating the likee
    if (ctx._source.swipedOnBy != null) {
      if (!ctx._source.swipedOnBy.contains(params.swipedOnBy)) {
          ctx._source.swipedOnBy.add(params.swipedOnBy)
      }
    } else {
      ctx._source.swipedOnBy = new int[] {params.swipedOnBy}
    }
    
    // Updating the liker
    if (ctx._source.liked != null) {
      // Append only when the ID is not already present (same check as for swipedOnBy)
      if (!ctx._source.liked.contains(params.liked)) {
          ctx._source.liked.add(params.liked)
      }
    } else {
      ctx._source.liked = new int[] {params.liked}
    }
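
For context, the two update requests generated per swipe look roughly like the sketch below. It only builds the request descriptions in Python and does not call a client. The index name `profiles`, the helper names, and the generalized one-field script are assumptions for illustration (the real script is the one above); `retry_on_conflict` is the retry count mentioned in the list.

```python
INDEX = "profiles"  # hypothetical index name

# Generalized, one-field version of the script above (an assumption, not the
# exact production script): append params.id to params.field if it is missing.
APPEND_UNIQUE = (
    "if (ctx._source[params.field] == null) {"
    " ctx._source[params.field] = [params.id];"
    "} else if (!ctx._source[params.field].contains(params.id)) {"
    " ctx._source[params.field].add(params.id);"
    "}"
)


def update_request(doc_id, field, new_id, retries=10):
    """Describe one scripted update: POST /<index>/_update/<id>."""
    return {
        "method": "POST",
        "path": f"/{INDEX}/_update/{doc_id}",
        # Retry on version conflicts instead of failing the request.
        "params": {"retry_on_conflict": retries},
        "body": {
            "script": {
                "lang": "painless",
                "source": APPEND_UNIQUE,
                "params": {"field": field, "id": new_id},
            }
        },
    }


def swipe_updates(swiper_id, target_id, is_like, retries=10):
    """Return the update requests produced by a single swipe."""
    # The swiped-on profile always records who swiped on it.
    requests = [update_request(target_id, "swipedOnBy", swiper_id, retries)]
    if is_like:
        # The swiping profile records whom it liked.
        requests.append(update_request(swiper_id, "liked", target_id, retries))
    return requests


updates = swipe_updates(swiper_id=1, target_id=2, is_like=True)
```

Every request rewrites the whole document, including the 5-7k ID arrays, so a like costs two full reindexes of large documents plus any conflict retries.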
    

Any helpful thoughts are welcome. I am new to ES, so if there is a better way to go about this I am open to suggestions. We do have a lot of updates, but I would imagine a larger ES instance should be able to handle them, since our Postgres DB currently handles these requests without problems. Purchasing a larger instance and possibly more nodes is the obvious solution, but unfortunately cost is a factor here; if that is the only option, that is good information to know as well. I should also note that we are not maxing out server resources on our current instance (RAM never over 75%, CPU never over 90%, disk storage at 3%), so it would suck to have to double our spend just to handle more requests.