Help me understand Elasticsearch structure

I'm new to Elasticsearch and I've been reading about it and playing around with it a lot lately, but I still can't understand one thing, and I'm hoping to find the answer here.

From my understanding, Elasticsearch is not meant for handling database relationships; it works much better when all of the data lives in one index.

What I can't figure out is why that is the better approach.

Here is my example:
In a typical relational database, I can have something like this:

Collection 1

{
   name: 'test',
   collection_2_id: '222',
   collection_3_id: '333',
   collection_4_ids: ['441', '442', '443']
}

Collection 2

{
  id: '222',
  name: 'collection 2'
}

Let's say that in Collection 1 we have 10000 documents and 5000 of them share the same collection_2_id relationship. When I change the name in Collection 2 for the document with ID '222', that's one query and it's a done deal.

Now, in Elasticsearch this will not work, and from what I've read the solution is to embed the contents of each related document instead of just its ID.
The index should look something like this:

Collection 1 index

{
  name: 'test',
  Collection_2: { name: 'collection 2' },
  Collection_3: ...
  ...
}

Now this is fine until someone has to change the name of the Collection 2 document (the one that is embedded in 5000 documents). So instead of doing 1 query, you have to do 5000 queries to update every affected document in the index.
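Each of those updates would be something along these lines (just a sketch, assuming a recent Elasticsearch version and the field names from my example; the index name, document id and new name are placeholders):

POST collection_1/_update/<doc_id>
{
  "doc": {
    "Collection_2": { "name": "new collection 2 name" }
  }
}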

To put things into perspective: in my case I have collections with several relationships, plus arrays that reference more than one related document. If I merge all of that data together, every update to the shared data will trigger thousands of requests.

And yes, I need all of that data to be in the index, because the search should filter results based on data from these related collections.
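For example, a search might need something like this (again just a sketch, assuming the default dynamic mapping, which adds a .keyword sub-field for exact matching on text fields):

GET collection_1/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "test" } }
      ],
      "filter": [
        { "term": { "Collection_2.name.keyword": "collection 2" } }
      ]
    }
  }
}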

So my question is how do you guys solve this problem? Is it really efficient to do thousands of queries to update repeating fields in each document?

Should I restructure my database to one big collection with subcollections?

Welcome to our community! :smiley:

We aren't all guys, but the question I would put back is can you handle this in your client logic when it talks to Elasticsearch?

My bad, didn't mean to offend anyone, I had all genders in mind to be honest. I would edit the post to include gals but I can't anymore. :grin:

What exactly do you mean by "handle this in your client logic"?
If you mean filtering most of the data out on the client side instead of in Elasticsearch, then no: there will eventually be too much data and it would make the user experience very bad.
Besides, if that were an option, why use Elasticsearch at all?

The challenge is related to scale. Joins work OK on a single machine, but once we get into the realm of "big data" (hate that phrase) and need to spread data across multiple machines, doing distributed joins raises issues.

I do understand the issue with joins when used with "big data".

But what I can't understand (or find any evidence for online) is how changing thousands of documents (or even more) every time the repeated data is updated is better.

What is the right approach here?

"It depends".

In some cases it may be seen as beneficial to have outdated copies of data. For example, if the author's country is attached to the metadata of each comment in this thread, then when I move to Canada you can still search for forum replies made by someone living in the UK (which was my location when I made this post).
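When the copies do need updating, you also don't have to send thousands of individual requests from the client: a single _update_by_query request can rewrite every affected document on the server side. A sketch, assuming the embedded object also keeps its id so there is something to match on, and the default dynamic mapping (Elasticsearch still updates each document internally, and the copies are briefly stale while the request runs):

POST collection_1/_update_by_query
{
  "query": {
    "term": { "Collection_2.id.keyword": "222" }
  },
  "script": {
    "lang": "painless",
    "source": "ctx._source.Collection_2.name = params.new_name",
    "params": { "new_name": "new collection 2 name" }
  }
}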
