Elasticsearch tuning

We are developing a platform where millions of records are updated or inserted on a daily basis.
For this platform we are using MongoDB as the primary database and Elasticsearch as a secondary database (only for searching).

Database
There are 2 main MongoDB collections that we also need to import into Elasticsearch:

  1. Companies - This collection has 20+ fields.
  2. People - This collection has 15+ fields. Every person belongs to a company via company._id, so there is a one-to-many relation between the Companies and People collections (see the example documents below).
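
For illustration, a minimal sketch of what the two document shapes might look like (all field names here are hypothetical; only the company._id reference comes from the description above):

```python
# Hypothetical MongoDB document shapes -- the real collections have 20+/15+ fields.
company_doc = {
    "_id": "64f1c2aa...",         # ObjectId of the company (placeholder value)
    "name": "Acme Corp",
    "industry": "Manufacturing",
    # ... more company fields
}

person_doc = {
    "_id": "64f1d9bb...",
    "first_name": "Jane",
    "last_name": "Doe",
    "company_id": "64f1c2aa...",  # references Companies._id -> one-to-many
    # ... more person fields
}
```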

Elasticsearch Indexes

There are 2 Elasticsearch indices that we use to filter the data:

  1. Company_index: Here we store all the company data from the Companies collection in MongoDB.
  2. People_index: Here we store the denormalized data; company and user information are combined in this index because we need to apply sorting on all columns, as well as pagination (see the query sketch below).
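
To make the sorting/pagination point concrete, here is a hedged sketch of the kind of query a denormalized index allows: sorting people by a copied company field and paginating with search_after (the index name, field names, and the person_id tiebreaker field are assumptions, not taken from the thread):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Page 1: sort by a denormalized company field plus a unique tiebreaker.
query = {"match": {"company_name": "acme"}}
sort = [{"company_name.keyword": "asc"}, {"person_id": "asc"}]

page1 = es.search(index="people_index", query=query, sort=sort, size=50)
hits = page1["hits"]["hits"]

# Page 2: search_after uses the sort values of the last hit as a cursor,
# which stays cheap where deep from/size pagination gets expensive.
if hits:
    page2 = es.search(index="people_index", query=query, sort=sort, size=50,
                      search_after=hits[-1]["sort"])
```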

We are expecting at least 50 million records in our database.

We wrote our own pipeline to sync the data from MongoDB to Elasticsearch.
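
The pipeline itself isn't shown in the thread; purely for illustration, a minimal sketch of this kind of sync could watch a MongoDB change stream and feed the Elasticsearch bulk helper (the database, index, and field names below are assumptions):

```python
from pymongo import MongoClient
from elasticsearch import Elasticsearch, helpers

mongo = MongoClient("mongodb://localhost:27017")
es = Elasticsearch("http://localhost:9200")
db = mongo["mydb"]

def actions():
    # Watch inserts/updates on People and emit one bulk action per change.
    with db["people"].watch(full_document="updateLookup") as stream:
        for change in stream:
            doc = change.get("fullDocument")
            if doc is None:          # e.g. deletes carry no fullDocument
                continue
            company = db["companies"].find_one({"_id": doc["company_id"]}) or {}
            yield {
                "_op_type": "index",
                "_index": "people_index",
                "_id": str(doc["_id"]),
                "_source": {
                    "person_id": str(doc["_id"]),
                    "first_name": doc.get("first_name"),
                    "last_name": doc.get("last_name"),
                    "company_id": str(doc["company_id"]),
                    # denormalized company fields copied onto the person
                    "company_name": company.get("name"),
                },
            }

# Batch writes instead of indexing one document per request.
for ok, item in helpers.streaming_bulk(es, actions(), chunk_size=500):
    if not ok:
        print("failed:", item)
```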

Questions:

  1. We are seeing JVM memory utilization at 98% and getting circuit_breaking_exception errors, so please help us tune our Elasticsearch cluster (see the diagnostic sketch below).
  2. Is storing the data in a denormalized way in an Elasticsearch index the correct approach when one company can have at most 2 lakh (200,000) records?
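
As a first diagnostic step for question 1 (not a tuning answer in itself), the nodes stats API shows per-node heap usage and which circuit breaker is tripping; a sketch using the Python client, assuming the default breaker names and a local host:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

stats = es.nodes.stats(metric=["jvm", "breaker"])
for node_id, node in stats["nodes"].items():
    print(node["name"], "heap used:", node["jvm"]["mem"]["heap_used_percent"], "%")
    for name, breaker in node["breakers"].items():
        print("  breaker", name,
              "estimated:", breaker["estimated_size"],
              "limit:", breaker["limit_size"],
              "tripped:", breaker["tripped"])
```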

You need to share information about the specs of your cluster so people can try to help you: how many nodes, how many indices, how many shards per index, etc.
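
For example, the _cat APIs give most of that information in one place; a quick sketch with the Python client (the host is an assumption):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Human-readable tables of nodes, indices, and shard distribution.
print(es.cat.nodes(v=True))
print(es.cat.indices(v=True))
print(es.cat.shards(v=True))
```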

Also, if you can share examples of your documents it would help.

Since you are combining the company and user information in one index, why do you need the Company_Index?

I would say it is better to have everything in one index: for every entry in the people index you would also add the company-related fields for that entry. But I understand that this is what you are already doing with the People_Index, right?

We are going to build our own cluster on AWS EC2 instances because Elastic Cloud costs more than our budget allows.
We are expecting at least 200 GB of data, with frequent updates and new records being inserted.

We stored the company data in company_index to avoid so many aggregations, since we also display only company data on a separate page.
So now my questions are:

  1. How many nodes do we need in our custom cluster?
  2. What would be the best configuration for those nodes to handle 200 GB of data?
  3. As you said, it would be good to store all the data in one index. Is that feasible when we are frequently updating the data? Suppose a company has 1 lakh (100,000) users and we update the company info; then we also need to update the company info in all 1 lakh user records (see the sketch below).
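
For question 3, the usual pattern for propagating a company change into denormalized people documents is one _update_by_query per changed company rather than 1 lakh individual updates; a hedged sketch (the index name, field names, and example ID are assumptions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical example: a company was renamed, so rewrite the copied
# company fields on every person document that references it.
es.update_by_query(
    index="people_index",
    query={"term": {"company_id": "64f1c2aa..."}},
    script={
        "source": "ctx._source.company_name = params.name",
        "params": {"name": "Acme Corp (renamed)"},
    },
    conflicts="proceed",        # skip version conflicts from concurrent writes
    wait_for_completion=False,  # run as a background task for large companies
)
```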
