Does the sizing below fit the mentioned use case?

I have been reading many documents to understand sizing for Elasticsearch...

Data load:

  1. 9k transactions per second.
  2. Each transaction contains around 1 KB of data.
  3. Per second, 9k inserts and 50k search queries will be executed.
  4. Around 600 GB of data per day (see the quick arithmetic check below).
  5. Each document would require indexing.

The response for each transaction should not take more than 15 ms.
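A quick back-of-envelope check on item 4 (my own arithmetic, not a figure from the requirements): a sustained 9k docs/s at ~1 KB each would exceed 600 GB/day of raw data, so the stated daily volume suggests the peak rate is not held around the clock, and replicas and indexing overhead will add to the size on disk.

```python
# Back-of-envelope ingest volume, assuming the 9k/s rate is sustained 24 hours.
docs_per_second = 9_000
doc_size_kb = 1
seconds_per_day = 86_400

gb_per_day = docs_per_second * doc_size_kb * seconds_per_day / 1_000_000
print(f"~{gb_per_day:.0f} GB/day raw")  # ~778 GB/day before replicas and overhead
```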

My proposal -

  1. Client nodes - 2 (I have considered 2 nodes for HA, so that we do not miss any transactions)
  2. Master nodes - 3
  3. Data nodes - 3

Each node runs on its own dedicated VM in Azure.

Is there a recommended configuration for the above use case?

What do you mean by this?

It sounds like indexed data is immutable. How long do you need to keep data in the cluster? How large is the full data set?

How targeted are the queries? What type of queries will you be using? Can you describe the data and mappings?

I mean each record should be indexed as soon as it is inserted into Elasticsearch.

If my understanding is correct... we would need to keep this data on the server forever. There is a chance they might not require the old data - I will get back on this.

When you ask how long we need to keep data in the cluster, do you mean we have to remove old data from the data nodes after some period of time?

Are you planning on using bulk inserts? Does this mean that you want the document searchable immediately after indexing?

Well, if you do not remove old data, you will need infinite storage...
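If old data can eventually be removed, an index lifecycle (ILM) policy is the standard way to bound storage. Below is a minimal sketch using the 8.x Python client; the policy name, rollover thresholds, and 90-day retention are illustrative assumptions, not figures from this thread.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Hypothetical policy: roll the write index over as it grows, then delete
# indices 90 days after rollover. All numbers here are placeholders.
es.ilm.put_lifecycle(
    name="transactions-retention",
    policy={
        "phases": {
            "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}},
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    },
)
```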

We are not planning on bulk inserts. There are multiple types of insertion here:

  1. We have provided an API on our nodes which triggers the Elasticsearch servers.

  2. These APIs are called by third-party customers (multiple customers) - they might upload an Excel sheet in their portal, which triggers our API once for each row.

  3. The same API is available to third-party customers for single transactions. We receive these requests in parallel.

Combining the above two scenarios, we would get around 9k transactions per second.

Yes, correct. I will get back on this... they might keep the data for some period of time and then archive it - I will get back on this shortly.

If you do not use bulk inserts and require each indexed document to be searchable immediately, you will make indexing very inefficient. This basically goes against most recommendations in this guide around optimizing indexing speed. It will result in a lot of small segments being generated and needing merging, which will put a lot of load on the cluster and cause a lot of disk I/O. In itself, indexing 9k documents of 1 KB per second is achievable, but it might require more cluster resources and very fast storage.
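To make the bulk-indexing point concrete, here is a minimal sketch using the 8.x Python client; the endpoint, index name, refresh interval, and batch size are illustrative assumptions, not details from this thread.

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Let segments grow before they become searchable, instead of paying for a
# near-real-time refresh on every single-document insert.
es.indices.put_settings(index="transactions", settings={"refresh_interval": "5s"})

# One bulk request for a batch of ~1 KB documents instead of one HTTP round
# trip (and one tiny segment's worth of work) per document.
docs = ({"_index": "transactions", "_source": {"row": i}} for i in range(5_000))
helpers.bulk(es, docs)
```

The per-row API calls from the Excel-upload scenario could still be buffered on the application side and flushed every few hundred documents or milliseconds; that keeps the per-row API contract while giving Elasticsearch bulk requests.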

You also mention quite a high search rate. To optimize the search rate the cluster can support, you ideally want immutable data that is fully kept in the operating system file cache. Even if there were no indexing going on, you state that you are likely to have a very large amount of data. If this does not fit in the cache, it will generate a lot of disk I/O, which will lead to longer latencies and reduced query throughput.
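If older data does become immutable, one common pattern (an assumption on my part, not something established in this thread) is time-based indices that stop receiving writes, which can then be force-merged so searches touch fewer segments and use the file cache more efficiently.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Hypothetical daily index that no longer receives writes; merging it down
# to a single segment makes it cheaper to search and to cache.
es.indices.forcemerge(index="transactions-2024.01.15", max_num_segments=1)
```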

If you add these two together, you see that the indexing will constantly add new data, which will churn the page cache and make it less efficient. I therefore do not think Elasticsearch is suitable for this use case (unless you rework the requirements), and if you were to try to make it work you would need a lot of hardware.

I will get back to you shortly on whether we need indexing or not.

Here are the inputs you asked for. Is this a valid use case for ES?

Latency must be minimal; each request should be processed within 10 to 20 ms.

That sounds much more reasonable, but you will need fast disks and enough RAM to allow most of the data on disk to be cached. To determine whether you can meet the SLAs, you will need to benchmark using realistic data and operations. It sounds feasible with correct tuning and bulk indexing.
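For the benchmark itself, Elastic's Rally tool is the usual choice, but even a crude probe with realistic queries gives an early signal on the 10-20 ms target. Here is a minimal sketch with the 8.x Python client; the index name and query are placeholders, and it should run against production-like data and hardware.

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Measure client-observed search latency over many iterations and report
# percentiles rather than the average.
latencies_ms = []
for _ in range(1_000):
    start = time.perf_counter()
    es.search(index="transactions", query={"match": {"status": "ok"}}, size=10)
    latencies_ms.append((time.perf_counter() - start) * 1_000)

latencies_ms.sort()
p50 = latencies_ms[len(latencies_ms) // 2]
p99 = latencies_ms[int(len(latencies_ms) * 0.99)]
print(f"p50={p50:.1f} ms  p99={p99:.1f} ms")
```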

Any suggestions on the configuration and number of nodes?

  1. Client nodes - 2 (I have considered 2 nodes for HA, so that we do not miss any transactions)
  2. Master nodes - 3
  3. Data nodes - 3

Is this OK, or do you have better recommendations?

You will need to test and benchmark.
