Architecture Validation for Production ES Environment

We are migrating our old system to Elasticsearch. Our data centers are distributed across two locations (one primary and one DR). We are considering the following structure in production:

We have existing data of about 630GB (approximately 14 million documents) which we are going to index into ES before putting it into production. Once it's in production, we expect to index about 1,000 documents daily, while reads can be up to 50 requests per second.

I am planning to use the following configuration:

  • Number of primary shards: 50
  • Number of replicas: 1
  • Each machine runs 2 Elasticsearch instances: 1 master node and 1 data node

At regular intervals we will purge documents from ES, so the total size should stay roughly the same.

HA is not a requirement for us, which is why I kept only 1 active master node and 1 master-eligible node (node.master: true), plus 2 data nodes (node.data: true). I will keep the master-eligible node down, and if something happens to the primary one, I plan to bring it up manually.
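The node layout described above could be sketched in elasticsearch.yml roughly as follows (a sketch assuming pre-7.x setting names; the node names and cluster name are illustrative, not from the thread):

```yaml
# elasticsearch.yml -- instance 1 on each machine: master-eligible, holds no data
cluster.name: prod-cluster
node.name: master-1
node.master: true
node.data: false

# elasticsearch.yml -- instance 2 on each machine: data-only node
# cluster.name: prod-cluster
# node.name: data-1
# node.master: false
# node.data: true
```

Each instance needs its own config directory, data path, and ports when two run on the same machine.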

Is this the correct design, depending upon the needs I have?

Assuming the data centres are nearby with low latency and good throughput, I would instead recommend setting up a cluster made up of a single master/data node per data centre (you always want more than one master-eligible node in a cluster). Assuming nothing else is running on the node, give it 50% of available RAM for heap (6GB) and set minimum_master_nodes to 2. This will make both nodes share a copy of the cluster state. If one of the nodes were to go down, you would be able to continue serving reads. Note that Elasticsearch can place primary and replica shards on any of the nodes and reassign them whenever required.
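A minimal sketch of that two-node setup, assuming a pre-7.x version where discovery.zen.minimum_master_nodes applies (the host names and heap values reflect the 12GB machines discussed above):

```yaml
# elasticsearch.yml -- identical role settings on both nodes, one per data centre
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 2   # both master-eligible nodes must be present to elect a master

# jvm.options -- roughly 50% of the machine's 12GB RAM for heap
# -Xms6g
# -Xmx6g
```

With minimum_master_nodes set to 2 on a two-node cluster, losing either node blocks master election (and therefore writes), but reads can still be served from the surviving node's shard copies.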

You mean something like this?

Also, do you think the infrastructure is enough to serve my needs for searching and indexing as per my requirements? Are 50 shards enough for indexing about 1TB of data, or is this overkill?

I have no idea what your data looks like nor how you are going to query it, so I cannot tell you if the shard count or resources available are sufficient. The best way to find out is to benchmark it with realistic data and queries.

See if this helps. I have a similar structure for each document:

"mappings": {
      "doc": {
        "properties": {
          "businessdate3": {
            "type": "date",
            "format": "yyyy-MM-dd"
          },
          "businessmode": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "someidentifire": {
            "type": "long"
          },
          "data": {
            "type": "binary"
          },
          "businesstext": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "businesscode": {
            "type": "text",
            "fielddata": true
          },
          "businessno": {
            "type": "long"
          },
          "businesskey5": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "businesskey4": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "emailaddress": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "businesskey3": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "businesskey2": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "businessdate2": {
            "type": "date",
            "format": "yyyy-MM-dd"
          },
          "mimeType": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "quarter": {
            "type": "long"
          },
          "businessDate1": {
            "type": "date",
            "format": "yyyy-MM-dd"
          },
          "businessKey1": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "type": {
            "type": "text",
            "fielddata": true
          },
          "year": {
            "type": "long"
          }
        }
      }
    }

I can search based on a combination of a few things; the most common search pattern is either by ID, or by year plus either of the business keys. I just wanted to get expert advice before starting to build the prod environment.
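The year-plus-business-key pattern described above could look something like the following query (field names are taken from the mapping posted earlier; the index name and values are illustrative assumptions):

```json
GET /myindex/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "year": 2017 } }
      ],
      "should": [
        { "term": { "businessKey1.keyword": "ABC-123" } },
        { "term": { "businesskey2.keyword": "ABC-123" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```

Using the `.keyword` sub-fields for exact matches, and putting the year in a `filter` clause, keeps these lookups cacheable and avoids scoring overhead.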

Even with mappings it is hard to tell, so your best bet is to benchmark it and see if it meets your requirements regarding query throughput and latency. If we assume your data takes up the same amount of space on disk once indexed (a simplified assumption, as this depends on your data and mappings), each node will hold around 630GB of data assuming 1 replica is used. This seems like a lot of data given the amount of RAM and query rate you have specified. Only a small portion of your indices will be cached, so queries are likely to generate a lot of random disk I/O, which means you will probably need fast storage (local SSDs). I would therefore suspect you may need to add resources to the cluster.
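As a rough back-of-the-envelope check of those numbers (the 630GB data set, 50 shards, 1 replica, 2 data nodes, and 6GB heap all come from the thread; the rest is simple arithmetic):

```python
# Back-of-the-envelope sizing for the proposed two-node cluster.
primary_data_gb = 630      # existing data set, pre-replication
replicas = 1               # one replica per primary shard
data_nodes = 2             # one data node per data centre
primary_shards = 50
heap_gb = 6                # 50% of each 12GB machine

total_data_gb = primary_data_gb * (1 + replicas)   # primaries + replicas
per_node_gb = total_data_gb / data_nodes           # assuming even shard distribution
per_shard_gb = primary_data_gb / primary_shards    # average primary shard size
data_to_heap_ratio = per_node_gb / heap_gb         # how much data each GB of heap must serve

print(f"data per node:   {per_node_gb:.0f} GB")        # 630 GB
print(f"avg shard size:  {per_shard_gb:.1f} GB")       # 12.6 GB
print(f"data:heap ratio: {data_to_heap_ratio:.0f}:1")  # 105:1
```

At ~12.6GB each the shards are a reasonable size, but 630GB of data against 6GB of heap per node is the ratio that motivates the fast-storage caveat above.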

That helps a lot @Christian_Dahlqvist. Thank you for your time.
