Optimize Elasticsearch Settings for Search Speed

I am in the process of rebuilding my search stack. I was running in Elastic Cloud and am now using dedicated hardware. My new setup consists of the following:
2 Data/Master Nodes with the following configuration
• Processor: Intel Dual Xeon 2X E5-2420 v2 Hex-Core
• OS: CentOS
• RAM: 64GB DDR4 SDRAM
• HD1: 4 x 500GB SSD HW Raid 10 w/ flash-backed cache

1 Dedicated Master Node
• 8 available vCores
• 16GB Dedicated ECC RAM
• 90GB Raid 10 SSD 6Gbps
• 100GB Raid 10 SATA III 6Gbps
• OS: CentOS

1 Coordinator Node
• 8 available vCores
• 16GB Dedicated ECC RAM
• 90GB Raid 10 SSD 6Gbps
• 100GB Raid 10 SATA III 6Gbps
• OS: CentOS

I have installed Elastic Stack 6. My data will consist of about 350 million records, roughly 700 GB before indexing and 1.5 TB after indexing. The total number of documents is ~400 million. I have read some articles in favor of smaller indexes. I can’t break the index down by date, but I can by state. Does it make sense to favor smaller indexes by using state as my delimiter? Most searches will include one or more regions as part of the search. I will have a single replica. I am not concerned about indexing speed, as the data is loaded infrequently; I want settings optimized for fast search.

The majority of the fields being searched are keyword-analyzed fields. I have a couple of fields, such as email, that use a custom analyzer to make only the domain searchable. I initially had 30 shards to store the data. Does that seem like overkill? Currently I access the cluster for search via NEST, sending multiple small queries per second. A common query looks as follows:

    GET /twitter_profile/_search
    {
      "query": {
        "bool": {
          "filter": [
            {
              "bool": {
                "must": [
                  {
                    "query_string": {
                      "query": "zipCode:76092 AND city: grapevine"
                    }
                  }
                ]
              }
            }
          ]
        }
      }
    }

Have you seen "Tune for search speed" in the Elasticsearch Reference [6.0]?

Index size does not matter as much as shard size; see the Elastic Blog post "How many shards should I have in my Elasticsearch cluster?".

If you never filter by state, splitting the data by state won't matter. If you do filter by state, it is worth doing: queries become more efficient (fewer documents to search) and I/O locality improves. Also have a look at the recommendations on index sorting and frequently filtered fields in the "tune for search speed" documentation.
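
As a rough illustration of the per-state idea, the sketch below builds a search path that targets only the indices for the requested states. The `profiles-<state>` naming scheme and both helper names are assumptions made up for this example, not something from the thread; the only Elasticsearch behavior relied on is that a search can target several comma-separated index names or a wildcard.

```python
def state_index(state, prefix="profiles"):
    """Return the (hypothetical) index name for one state, e.g. 'profiles-tx'."""
    # Elasticsearch index names must be lowercase.
    return "{}-{}".format(prefix, state.strip().lower())


def search_path(states, prefix="profiles"):
    """Build a _search path that targets only the indices for the given states."""
    if not states:
        # No state given: fall back to a wildcard over every per-state index.
        return "/{}-*/_search".format(prefix)
    # De-duplicate and sort so the same set of states always yields the same path.
    names = sorted({state_index(s, prefix) for s in states})
    return "/" + ",".join(names) + "/_search"


print(search_path(["TX", "OK"]))
print(search_path([]))
```

A query for Texas and Oklahoma then touches only two indices instead of every shard in the cluster, which is where the "fewer documents to search" benefit would come from.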

I see the zip code is a number. Just make sure to map it as a keyword rather than an integer or a long, since it is unlikely you will run range queries on it. (Again, see the "tune for search speed" documentation.)
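
A minimal sketch of what that could look like, written as Python dictionaries that serialize to the JSON bodies Elasticsearch expects. The field names `zipCode` and `city` come from the query earlier in the thread; mapping `city` as a keyword, the helper name, and the use of `term` filters in place of `query_string` are assumptions for illustration. Note that a `term` filter matches the stored value exactly, so "grapevine" only matches if the value was indexed in that exact form.

```python
import json

# Assumed mapping fragment: identifiers such as zip codes are mapped as
# keyword, since they are matched exactly and not used in range queries.
mapping = {
    "properties": {
        "zipCode": {"type": "keyword"},
        "city": {"type": "keyword"},
    }
}


def term_filter_query(zip_code, city):
    """Build a bool/filter search body with exact-match term clauses."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"zipCode": zip_code}},
                    {"term": {"city": city}},
                ]
            }
        }
    }


# The zip code is passed as a string because the field is a keyword.
body = term_filter_query("76092", "grapevine")
print(json.dumps(body, indent=2))
```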

Thank you for the links, I am reviewing them. Is the keyword type more efficient than, for instance, a boolean mapping? I have fields that hold Yes/No values. Is it more efficient for search to map those as keyword or as boolean? I can make all queries include a state filter if the smaller indexes will increase my search throughput.

As far as searching is concerned, this would make no difference, as keyword and boolean use the same data structure for searching.

My total data size is around 1.5 TB. If I take a maximum shard size of 40 GB, that results in about 39 primary shards. I have 2 data nodes, which means each node would hold approximately 19 primary shards. I know replicas are supposed to help with search throughput, but with only 2 nodes does it make sense to take my dedicated master and make it a Master/Data node? The dedicated master only has 16 GB of RAM and 90 GB of disk space, and it is on a shared server. It is an inferior setup compared to my data nodes on dedicated hardware. Will I lose overall performance by getting rid of the dedicated master and using it as a data node as well? Do you think my 2 data nodes are oversized? Does a client node make sense with a small cluster? Is it better to make it a data node along with the dedicated master?
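
The shard arithmetic in this post can be checked with a few lines. This is only a sketch: it assumes 1.5 TB means 1536 GB (which is what gives 39 rather than 38), and the function names are made up for the example. It also counts replica copies, which the post's per-node figure leaves out.

```python
import math


def primary_shards(total_gb, max_shard_gb):
    """Smallest number of primaries that keeps every shard under the size cap."""
    return math.ceil(total_gb / max_shard_gb)


def shard_copies_per_node(primaries, replicas, data_nodes):
    """Average number of shard copies (primaries plus replicas) per data node."""
    total_copies = primaries * (1 + replicas)
    return math.ceil(total_copies / data_nodes)


primaries = primary_shards(1536, 40)                   # 1536 / 40 = 38.4
per_node = shard_copies_per_node(primaries, 1, 2)      # 78 copies over 2 nodes
print(primaries, per_node)
```

With a single replica and two data nodes, each node ends up holding one copy of every shard, so 39 shard copies per node rather than 19.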

Do you want to optimize for search speed in general, or for search speed on this hardware? I'm asking because each shard is searched in its own thread, and your data nodes will have 12 (number of threads of your processor) x 2 (number of data nodes) = 24 threads available. So any setup that leads to more than 24 primary shards will require shard requests from the same search request to wait on one another while processing. If you really want to optimize for search speed, you should have 4 data nodes for that number of primary shards.
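
To make the reasoning concrete, here is a small sketch of the same calculation. It takes the figures from this reply at face value (12 threads per node, one thread per shard for a given search) and treats a single search in isolation; the function name is made up for the example, and real thread pool sizing in Elasticsearch may differ.

```python
import math


def search_waves(primary_shards, threads_per_node, data_nodes):
    """Return (total threads, rounds needed for one search to cover all shards)."""
    threads = threads_per_node * data_nodes
    # Each shard of the request needs one thread; anything beyond the
    # available threads has to wait for an earlier shard request to finish.
    return threads, math.ceil(primary_shards / threads)


print(search_waves(39, 12, 2))  # two data nodes
print(search_waves(39, 12, 4))  # four data nodes
```

With 39 primaries, two nodes give 24 threads and the search needs two rounds, while four nodes give 48 threads and every shard can be searched at once.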

> Does it make sense to take my dedicated master and make it a Master/Data node?

Elasticsearch works with heterogeneous hardware, but it doesn't try to balance the load according to the capacity of each node, at least not before 6.1, where adaptive replica selection can be enabled. Even then, since you are optimizing for search speed, I would not recommend a setup with heterogeneous hardware.
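
For reference, adaptive replica selection is switched on through a cluster setting. The sketch below only builds the request body as a Python dictionary; it assumes a 6.1+ cluster, and how the body is sent (it would go to the cluster settings endpoint, `PUT /_cluster/settings`) is left out.

```python
import json

# Request body for the cluster settings API, assuming Elasticsearch 6.1 or later.
# "transient" would also work; "persistent" survives a full cluster restart.
ars_settings = {
    "persistent": {
        "cluster.routing.use_adaptive_replica_selection": True
    }
}

print(json.dumps(ars_settings, indent=2))
```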

> Does a client node make sense with a small cluster?

It depends on what kind of requests you plan to run. If you run complex queries, in particular with large values of the size or from parameters or with multiple levels of nesting in aggregations, then it might make sense. Otherwise you should be able to live without a client node without issues.
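
A back-of-the-envelope sketch of why large size or from values put load on the coordinating node: every shard returns its own top from + size hits, and the coordinating node has to merge all of them. The shard count of 39 is the figure discussed earlier in the thread; the function name is made up for the example.

```python
def hits_merged_on_coordinator(shards, from_, size):
    """Number of candidate hits the coordinating node has to sort and merge."""
    # Each shard contributes its own top (from + size) hits.
    return shards * (from_ + size)


print(hits_merged_on_coordinator(39, 0, 10))     # first page of 10 results
print(hits_merged_on_coordinator(39, 9990, 10))  # a page deep into the results
```

The first page merges a few hundred hits; the deep page merges 390,000 for the same ten results, which is the kind of work a dedicated client node would absorb.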
