Advice on cluster configuration


#1

Hi there. I'm sure you've heard this question a million times, so apologies.

I've been reading day and night about indexes, shards, nodes, custom routing, ES metrics to monitor, etc. I'm still a bit stumped on the path forward. I've been testing the default configuration, but I'm looking for some guidance before I go down paths that don't make sense.

Use case: listings search. Think Trulia, Zillow, etc. I'm using ES for the 'reads'. So searching for listings in various locations, with various filters, and doing faceted nav.

Numbers: approx 5 million listings. Only around 250k are 'available' (e.g. on market). The rest get moved to 'sold', 'leased', etc. Index size is about 5 GB.

Expectation: could have around 50 requests per second for searches. Obviously we can put caching in the application layer, but trying to avoid that unless needed.

Current infra: Elastic cloud default configuration for 'high i/o'. 4 GB, 2 nodes across 2 fault zones. 1 master node.

Current index setup: everything in one index ("Listings"). default sharding (5 primary, 1 replica).

Results: avg response time for 5 concurrent users = ~100ms. Was hoping for closer to ~50.
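As a rough sanity check on these numbers, Little's law relates concurrency, latency and throughput for a closed-loop load test. The sketch below assumes the test users fire requests back to back with no think time, which the post does not state, and the function names are purely illustrative.

```python
def throughput_rps(concurrent_users: int, avg_latency_s: float,
                   think_time_s: float = 0.0) -> float:
    """Closed-loop throughput by Little's law: X = N / (R + Z)."""
    return concurrent_users / (avg_latency_s + think_time_s)


def users_in_flight(target_rps: float, avg_latency_s: float,
                    think_time_s: float = 0.0) -> float:
    """Concurrency needed to sustain a target request rate: N = X * (R + Z)."""
    return target_rps * (avg_latency_s + think_time_s)


if __name__ == "__main__":
    # 5 concurrent users at ~100 ms each (assumption: zero think time).
    print(throughput_rps(5, 0.100))
    # Requests in flight at 50 req/s if latency were ~50 ms.
    print(users_in_flight(50, 0.050))
```

Under that assumption, 5 users at ~100 ms already corresponds to roughly the 50 requests per second target, so the later 50-user test is a considerably heavier load than the stated expectation.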

Cool, so now my questions.

Questions & ideas I've got so far:

  • Split into 'available' and 'completed' indexes. That way I can control the growth of 'available'.
  • Use custom routing based on neighborhood. Might be overkill so far?
  • Add more nodes. Not sure how many I need?
  • Tweak primary/replica shards. I still don't understand the calculations. I don't think 5 GB is considered 'big', so maybe I should go to 1 primary shard? But how many replicas?
  • Don't store the full 'listing'. Just store what's being searched upon, then put the data required for the result in a different index. But I've already set these non-searchable fields to index=false, so I'm not sure if splitting will give any additional gains.
  • In ElasticCloud, if I've got 2x nodes across two zones, I assume they are in different data centres. Will this impact latency, compared to having 2x nodes in the same data centre, or is it negligible?

That's a lot of ideas, and lots of things to test. So I'm looking for some general guidance on my use case, before I waste lots of time. Thanks muchly!


(David Pilato) #2

I'd start by using only one primary shard and see where it goes.
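For illustration, a single-primary index could be described with a settings body like the one built below. This is only a sketch: `number_of_shards` and `number_of_replicas` are the standard index settings, while the helper name and the choice of one replica (the default already in use in this thread) are assumptions. The primary shard count cannot be changed on an existing index, so trying this means creating a new index and reindexing into it.

```python
import json


def index_settings(primaries: int = 1, replicas: int = 1) -> dict:
    """Build the body for an index-creation request (hypothetical helper)."""
    if primaries < 1 or replicas < 0:
        raise ValueError("need at least 1 primary and a non-negative replica count")
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,    # fixed once the index is created
                "number_of_replicas": replicas,   # can be changed later
            }
        }
    }


if __name__ == "__main__":
    # Body to send when creating the new index.
    print(json.dumps(index_settings(), indent=2))
```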

Then maybe share some of your typical queries and some sample documents?


#3

Hey David, thanks for the answer.

When I used one primary shard and 2 replica shards, I got a warning saying "unallocated shards". So I was wondering if I need to use a different config for the nodes I have in the cluster.

Typical queries

  • Searching on lots of combinations of fields. If a 'listing' has 80 properties, we are searching on about 20 of those (non-searchable fields aren't indexed), and many at once.
  • So basically, mostly bool/terms queries, without scoring. In the worst case (searching with about 20 properties), that would be a bool query with 20 terms queries.
  • No full text searching (for now)
  • Doing lots of aggregations (faceted navigation, so need to aggregate on all fields in the listing)
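A faceted-navigation request of the kind described in the list above might be assembled as follows. This is a minimal sketch, assuming the faceted fields are mapped as keyword or numeric types; the helper name, the `bucket_limit` parameter and the field choices in the usage example are illustrative rather than taken from the actual application.

```python
import json


def facet_body(filters: dict, facet_fields: list, bucket_limit: int = 20) -> dict:
    """Build a hits-free search body with one terms aggregation per facet."""
    clauses = [{"terms": {field: values}} for field, values in filters.items()]
    aggs = {
        field: {"terms": {"field": field, "size": bucket_limit}}
        for field in facet_fields
    }
    return {
        "size": 0,  # facets only, no documents returned
        "query": {"bool": {"filter": clauses}},  # filter context, so no scoring
        "aggs": aggs,
    }


if __name__ == "__main__":
    body = facet_body(
        {"listingType": ["Rental"], "statusType": ["Available"]},
        ["propertyType", "bedrooms"],
    )
    print(json.dumps(body, indent=2))
```

Setting `size` to 0 keeps the response to aggregation buckets only, which matches the idea of running facet requests separately from the listing search.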

Sample document

    {
      "_index": "listings",
      "_type": "listing",
      "_id": "18269449-d651-43c2-acb8-ebfa1a91decb",
      "_version": 1,
      "_score": null,
      "_source": {
        "id": "15b0dfad-40cd-4c77-94fe-dc400fdd3181",
        "listingId": 248594,
        "liveOn": "2018-12-06T15:55:52.307",
        "location": {
          "isStreetDisplayed": true,
          "streetNumber": "1",
          "street": "One Street",
          "suburb": "Blah",
          "state": "Foo",
          "country": "Australia",
          "postcode": 3000,
          "latLong": {
            "lat": -37.8167845,
            "lon": 144.9542819
          },
          "suburbHref": "/locations/5713470"
        },
        "listingType": "Rental",
        "statusType": "Available",
        "propertyType": "Apartment",
        "title": "City Point: Stunning Ground Level Convenience!",
        "description": "..snip...lots of text",
        "price": 520,
        "displayPrice": "$520",
        "isUnderOffer": false,
        "office": {
          "companyName": "sfsdf",
          "officeName": "sfsdfds",
          "officeHref": "/offices/27520"
        },
        "isHomeLandPackage": false,
        "isNewConstruction": false,
        "uri": "sdfdsfsd",
        "tier": "Basic",
        "buildingDetails": {},
        "landDetails": {
          "area": 0,
          "frontage": 0,
          "depthLeft": 0,
          "depthRight": 0,
          "depthRear": 0
        },
        "agents": [
          {
            "name": "XYZ",
            "order": 1
          }
        ],
        "images": [
          {
            "identifier": "l-MyDesktop-g4HGoVpcikCNTfLBCPVU-Q"
          },
          {
            "identifier": "l-MyDesktop-QnjGUvg8kE2ghAnNvz0y-g"
          },
          {
            "identifier": "l-MyDesktop-gJK8gwsUvEG-yLYW2sBhDg"
          },
          {
            "identifier": "l-MyDesktop-TGtjv7tZu021JiJu0sp24g"
          },
          {
            "identifier": "l-MyDesktop-0pI_hKKCykm2Pz3shCPOaA"
          },
          {
            "identifier": "l-MyDesktop-GeeHT9eUBEudy18ts2pFcw"
          },
          {
            "identifier": "l-MyDesktop-usn4Wk-uN0OrRme2i71XaA"
          },
          {
            "identifier": "l-MyDesktop-XN0-Yxol00qqAXxpPNl48Q"
          },
          {
            "identifier": "l-MyDesktop-6EAJKrdTWkOkVb3K7kX5ow"
          },
          {
            "identifier": "l-MyDesktop-ERGQrRwGL06nzzds7fYmkQ"
          }
        ],
        "floorPlans": [
          {
            "identifier": "fp-MyDesktop-gf0wX5t3P0ay1nUoAFoTiA"
          }
        ],
        "features": [
          "airConditioning",
          "dishwasher",
          "pool"
        ],
        "bathrooms": 1,
        "bedrooms": 2,
        "carSpaces": 1,
        "carports": 0,
        "garages": 0,
        "ensuites": 0,
        "livingAreas": 0,
        "openSpaces": 0,
        "toilets": 0
      },
      "fields": {
        "liveOn": [
          "2018-12-06T15:55:52.307Z"
        ]
      },
      "sort": [
        1544111752307
      ]
    }

(David Pilato) #4

With only 2 data nodes that is expected: one of the two replicas cannot be allocated.

When I said "share some of your typical queries", I meant share the JSON query itself, not only a description.


#5

Hi there,

Can you please elaborate on why it's expected for one replica not to be allocated? I'm very new to Elasticsearch. I'm guessing that 1 primary with 2 replicas means 2 replicas for each primary. So that would mean 3 shards in total, which can't work across 2 nodes? Maybe I should do 1 primary, 1 replica? Or 2 primary, 1 replica? What do you suggest?

Typical queries:

  • Get single listing
{"query":{"bool":{"filter":[{"terms":{"listingId":[248594]}}]}}}
  • Get all listings in a location
{"from":0,"size":10,"sort":[{"boostedOn":{"order":"asc"}}],"query":{"bool":{"filter":[{"terms":{"location.suburb":["Camden"]}},{"terms":{"location.postcode":[2570]}},{"terms":{"location.state":["New South Wales"]}}]}}}

Note: I've thought of potentially combining the suburb, postcode and state fields into one using 'copy_to', then searching upon that, but I'm not sure that would give much of a performance gain (and I also couldn't get 'copy_to' working with the NEST client).

  • Get all 'sale' listings in a location ('sale' is one of 4 types that can be searched upon)
{"from":0,"size":10,"sort":[{"boostedOn":{"order":"asc"}}],"query":{"bool":{"filter":[{"terms":{"location.suburb":["Camden"]}},{"terms":{"location.postcode":[2570]}},{"terms":{"location.state":["New South Wales"]}},{"terms":{"listingType":["Sale"]}}]}}}

So for most cases, we're searching in a location. We're then optionally providing a set of filters upon that (between 1 and 30 different filters). Usually, it's the 'listingType' filter. After that, it's other filters like beds, baths, cars, etc. I'm also doing aggregations using the same filters, to build faceted navigation. But that's a separate query (I keep the normal searches and the aggregation searches separate, for performance).
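The pattern described above (a location plus a variable set of optional filters) could be generated by a small builder like this one. It is a sketch that reproduces the shape of the example queries in this post; the function and parameter names are hypothetical.

```python
import json
from typing import Optional


def search_body(location: dict, extra_filters: Optional[dict] = None,
                page: int = 0, page_size: int = 10) -> dict:
    """Build a filter-only bool query: location clauses first, then extras."""
    clauses = [{"terms": {field: values}} for field, values in location.items()]
    for field, values in (extra_filters or {}).items():
        clauses.append({"terms": {field: values}})
    return {
        "from": page * page_size,
        "size": page_size,
        "sort": [{"boostedOn": {"order": "asc"}}],
        "query": {"bool": {"filter": clauses}},
    }


if __name__ == "__main__":
    camden = {
        "location.suburb": ["Camden"],
        "location.postcode": [2570],
        "location.state": ["New South Wales"],
    }
    # Same shape as the 'sale' listings query shown earlier in this post.
    print(json.dumps(search_body(camden, {"listingType": ["Sale"]})))
```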

Also - i was talking to ElasticCloud regarding nodes, and they said this:

The way our cloud service works is exactly how you have experienced it, meaning you cannot create two 4GB nodes in the same zone.
However if you have 2 zones each may have 1 node with whatever size you like (which basically means you will have two nodes in the cluster with a tiebreaker).
The only difference here is if the nodes are under the same zone or not, and this is by design.

I understand the need for multiple zones for fault tolerance, but why can't I have more than 1 node in a fault zone?


(David Pilato) #6

If you want to allocate 3 copies of the same data, 1 primary + 2 replicas, which is 3 shards, you need to have 3 data nodes. If you have 2 data nodes, one of the replicas won't be allocated.
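The arithmetic behind this can be sketched as below. It relies on the rule that two copies of the same shard are never placed on the same node, and assumes no other allocation constraints (disk watermarks, allocation awareness, filtering) are in play. The function name is illustrative.

```python
def unassigned_copies(primaries: int, replicas: int, data_nodes: int) -> int:
    """Shard copies that cannot be placed when each copy needs its own node."""
    copies_per_shard = 1 + replicas  # the primary plus its replicas
    # Each shard can place at most one copy per data node.
    stranded_per_shard = max(0, copies_per_shard - data_nodes)
    return primaries * stranded_per_shard


if __name__ == "__main__":
    # 1 primary + 2 replicas on 2 data nodes: one replica stays unassigned.
    print(unassigned_copies(primaries=1, replicas=2, data_nodes=2))
    # 1 primary + 1 replica on 2 data nodes: everything is assigned.
    print(unassigned_copies(primaries=1, replicas=1, data_nodes=2))
```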

For a cluster with 2 data nodes, yes, it's better to have only one replica.

About your queries: if you have only one term, use a simple Term query instead of a Terms query. That probably won't change much, though.
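As an illustration of that rewrite, a clause could be simplified on the client side before the request is sent. This is a hedged sketch with a hypothetical helper name; it only touches plain `terms` clauses holding a single field and a single value, and leaves everything else as it is.

```python
def simplify_terms(clause: dict) -> dict:
    """Turn {"terms": {f: [v]}} into {"term": {f: v}}; leave other clauses alone."""
    if set(clause) == {"terms"}:
        body = clause["terms"]
        if len(body) == 1:  # a single field, no extra options such as boost
            (field, values), = body.items()
            if isinstance(values, list) and len(values) == 1:
                return {"term": {field: values[0]}}
    return clause


if __name__ == "__main__":
    filters = [
        {"terms": {"listingId": [248594]}},
        {"terms": {"features": ["pool", "dishwasher"]}},
    ]
    # The first clause becomes a term query; the second is unchanged.
    print([simplify_terms(c) for c in filters])
```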

The question is more "why would you need that?"


#7

The question is more "why would you need that?"

Because when I did a basic load test of 50 concurrent users, CPU usage was above 80%. This could be because the threads can't handle that many requests. At some point, threads/CPU will become the bottleneck, so I would like to be able to scale out to more nodes.

How do people scale out then, if we can't have more than 1 node in a fault zone, and we can only create 2 fault zones? Totally confused by ElasticCloud's configuration here..


(Christian Dahlqvist) #8

If you want to scale out, it generally makes sense to deploy across multiple availability zones, as you get improved resilience and fault tolerance. If, on the other hand, you do not care about high availability, it generally makes more sense to scale up rather than out when you have small nodes, as this reduces overhead and network traffic.

This is generally considered best practise and adhered to in Elastic Cloud.


#9

@Christian_Dahlqvist sure, I understand that, but ElasticCloud doesn't allow me to create more than 2 availability zones anyway?

So even if I wanted 3 nodes across 3 availability zones, I can't do it?


(Christian Dahlqvist) #10

I believe that typically depends on which region you are deploying to. Not all AWS regions have 3+ availability zones.