How do I speed up this query?

#1

I am trying to speed up a query and not sure if I am going about it the right way (being a newcomer to ES).

I am trying to find an address in an index (that contains 200 million records) by it's constituent parts. And I am looking for specific values. So for instance, I have street number (1118), street name (Tower), street type (Rd) and zip code (90210). And I am trying to find the property with that address. So my query:

POST properties/_search
{
  "from": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "streetNumber": {
              "query": "1118"
            }
          }
        },
        {
          "match": {
            "streetName": {
              "query": "Tower"
            }
          }
        },
        {
          "match": {
            "zip": {
              "query": "90210"
            }
          }
        },
        {
          "match": {
            "streetSuffix": {
              "query": "Rd"
            }
          }
        }
      ]
    }
  },
  "size": 10
}

It finds the record fine, but it takes roughly half a second.

Schema for fields I am searching is defined as follows:

"streetName": {
"type": "string",
"analyzer": "citysearch_analyzer"
},
"streetNumber": {
"type": "string",
"analyzer": "keyword_analyzer"
},
"streetSuffix": {
"type": "string",
"analyzer": "keyword_analyzer"
},
"zip": {
"type": "string",
"index": "not_analyzed",
"doc_values": true
}

Can this query be improved?

(David Pilato) #2

What is the output of:

GET /_cat/health?v
GET /_cat/indices?v
GET /_cat/shards?v

Did you try to profile the query?
Could you share the response?
Does it still "slow" after some runs?

#3

The cluster is healthy. The indices are green. And the shards started.

Your question reminded me that I omitted an important piece of information. I can't profile it because we are on an ancient version of Elastic (1.7.6), which doesn't seem to have a profiler from what I can tell. So the only hope here is that I screwed up the query somehow

Posted here.

Running the same query is inconsistent. Majority of responses will be around 400-800 ms. Then some will be like 6 to 20 ms. There is probably caching at play here that rapidly expires. But it doesn't matter - I don't run the same query over and over. The address lookups are different every time.

(David Pilato) #4

That was not my question. I was asking for the output of those commands.

A lot, like A LOT, of improvements have been done over the past years. You should really upgrade to 7.0. Specifically if you don't need the exact total number of hits you can benefit from many improvements. Also a newer JVM can help.

It does matter a lot even if you are searching for other terms.

BTW what kind of hardware do you have? Ssd drives?

#5

GET /_cat/health?v
GET /_cat/indices?v
GET /_cat/shards?v

Upgrade to 6.x is ongoing, but very slowly. Meanwhile I got to deal with what I have.

I am not sure. I'll have to follow up on Monday.

Here is the index definition.

(David Pilato) #6

I don't see an index named properties. Is it an alias?

#7

Yes. Aliased to properties_web04_20180118.

(David Pilato) #8

Did you get the answer?

#9

3 boxes. Each one has 252GB of RAM. And each one has 2 CPUs. Which are 8 cores each and then each core is hyperthreaded giving it poor man's 32 cores per box.

Each box has 252GB of RAM, but the usage is around 30-40GB. The hard drives are all SSD.

(David Pilato) #10

As you have plenty of resources on your machines, would it be possible to start a 7.0.1 cluster on the same hardware, use reindex from remote to read from the 1.7 cluster?

Also maybe running hot threads API while a "long" query is running could tell something?