How long it should take to search 10Million docs?


(梁琦) #1

Question:
I'm trying to search 10 docs from one index (6 shard) which has 2million docs, but it takes about 5 sec. That's far too long for us. My expectation is about 100ms.

Would you please provide some guidance on tuning the search performance?

Any help is much appreciated.

Background:

Cluster: 4Core, 8GB RAM Server x 3. The cluster is nearly empty, we only has 2 index, each has 2million docs, size is about 4GB.

Index schema: One patient table with several short string fields, e.g. Name, PrivateId, Phone, MemberShipCardNumber, Mobile. We need partial match, so use N-gram tokenizer for these fields. The docs come from 7,000 tenants, and the documents are very unbalanced, many of tenants has few docs, and some big tenant has about 400,000 docs. The total docs number would be 10million. Schema are as below:

{
"state": "open",
"settings": {
"index": {
"refresh_interval": "-1",
"number_of_shards": "6",
"provided_name": "t-rds6933o4p20a058709c-patientdocument",
"creation_date": "1500456806347",
"analysis": {
"filter": {
"ngramt320": {
"type": "ngram",
"min_gram": "3",
"max_gram": "20"
},
"edgengramt110": {
"type": "edge_ngram",
"max_gram": "20"
},
"ngramt110": {
"type": "ngram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"edgengrams110": {
"filter": [
"lowercase"
,
"edgengramt110"
],
"type": "custom",
"tokenizer": "keyword"
},
"ngrams110": {
"filter": [
"lowercase"
,
"ngramt110"
],
"type": "custom",
"tokenizer": "keyword"
},
"ngrams320": {
"filter": [
"lowercase"
,
"ngramt320"
],
"type": "custom",
"tokenizer": "keyword"
}
}
},
"number_of_replicas": "1",
"uuid": "ssOWtdaMTq-aQLABpgts5Q",
"version": {
"created": "5040099"
}
}
},
"mappings": {
"patientdocument": {
"properties": {
"marketCenterAccount": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"notes": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"occupation": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"isInactive": {
"type": "boolean"
},
"docId": {
"type": "keyword"
},
"isArchived": {
"type": "boolean"
},
"medicalAlert": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"membershipId": {
"type": "integer"
},
"patientType": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"point": {
"type": "integer"
},
"lastDoctorId": {
"type": "integer"
},
"pictureId": {
"type": "integer"
},
"doctorId": {
"type": "integer"
},
"officeId": {
"type": "float"
},
"id": {
"type": "float"
},
"tag": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"firstVisit": {
"type": "date"
},
"email": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"homeAddress": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"qq": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"dutyDoctorId": {
"type": "integer"
},
"nickName": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"sex": {
"index": false,
"type": "float"
},
"mobile": {
"analyzer": "ngrams320",
"type": "text"
},
"birth": {
"type": "date"
},
"lastVisit": {
"type": "date"
},
"identityCard": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"otherPrivateId": {
"analyzer": "ngrams320",
"type": "text"
},
"phoneNumber": {
"analyzer": "ngrams320",
"type": "text"
},
"privateId": {
"analyzer": "ngrams320",
"type": "text"
},
"weixin": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"nameCode": {
"analyzer": "edgengrams110",
"type": "text"
},
"sourceType": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"aliMarketCenterAccount": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"name": {
"analyzer": "ngrams110",
"type": "text"
},
"tenantId": {
"type": "keyword"
},
"cardNumber": {
"type": "text"
}
}
}
}

Query: I'm using .net NEST library to create query, i'm trying to use one keyword to match several fields, in the same time, the result must come from the particular Tenant and Office.

{
  "from": 0,
  "size": 100,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "tenantId": {
              "value": "e8a5b2d4-4ff9-4bad-b0fe-dbfca2d95deb"
            }
          }
        },
        {
          "term": {
            "isInactive": {
              "value": 0
            }
          }
        }
      ],
      "should": [
        {
          "term": {
            "nameCode": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "privateId": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "otherPrivateId": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "cardNumber": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "mobile": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "phoneNumber": {
              "value": "135"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

What I tried: I was trying to create 1 index per Tenant, however, it will create too much shards. I'm heard that maintaining such amount of shards may need large scale cluster, even some of shards are very very small. I did some test, and did feel the same.
Yes, we were using MySQL database before, and encountered performance issue when doing lots of LIKE %keyword% query, then we are seeking help from ElasticSearch.

profiler result
The profile result of query above:

ype Self Time Total Time % Time
BooleanQuery (+tenantId:e8a5b2d4-4ff9-4bad-b0fe-dbfca... 1.6ms 7.0s
100.00%
TermQuery tenantId:e8a5b2d4-4ff9-4bad-b0fe-dbfca2d... 2.0s 2.0s
28.68%
TermQuery mobile:135 1.5s 1.5s
21.34%
TermQuery isInactive:F 994.7ms 994.7ms
14.19%
TermQuery nameCode:135 895.3ms 895.3ms
12.77%
TermQuery cardNumber:135 600.7ms 600.7ms
8.57%
TermQuery privateId:135 523.5ms 523.5ms
7.47%
TermQuery phoneNumber:135 292.2ms 292.2ms
4.17%
TermQuery otherPrivateId:135 98.8ms 98.8ms
1.41%
TermQuery name:135 97.3ms 97.3ms
1.39%


(梁琦) #2

I moved some field termquery from MUST to Filter, and use TenantId as routing parameter. after that, i was able to query on 10Millions doc index with a reasonable time. (<1s)

found some useful documents


https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-filter-context.html


(Mark Walkom) #3

You got the right solution, filters are always faster than queries.


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.