How long it should take to search 10Million docs?

mogliang · July 19, 2017, 1:32pm

Question:
I'm trying to search 10 docs from one index (6 shard) which has 2million docs, but it takes about 5 sec. That's far too long for us. My expectation is about 100ms.

Would you please provide some guidance on tuning the search performance?

Any help is much appreciated.

Background:

Cluster: 4Core, 8GB RAM Server x 3. The cluster is nearly empty, we only has 2 index, each has 2million docs, size is about 4GB.

Index schema: One patient table with several short string fields, e.g. Name, PrivateId, Phone, MemberShipCardNumber, Mobile. We need partial match, so use N-gram tokenizer for these fields. The docs come from 7,000 tenants, and the documents are very unbalanced, many of tenants has few docs, and some big tenant has about 400,000 docs. The total docs number would be 10million. Schema are as below:

{
"state": "open",
"settings": {
"index": {
"refresh_interval": "-1",
"number_of_shards": "6",
"provided_name": "t-rds6933o4p20a058709c-patientdocument",
"creation_date": "1500456806347",
"analysis": {
"filter": {
"ngramt320": {
"type": "ngram",
"min_gram": "3",
"max_gram": "20"
},
"edgengramt110": {
"type": "edge_ngram",
"max_gram": "20"
},
"ngramt110": {
"type": "ngram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"edgengrams110": {
"filter": [
"lowercase"
,
"edgengramt110"
],
"type": "custom",
"tokenizer": "keyword"
},
"ngrams110": {
"filter": [
"lowercase"
,
"ngramt110"
],
"type": "custom",
"tokenizer": "keyword"
},
"ngrams320": {
"filter": [
"lowercase"
,
"ngramt320"
],
"type": "custom",
"tokenizer": "keyword"
}
}
},
"number_of_replicas": "1",
"uuid": "ssOWtdaMTq-aQLABpgts5Q",
"version": {
"created": "5040099"
}
}
},
"mappings": {
"patientdocument": {
"properties": {
"marketCenterAccount": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"notes": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"occupation": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"isInactive": {
"type": "boolean"
},
"docId": {
"type": "keyword"
},
"isArchived": {
"type": "boolean"
},
"medicalAlert": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"membershipId": {
"type": "integer"
},
"patientType": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"point": {
"type": "integer"
},
"lastDoctorId": {
"type": "integer"
},
"pictureId": {
"type": "integer"
},
"doctorId": {
"type": "integer"
},
"officeId": {
"type": "float"
},
"id": {
"type": "float"
},
"tag": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"firstVisit": {
"type": "date"
},
"email": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"homeAddress": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"qq": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"dutyDoctorId": {
"type": "integer"
},
"nickName": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"sex": {
"index": false,
"type": "float"
},
"mobile": {
"analyzer": "ngrams320",
"type": "text"
},
"birth": {
"type": "date"
},
"lastVisit": {
"type": "date"
},
"identityCard": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"otherPrivateId": {
"analyzer": "ngrams320",
"type": "text"
},
"phoneNumber": {
"analyzer": "ngrams320",
"type": "text"
},
"privateId": {
"analyzer": "ngrams320",
"type": "text"
},
"weixin": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"nameCode": {
"analyzer": "edgengrams110",
"type": "text"
},
"sourceType": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"aliMarketCenterAccount": {
"type": "text",
"fields": {
"keyword": {
"ignore_above": 256,
"type": "keyword"
}
}
},
"name": {
"analyzer": "ngrams110",
"type": "text"
},
"tenantId": {
"type": "keyword"
},
"cardNumber": {
"type": "text"
}
}
}
}

Query: I'm using .net NEST library to create query, i'm trying to use one keyword to match several fields, in the same time, the result must come from the particular Tenant and Office.

{
  "from": 0,
  "size": 100,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "tenantId": {
              "value": "e8a5b2d4-4ff9-4bad-b0fe-dbfca2d95deb"
            }
          }
        },
        {
          "term": {
            "isInactive": {
              "value": 0
            }
          }
        }
      ],
      "should": [
        {
          "term": {
            "nameCode": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "name": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "privateId": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "otherPrivateId": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "cardNumber": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "mobile": {
              "value": "135"
            }
          }
        },
        {
          "term": {
            "phoneNumber": {
              "value": "135"
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}

What I tried: I was trying to create 1 index per Tenant, however, it will create too much shards. I'm heard that maintaining such amount of shards may need large scale cluster, even some of shards are very very small. I did some test, and did feel the same.
Yes, we were using MySQL database before, and encountered performance issue when doing lots of LIKE %keyword% query, then we are seeking help from ElasticSearch.

profiler result
The profile result of query above:

ype Self Time Total Time % Time
BooleanQuery (+tenantId:e8a5b2d4-4ff9-4bad-b0fe-dbfca... 1.6ms 7.0s
100.00%
TermQuery tenantId:e8a5b2d4-4ff9-4bad-b0fe-dbfca2d... 2.0s 2.0s
28.68%
TermQuery mobile:135 1.5s 1.5s
21.34%
TermQuery isInactive:F 994.7ms 994.7ms
14.19%
TermQuery nameCode:135 895.3ms 895.3ms
12.77%
TermQuery cardNumber:135 600.7ms 600.7ms
8.57%
TermQuery privateId:135 523.5ms 523.5ms
7.47%
TermQuery phoneNumber:135 292.2ms 292.2ms
4.17%
TermQuery otherPrivateId:135 98.8ms 98.8ms
1.41%
TermQuery name:135 97.3ms 97.3ms
1.39%

mogliang · July 21, 2017, 12:41am

I moved some field termquery from MUST to Filter, and use TenantId as routing parameter. after that, i was able to query on 10Millions doc index with a reasonable time. (<1s)

found some useful documents

https://www.elastic.co/guide/en/elasticsearch/reference/5.5/query-filter-context.html

warkolm · July 21, 2017, 4:13am

You got the right solution, filters are always faster than queries.

system · August 18, 2017, 4:14am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Need help in improving search speed (about 20s) (elastic 6.3) Elasticsearch	8	462	June 23, 2019
Search in many indices is too slow Elasticsearch	3	346	February 28, 2019
How to speed up query searching on text on an index with 50M+ documents? Elasticsearch	13	943	June 29, 2020
ES indexing times - ES v2.4.1 Elasticsearch	11	500	July 17, 2019
Search Performance of index with large documents (PDF's) Elasticsearch	4	1491	July 5, 2017

How long it should take to search 10Million docs?

Related topics