Hi everyone!
I’m running an Elasticsearch 9.1.0 cluster (6 shards, 0 replicas, refresh disabled) over roughly 1.5 billion email address documents. I need to support fast, case‐insensitive substring searches that match only contiguous character sequences (i.e. exact substrings), for example:
- Should match when searching for johndoe:
  - xxxjohndoe@example.com
  - xxxjohndoexxx@example.com
  - johndoexxx@example.com
- Should not match:
  - xxxjohnxxxdoexxx@example.com
  - john.doe@example.com
  - john_doe@example.com
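To pin down the semantics, the matching rule described above can be written as a plain Python check. This is only a sketch of the intended behaviour, not of how Elasticsearch evaluates it; the function name `is_substring_match` is made up for illustration, and `str.lower()` is used as a rough stand-in for the `lowercase` normalizer.

```python
def is_substring_match(needle: str, email: str) -> bool:
    """Return True if `needle` occurs in `email` as one contiguous,
    case-insensitive run of characters."""
    # Lowercase both sides so that "JohnDoe" and "johndoe" compare equal.
    return needle.lower() in email.lower()


should_match = [
    "xxxjohndoe@example.com",
    "xxxjohndoexxx@example.com",
    "johndoexxx@example.com",
]
should_not_match = [
    "xxxjohnxxxdoexxx@example.com",
    "john.doe@example.com",
    "john_doe@example.com",
]

if __name__ == "__main__":
    for email in should_match + should_not_match:
        print(email, is_substring_match("johndoe", email))
```

Any index or query design that is proposed should return exactly the first group and none of the second.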
What I’ve tried
- Index mapping
{
  "settings": {
    "index.number_of_shards": 6,
    "index.number_of_replicas": 0,
    "index.refresh_interval": "-1"
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "keyword",
        "ignore_above": 256,
        "normalizer": "lowercase",
        "fields": {
          "wc": {
            "type": "wildcard",
            "ignore_above": 256
          }
        }
      },
      "gender": {
        "type": "keyword"
      }
    }
  }
}
- Search query
{
  "size": 500,
  "terminate_after": 1000,
  "track_total_hits": false,
  "_source": false,
  "fields": ["email", "gender"],
  "query": {
    "constant_score": {
      "filter": {
        "wildcard": {
          "email.wc": {
            "value": "*johndoe*",
            "case_insensitive": true
          }
        }
      }
    }
  }
}
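For anyone who wants to reproduce this with other search terms, here is a minimal Python sketch that builds the same request body for an arbitrary term. The helper names `escape_wildcard` and `build_substring_query` are hypothetical, and the escaping step assumes that a backslash escapes `*`, `?` and `\` in the pattern, as it does in Lucene's wildcard syntax; I have not verified that for every version. It only builds the JSON and does not send it anywhere.

```python
import json


def escape_wildcard(term: str) -> str:
    """Backslash-escape wildcard metacharacters so they match literally.

    Assumption: '\\' is the escape character in the wildcard pattern.
    """
    out = []
    for ch in term:
        if ch in "\\*?":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def build_substring_query(term: str, size: int = 500,
                          terminate_after: int = 1000) -> dict:
    """Build the search body from the post for a given substring."""
    return {
        "size": size,
        "terminate_after": terminate_after,
        "track_total_hits": False,
        "_source": False,
        "fields": ["email", "gender"],
        "query": {
            "constant_score": {
                "filter": {
                    "wildcard": {
                        "email.wc": {
                            # Leading and trailing '*' make it a substring match.
                            "value": "*" + escape_wildcard(term) + "*",
                            "case_insensitive": True,
                        }
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_substring_query("johndoe"), indent=2))
```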
Despite using the specialized wildcard field and "rewrite": "constant_score", each query still takes 20–30 seconds, which is far too slow for my needs.
What I’m looking for
- Suggestions on index / search structure that would give me fast, exact substring matching at this scale.
- Alternatives to wildcard queries: are there better ES features or plugins for this use case?
Thanks in advance for any help!