How can I achieve fast substring search on a 1.5 billion documents index?

Hi everyone!

I’m running an Elasticsearch 9.1.0 cluster (6 shards, 0 replicas, refresh disabled) over roughly 1.5 billion email address documents. I need to support fast, case‐insensitive substring searches that match only contiguous character sequences (i.e. exact substrings), for example:

  • Should match when searching for johndoe:
xxxjohndoe@example.com
xxxjohndoexxx@example.com
johndoexxx@example.com
  • Should not match:
xxxjohnxxxdoexxx@example.com
john.doe@example.com
john_doe@example.com

What I’ve tried

  1. Index mapping
{
    "settings": {
        "index.number_of_shards": 6,
        "index.number_of_replicas": 0,
        "index.refresh_interval": "-1"
    },
    "mappings": {
        "properties": {
            "email": {
                "type": "keyword",
                "ignore_above": 256,
                "normalizer": "lowercase",
                "fields": {
                    "wc": {
                        "type": "wildcard",
                        "ignore_above": 256
                    }
                }
            },
            "gender": {
                "type": "keyword"
            }
        }
    }
}
  1. Search query
{
  "size": 500,
  "terminate_after": 1000,
  "track_total_hits": false,
  "_source": false,
  "fields": ["email", "gender"],
  "query": {
    "constant_score": {
      "filter": {
        "wildcard": {
          "email.wc": {
            "value": "*johndoe*",
            "case_insensitive": true
          }
        }
      }
    }
  }
}

Despite using the specialized wildcard field and "rewrite": "constant_score", each query still takes 20–30 seconds, which is far too slow for my needs.

What I’m looking for

  1. Suggestions on index / search structure that would give me fast, exact substring matching at this scale.
  2. Alternatives to wildcard queries, are there better ES features or plugins for this use case?

Thanks in advance for any help!

What size are the shards?

What is the average size of the indexed documents?

What is the specification of the cluster in terms of node count, RAM, CPU and type of storage used?

How many matches does a typical search return?

Does latency improve if you reduce the size of the result set?

Do you have any limitations on the length of the substring to search for? Can the substring cover any part of the email address, e.g. doe@exam?

Have you used the profile API to get some insights into what is taking up time?

random suggestion, but why not split someone@somedomain.com into 2 parts, on the @ symbol, and just search the before-the-@ part ? Unless you actually want to match cleverguy@johndoesmath.com ?

Even using the wildcard datatype, having a query starting with a leading wildcard is one of the worst possible queries that you can have and should be avoided.

You may need to change this and use a n-gram tokenizer as mentioned in the wildcard documentation notes.

You mentioned that your index have 6 shards and 0 replicas, but what is the size of the index? Also, replicas is something that can improve the search speed.

2 Likes

use infix mapping (n-gram) for the field, you can search and learn from the documentation.

1 Like