How can I achieve fast substring search on a 1.5 billion documents index?

JoshMax · August 6, 2025, 6:18pm

Hi everyone!

I’m running an Elasticsearch 9.1.0 cluster (6 shards, 0 replicas, refresh disabled) over roughly 1.5 billion email address documents. I need to support fast, case‐insensitive substring searches that match only contiguous character sequences (i.e. exact substrings), for example:

Should match when searching for johndoe:

xxxjohndoe@example.com
xxxjohndoexxx@example.com
johndoexxx@example.com

Should not match:

xxxjohnxxxdoexxx@example.com
john.doe@example.com
john_doe@example.com

What I’ve tried

Index mapping

{
    "settings": {
        "index.number_of_shards": 6,
        "index.number_of_replicas": 0,
        "index.refresh_interval": "-1"
    },
    "mappings": {
        "properties": {
            "email": {
                "type": "keyword",
                "ignore_above": 256,
                "normalizer": "lowercase",
                "fields": {
                    "wc": {
                        "type": "wildcard",
                        "ignore_above": 256
                    }
                }
            },
            "gender": {
                "type": "keyword"
            }
        }
    }
}

Search query

{
  "size": 500,
  "terminate_after": 1000,
  "track_total_hits": false,
  "_source": false,
  "fields": ["email", "gender"],
  "query": {
    "constant_score": {
      "filter": {
        "wildcard": {
          "email.wc": {
            "value": "*johndoe*",
            "case_insensitive": true
          }
        }
      }
    }
  }
}

Despite using the specialized wildcard field and "rewrite": "constant_score", each query still takes 20–30 seconds, which is far too slow for my needs.

What I’m looking for

Suggestions on index / search structure that would give me fast, exact substring matching at this scale.
Alternatives to wildcard queries, are there better ES features or plugins for this use case?

Thanks in advance for any help!

Christian_Dahlqvist · August 6, 2025, 6:36pm

What size are the shards?

What is the average size of the indexed documents?

What is the specification of the cluster in terms of node count, RAM, CPU and type of storage used?

How many matches does a typical search return?

Does latency improve if you reduce the size of the result set?

Do you have any limitations on the length of the substring to search for? Can the substring cover any part of the email address, e.g. doe@exam?

Have you used the profile API to get some insights into what is taking up time?

RainTown · August 6, 2025, 7:03pm

random suggestion, but why not split someone@somedomain.com into 2 parts, on the @ symbol, and just search the before-the-@ part ? Unless you actually want to match cleverguy@johndoesmath.com ?

leandrojmp · August 6, 2025, 7:38pm

Even using the wildcard datatype, having a query starting with a leading wildcard is one of the worst possible queries that you can have and should be avoided.

You may need to change this and use a n-gram tokenizer as mentioned in the wildcard documentation notes.

You mentioned that your index have 6 shards and 0 replicas, but what is the size of the index? Also, replicas is something that can improve the search speed.

chongshengdz · August 10, 2025, 12:34pm

use infix mapping (n-gram) for the field, you can search and learn from the documentation.

Topic		Replies	Views
Case-insensitive search of a substring in many (or all) fields Elasticsearch	8	4696	May 11, 2020
SELECT * FROM TABLE WHERE COL LIKE '%abc%' Elasticsearch	7	881	January 27, 2017
Elastic search parse a string Elasticsearch	9	942	July 24, 2022
Searching emails by domain Elasticsearch	4	4640	January 17, 2017
How to search for text within text? Elasticsearch	18	113	September 16, 2024

How can I achieve fast substring search on a 1.5 billion documents index?

What I’m looking for

Related topics