Wildcard query on keyword vs N-gram analyzer + multi-match

Hi everyone. I have benefited a lot from the forum and now it's my first post :slight_smile:

So I have some fields which look like these:

orderId: ABC-DEF-1234

date: 2024-02-24

lastUpdatedTime: 2024-02-24T12:09:48.763-05:00

My users would like to have the functionality that whatever they put in the search bar (say X), find them all the documents if any of those fields in a document contains exactly the input X. So an input like DE, 02-24, 09:48 will all find my example above.

And to come up with the best solution, I basically consider two things:

  1. How/what to index these fields
  2. How/what to construct my query on these fields

And to bear in mind, my data will probably have these constraints in production

  1. Number of records - I expect it to be 100,000 ~ 500,000
  2. Number of fields need such support as described above - around 10

After some research I basically have two options in mind:

1. Index the field as keyword and use wildcard query (not indexed as wildcard since does not meet the condition recommended for wildcard (value > 32KB or > 1 million records)

Index time:

  • Simply index those terms as keyword

Search time:

  • Use a wildcard query *X* for each of the fields

Q: I learned that beginning on * should be avoided or performance would take a hit. How bad of the performance are we talking about given my number of records (half a million) and ~10 columns, since I don't have an easy way right now to simulate the number of records that I will have in production?

2. Index the field as text with N-gram analyzer and use multi-match query

Index time:

  • Index all the fields with N-gram analyzer.
  • If I understanding correctly, a N-gram analyzer will tokenize ABC-DEF-1234 into A, AB, ABC, DEF-1, ABC-DEF-1234...(all the adjacent substring when with parameter minimum length 1 and maximum length)

Search time:

  • Do a multi-match query of user input on all these fields directly

Q: This seems to increase the index time significantly and also the storage cost for all those tokenized terms. However, if this would outperform option 1 in terms of query performance by a considerable margin, I could probably live with it.

So if my analysis is correct, I am inclined to Option 2 at the moment. But just want to share to see if there's a better way or a way which better trade off the performance and cost. Also please let me know if any of my assumptions is incorrect. Cheers!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.