Wildcard query on keyword vs N-gram analyzer + multi-match

you-last-did · February 24, 2024, 6:06pm

Hi everyone. I have benefited a lot from the forum and now it's my first post

So I have some fields which look like these:

orderId: ABC-DEF-1234

date: 2024-02-24

lastUpdatedTime: 2024-02-24T12:09:48.763-05:00

My users would like to have the functionality that whatever they put in the search bar (say X), find them all the documents if any of those fields in a document contains exactly the input X. So an input like DE, 02-24, 09:48 will all find my example above.

And to come up with the best solution, I basically consider two things:

How/what to index these fields
How/what to construct my query on these fields

And to bear in mind, my data will probably have these constraints in production

Number of records - I expect it to be 100,000 ~ 500,000
Number of fields need such support as described above - around 10

After some research I basically have two options in mind:

1. Index the field as keyword and use wildcard query (not indexed as wildcard since does not meet the condition recommended for wildcard (value > 32KB or > 1 million records)

Index time:

Simply index those terms as keyword

Search time:

Use a wildcard query *X* for each of the fields

Q: I learned that beginning on * should be avoided or performance would take a hit. How bad of the performance are we talking about given my number of records (half a million) and ~10 columns, since I don't have an easy way right now to simulate the number of records that I will have in production?

2. Index the field as text with N-gram analyzer and use multi-match query

Index time:

Index all the fields with N-gram analyzer.
If I understanding correctly, a N-gram analyzer will tokenize ABC-DEF-1234 into A, AB, ABC, DEF-1, ABC-DEF-1234...(all the adjacent substring when with parameter minimum length 1 and maximum length)

Search time:

Do a multi-match query of user input on all these fields directly

Q: This seems to increase the index time significantly and also the storage cost for all those tokenized terms. However, if this would outperform option 1 in terms of query performance by a considerable margin, I could probably live with it.

So if my analysis is correct, I am inclined to Option 2 at the moment. But just want to share to see if there's a better way or a way which better trade off the performance and cost. Also please let me know if any of my assumptions is incorrect. Cheers!

system · March 23, 2024, 6:07pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Best Optimization for wildcard queries Elasticsearch	5	494	February 6, 2023
Wildcard searches Elasticsearch	6	1507	July 31, 2018
Feedback needed on my query Elasticsearch	3	554	August 18, 2017
Slow Query Performance Elasticsearch	2	74	October 21, 2024
Queries on keyword field are extremely slow Elasticsearch	5	1175	January 1, 2022

Wildcard query on keyword vs N-gram analyzer + multi-match

Related topics