Hi everyone. I have benefited a lot from the forum and now it's my first post
So I have some fields which look like these:
orderId: ABC-DEF-1234
date: 2024-02-24
lastUpdatedTime: 2024-02-24T12:09:48.763-05:00
My users would like to have the functionality that whatever they put in the search bar (say X), find them all the documents if any of those fields in a document contains exactly the input X. So an input like DE
, 02-24
, 09:48
will all find my example above.
And to come up with the best solution, I basically consider two things:
- How/what to index these fields
- How/what to construct my query on these fields
And to bear in mind, my data will probably have these constraints in production
- Number of records - I expect it to be 100,000 ~ 500,000
- Number of fields need such support as described above - around 10
After some research I basically have two options in mind:
1. Index the field as keyword and use wildcard query (not indexed as wildcard since does not meet the condition recommended for wildcard (value > 32KB or > 1 million records)
Index time:
- Simply index those terms as keyword
Search time:
- Use a wildcard query
*X*
for each of the fields
Q: I learned that beginning on *
should be avoided or performance would take a hit. How bad of the performance are we talking about given my number of records (half a million) and ~10 columns, since I don't have an easy way right now to simulate the number of records that I will have in production?
2. Index the field as text with N-gram analyzer and use multi-match query
Index time:
- Index all the fields with N-gram analyzer.
- If I understanding correctly, a N-gram analyzer will tokenize
ABC-DEF-1234
intoA
,AB
,ABC
,DEF-1
,ABC-DEF-1234
...(all the adjacent substring when with parameter minimum length 1 and maximum length)
Search time:
- Do a multi-match query of user input on all these fields directly
Q: This seems to increase the index time significantly and also the storage cost for all those tokenized terms. However, if this would outperform option 1 in terms of query performance by a considerable margin, I could probably live with it.
So if my analysis is correct, I am inclined to Option 2 at the moment. But just want to share to see if there's a better way or a way which better trade off the performance and cost. Also please let me know if any of my assumptions is incorrect. Cheers!