Hi, I'm dealing with unstructured documents where we index the text content of each document. Right now we have a fairly basic configuration that uses the standard analyzer on the body of text. This works pretty well; however, we're also looking to support "contains" (within a word) type searching for some particular use cases. In particular, we're looking to extract numbers that are embedded within letters, e.g.:
ABC1234EFG
The goal is to be able to search by "1234" and get results.
Wondering about different approaches here:
The first thought was wildcards, though they're discouraged from a performance standpoint, especially when the wildcard is at the front of the search term.
Using ngrams. The main concerns here are term explosion (index size and indexing performance, although we're not too concerned about write performance), deciding the correct min/max size, and maybe additional search noise.
Some sort of custom (or built-in, if one exists?) tokenizer or filter that can take ABC1234EFG and produce ABC 1234 EFG. This is in a body of text though, so we would still want the standard tokenizer behavior on other words, e.g.:
"This is my document ABC1234EFG"
Something with fuzziness?
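To make the tokenizer/filter idea concrete, here is a rough sketch of the ABC1234EFG to ABC 1234 EFG split. Elasticsearch does have a built-in `word_delimiter_graph` token filter with a `split_on_numerics` option that splits at letter/number transitions; the analyzer and filter names below (`body_with_numeric_split`, `split_letters_digits`) are made up for illustration, and the Python function only approximates the filter locally (whitespace splitting stands in for the standard tokenizer). I haven't verified this exact configuration against a cluster, so treat it as an assumption.

```python
import re

# Hypothetical index settings (names are illustrative, not from the thread).
# word_delimiter_graph and split_on_numerics are real Elasticsearch options;
# whether this exact combination suits a given index is an assumption.
SETTINGS = {
    "analysis": {
        "filter": {
            "split_letters_digits": {
                "type": "word_delimiter_graph",
                "split_on_numerics": True,      # ABC1234EFG -> ABC, 1234, EFG
                "split_on_case_change": False,  # leave camelCase words alone
            }
        },
        "analyzer": {
            "body_with_numeric_split": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["split_letters_digits", "lowercase"],
            }
        },
    }
}


def split_letters_digits(token):
    """Split a token at every letter/digit boundary (ASCII only)."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def simulate_analyzer(text):
    """Local approximation: whitespace split, letter/digit split, lowercase."""
    terms = []
    for token in text.split():
        for part in split_letters_digits(token):
            terms.append(part.lower())
    return terms


if __name__ == "__main__":
    print(simulate_analyzer("This is my document ABC1234EFG"))
```

With something like this, a plain term search for "1234" would match without wildcards, but it only helps when the number sits between letters; it is not a general "contains" search.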
Then, as a general question: what do people generally do when users want "contains" type searching beyond full-word matching on a full body of text (so you generally cannot assume much about the value of the field other than that it's a blob of text)?
I'm also curious how people tend to handle users requesting wildcard-type search when using Elasticsearch for user-facing applications. This came up because when users don't know the exact word to match, they reach for wildcards (and ask for them as a feature in the search syntax). A wildcard placed at the front of a term would perform poorly. A wildcard placed at the end (or in the middle) of a term could perform okay, and it could also be combined with the index prefixes feature to improve performance - https://www.elastic.co/guide/en/elasticsearch/reference/current/index-prefixes.html
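For reference, a minimal sketch of what the index prefixes option looks like on a text field, plus a small local helper showing which prefixes would be covered for a given term. The field name `body` is an assumption; `min_chars` 2 and `max_chars` 5 are the documented defaults, written out explicitly. The helper is only an illustration of the idea, not how Elasticsearch stores the prefixes internally.

```python
# Hypothetical mapping: "body" is an assumed field name.
# index_prefixes, min_chars and max_chars are real mapping parameters.
MAPPING = {
    "mappings": {
        "properties": {
            "body": {
                "type": "text",
                "index_prefixes": {"min_chars": 2, "max_chars": 5},
            }
        }
    }
}


def indexed_prefixes(term, min_chars=2, max_chars=5):
    """Prefixes of `term` whose length falls within [min_chars, max_chars]."""
    longest = min(len(term), max_chars)
    return [term[:n] for n in range(min_chars, longest + 1)]


if __name__ == "__main__":
    print(indexed_prefixes("document"))
```

Prefix queries shorter than `min_chars` or longer than `max_chars` would not benefit from the extra indexed prefixes, so the range is a trade-off between index size and which queries get the speed-up.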
So I guess, for user-facing search applications, do you:
Let users specify wildcard placement? Whether this makes sense probably depends on how technical your users are. Do you only allow it at the end of terms (e.g. like SharePoint above)? That may be confusing as well. Do you combine this with something like index prefixes to keep response times low?
Not let users specify wildcard placement, and instead always do a prefix-type search (using index prefixes to keep responses fast) in combination with something like fuzziness to allow for word variation/misspellings? This doesn't allow for searches in the middle, e.g. foo*bar, but maybe gets most of the way there in terms of usability?
Something like reverse filtered edge ngrams to make front loaded wildcards acceptable?
Using ngrams (what size to pick, potential index overhead?)
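To put a rough number on the ngram overhead, here is a small back-of-the-envelope sketch that generates the character ngrams for a single token. It is a plain Python approximation, not the Elasticsearch ngram tokenizer itself (which has the real `min_gram` and `max_gram` parameters, and an `index.max_ngram_diff` index setting that defaults to 1). The sizes 3 and 4 are just an example choice.

```python
def ngrams(token, min_gram, max_gram):
    """All character ngrams of `token` with lengths min_gram..max_gram."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


if __name__ == "__main__":
    grams = ngrams("abc1234efg", 3, 4)
    print(len(grams))        # terms produced for one 10-character token
    print("1234" in grams)   # whether a search for "1234" could match
    print(len(ngrams("abc1234efg", 1, 10)))  # cost of an unrestricted range
```

For a 10-character token, a 3 to 4 range yields 15 terms while a 1 to 10 range yields 55, which gives a feel for how quickly the index grows when the min/max window is widened.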
I've never had to support user queries that start with wildcards, so I don't know how people do it. But I think that your 3rd guess is right: use the reverse token filter to turn your "wildcard prefix" problem into a "wildcard suffix" one. For a wildcard in the middle (XX*YYY) you could even use a simple heuristic, like checking whether the wildcard is closer to the start or the end of the term (and then using either the "regular" or the "reversed" tokens accordingly). This still does not solve the problem of *XXX* (a wildcard at both the front and the end); maybe n-grams (not from the edge) can be helpful there?
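Roughly, the heuristic could look like the sketch below. The field names `body`, `body.reversed` (assumed to be indexed with the `reverse` token filter) and `body.ngram` are hypothetical, and only `*` is handled, not `?`. It is meant to show the routing decision, not a tested query builder.

```python
def plan_wildcard(pattern):
    """Pick a (field, pattern) pair so the longest literal part leads.

    Field names are hypothetical: "body" holds regular tokens,
    "body.reversed" holds reversed tokens, "body.ngram" holds ngrams.
    """
    if "*" not in pattern:
        return ("body", pattern)
    # *XXX*: neither orientation gives a usable prefix, fall back to ngrams.
    if len(pattern) > 1 and pattern.startswith("*") and pattern.endswith("*"):
        return ("body.ngram", pattern.strip("*"))
    prefix_len = pattern.index("*")                      # literal chars before first *
    suffix_len = len(pattern) - pattern.rindex("*") - 1  # literal chars after last *
    if prefix_len >= suffix_len:
        # Wildcard is closer to the end: the regular tokens already work.
        return ("body", pattern)
    # Wildcard is closer to the start: reverse so the long suffix leads.
    return ("body.reversed", pattern[::-1])


if __name__ == "__main__":
    for p in ["foo*", "*1234", "XX*YYY", "*1234*"]:
        print(p, "->", plan_wildcard(p))
```

So `*1234` becomes `4321*` against the reversed field, and `XX*YYY` becomes `YYY*XX` there as well because the literal suffix is longer than the literal prefix.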