Hi, I'm working with unstructured documents where we index the full text of each document. Right now we have a fairly basic configuration that uses the standard analyzer on the body text. This works pretty well, but we're also looking to support "contains" (within a word) searching for some particular use cases. In particular, we want to match numbers that are embedded between letters, e.g. a value like ABC1234EFG.
The goal is to be able to search by "1234" and get results.
Wondering about different approaches here:
- Wildcards. This was my first thought, but wildcard queries are discouraged from a performance standpoint, especially when the wildcard is at the front of the search term.
- Ngrams. The main concerns here are term explosion (index size and indexing performance, although we're not too concerned about write performance), picking the right min/max gram sizes, and possibly extra noise in search results.
- Some sort of custom (or built-in, trying to find something?) tokenizer or filter that can take ABC1234EFG and produce ABC 1234 EFG. This is in a body of text though so we would still want the standard tokenizer behavior on other words e.g.:
"This is my document ABC1234EFG"
- Something with fuzziness?
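
To make the ngram and tokenizer/filter ideas a bit more concrete, here is a rough plain-Python sketch of the behavior I have in mind. It is not Elasticsearch configuration, and the function names (`split_letters_digits`, `tokenize`, `ngrams`) are made up for illustration. The word splitting is only an approximation of what the standard tokenizer does, and keeping the original token alongside its parts is my own assumption about what would be useful.

```python
import re

# Runs of ASCII letters or runs of digits inside a single token.
_RUNS = re.compile(r"[A-Za-z]+|[0-9]+")
# Crude stand-in for word-level tokenization (assumption: ASCII text only).
_WORDS = re.compile(r"[A-Za-z0-9]+")


def split_letters_digits(token):
    """Split a token at every letter/digit boundary."""
    return _RUNS.findall(token)


def tokenize(text, keep_original=True):
    """Lowercased word tokens, plus the sub-parts of mixed letter/digit tokens."""
    terms = []
    for word in _WORDS.findall(text):
        parts = split_letters_digits(word)
        if len(parts) > 1:
            # Mixed token: optionally keep the whole token, then add its parts.
            if keep_original:
                terms.append(word.lower())
            terms.extend(p.lower() for p in parts)
        else:
            terms.append(word.lower())
    return terms


def ngrams(token, min_gram, max_gram):
    """All character n-grams of the token with min_gram <= n <= max_gram."""
    return [
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    ]


print(tokenize("This is my document ABC1234EFG"))
# ['this', 'is', 'my', 'document', 'abc1234efg', 'abc', '1234', 'efg']

grams = ngrams("abc1234efg", 3, 5)
print(len(grams), "1234" in grams)
# 21 True
```

With the split approach the one mixed token adds three extra terms, while 3-to-5 character ngrams turn that same 10-character token into 21 terms, which is the term explosion I'm worried about if it gets applied to every word in the body.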
Then, as a more general question: what do people usually do when they want "contains" type searching, beyond full word matching, on a full body of text? In that situation we generally can't assume much about the value of the field other than that it's a blob of text.