I would like to search for instances of (for example) "hyper-space" but not find "hyper space". Every common tokenizer except whitespace splits on the hyphen, so the hyphen is lost as far as indexing goes. I can use a whitespace tokenizer and then apply a word delimiter token filter with the preserve_original setting. That gets "hyper-space", "hyper" and "space" all indexed, which is great. Now I can run a term query for "hyper-space" and I get precisely what I need.
But...
Since I'm using a whitespace tokenizer with preserve_original, the preserved original token may also carry trailing punctuation, like "hyper-space,". Now the term query won't match because of that trailing comma. Yuck.
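For anyone who wants to reproduce this, here is a rough sketch using the _analyze API with inline definitions. The sample text is made up for illustration, and the filter is defined inline rather than taken from my real index settings. The preserved original token should come back with the comma still attached:

```
# Whitespace tokenizer followed by a word delimiter filter that keeps the original token
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter",
      "preserve_original": true
    }
  ],
  "text": "we jumped to hyper-space, then back"
}
```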
Every way I can think of to get around this involves expensive processing in Elasticsearch that I would prefer to avoid, such as filtering by character or using regexp. Is there a more intuitive solution to this problem that I am missing?
You can use an analyzer with a mapping character filter that replaces any dashes with a character that is not removed by the tokenizer, for example an underscore.
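A minimal sketch of such an index definition might look like the following. It assumes Elasticsearch 7 or later (typeless mappings) and the standard tokenizer, which keeps underscores inside a token. The char filter name dash_to_underscore is a made-up name for illustration; my_index, my_analyzer and my_field match the requests used in the rest of this reply:

```
# Map "-" to "_" before tokenizing, so "hyper-space" is indexed as the single token "hyper_space"
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "dash_to_underscore": {
          "type": "mapping",
          "mappings": ["- => _"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["dash_to_underscore"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
```

Because the same analyzer runs at search time for match queries, a query for "hyper-space" is rewritten to the same underscore form, and the standard tokenizer still drops trailing punctuation such as commas.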
A document containing "hyper-space" is then only found if you search for "hyper-space" with a dash:
```
# Test the analyzer
GET my_index/_analyze
{
  "text": "foo hyper-space bar",
  "analyzer": "my_analyzer"
}

# Index a document containing "hyper-space"
PUT my_index/_doc/1
{
  "my_field": "foo hyper-space bar"
}

# A query for just "hyper" does not return any hits
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper"
    }
  }
}

# A query for "hyper space" (without a dash) does not return any hits either
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper space"
    }
  }
}

# A query for "hyper-space" with a dash does return our document
GET my_index/_search
{
  "query": {
    "match": {
      "my_field": "hyper-space"
    }
  }
}
```
Thanks! I considered something like this, but I was wondering how expensive a character filter is. I've never included one because it feels like pretty heavy pre-processing.