Wildcard / regexp in a phrase that contains spaces

How can I match phrases using wildcards?
Example:
I want to search for clean* car* in

  1. Cleaning car and room
  2. Cleaning room and car

It should return only 1.

When I tried, the search behaved like clean* AND car*.

My mapping:

PUT my_test
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "text",
              "index": "true"
            },
            "raw2": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_test/my_type/1
{
  "city": "Cleaning room and car"
}
PUT my_test/my_type/2
{
  "city": "Cleaning car and room"
}

I'd probably use an edge ngram based analyzer and a phrase query. That should work, I guess, though I have never tried it.
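
For anyone who wants to see what an edge ngram analyzer would actually put in the index, the `_analyze` API can be run with an ad hoc analysis chain. This is only a sketch: the `min_gram` and `max_gram` values are assumptions picked for illustration, not settings anyone in this thread has tested.

```console
# Sketch: min_gram 2 and max_gram 10 are illustrative assumptions
POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    { "type": "edge_ngram", "min_gram": 2, "max_gram": 10 }
  ],
  "text": "Cleaning car and room"
}
```

Each word should come back as a series of prefixes (cl, cle, clea and so on) that share the position of the original word, which is what lets a phrase query check word order on prefixes.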

Thanks. Let me try

Does ES directly support this kind of wildcard?

None of the forums has a clear answer for this, which is why I'm asking directly: will ES support this?

David,

It is not working with ngram.

If I search for clean* car*,
I want to match only "Cleaning car and room", even though the second document, "Cleaning room and car", also contains both words. Even with ngram, it matches both docs.

Could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

Create an index:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "keyword",
          "fields": {
            "raw": { 
              "type":  "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
  }
  
}


POST my_index/my_type/1
{
  "text": "2 #Quick Foxes lived and died"
}

POST my_index/my_type/2
{
  "text": "2 #Quick Foxes lived died"
}

Now when we search

GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator" : "AND",
       "query" : "f* d*",
      "fields": ["text.raw"]
    }
  }
}

Only ID 2 should be listed, but nothing is returned.
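
One thing worth checking here (an editorial guess, not something confirmed in the thread): the index above maps `city` and `city.raw`, while the two documents are indexed with a `text` field. If that is what happened, `text.raw` was never defined, since dynamic mapping would normally create `text` with a `text.keyword` sub-field. The mapping API shows which fields actually exist:

```console
# List the fields Elasticsearch really created for my_index
GET my_index/_mapping
```

If `text.raw` is missing from the response, a query restricted to that field has nothing to match.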

When you try this:

GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "f* d*",
      "fields": ["text"]
    }
  }
}

It returns both.

But you did not try a phrase query here.

Are you talking about something like this?

GET my_index_1/my_type/_search
{
  "query": {
    "match_phrase": {
       "text.raw" : "f* d*"
      
    }
  }
}

Anyway, the above query is not working.

Yes, but without any wildcard.

David,

I am looking for wildcard support. Is there any option that supports the wildcards I mentioned in the previous posts?

The requirement is something like this: we have a huge data set of movies, and customers will search using wildcards. How do we support those kinds of wildcards? Examples are given above. I want to support wildcards even inside a phrase.

I don't know off the top of my head, and I would probably need to test. Unfortunately, I don't have a lot of time right now.

But let me share my thoughts about search engines. An end user should never have to worry about wildcards. For example, when I'm searching for a movie on Google, I never use wildcards. Instead, the search engine helps me find what I'm looking for.

I'm not saying that you should not do it. Maybe you have a different use case than regular search, but I just wanted to share my thoughts.

Everything makes sense, but I still couldn't find a solution to my problem in any blog or forum. It looks like either nobody has tried this (with such a big user base, I can't believe that) or Elasticsearch does not support it.
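
For the literal requirement (clean* followed directly by car*), one option that does exist in Elasticsearch is a span query: `span_near` enforces order and distance, and `span_multi` lets each clause be a prefix or wildcard query. The sketch below is untested against the data in this thread and assumes the `my_test` index and `city` field from the first post. Span queries are not analyzed, so the prefixes are written in lowercase to match what the standard analyzer indexes.

```console
# Sketch only: slop 0 with in_order true means "clean*" must be immediately followed by "car*"
GET my_test/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_multi": { "match": { "prefix": { "city": { "value": "clean" } } } } },
        { "span_multi": { "match": { "prefix": { "city": { "value": "car" } } } } }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}
```

This should match "Cleaning car and room" but not "Cleaning room and car". One caveat: `span_multi` expands each prefix into the terms that match it, so very short prefixes on a large index can be slow or can hit the boolean clause limit.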

Here is what I meant by an edge ngram analyzer with a phrase query:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "liv die"
    }
  }
}

It gives:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 3.231833,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 3.231833,
        "_source": {
          "text": "2 #Quick Foxes lived died"
        }
      }
    ]
  }
}

Great. You mean that it works without * and has the same effect as a wildcard. :)

It has issues.

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
PUT test/doc/3
{
  "text": "2 #Quick Foxes lived died and resurrected their wys "
}
PUT test/doc/4
{
  "text": """Sports Report:
Cricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."""
}

And the query is

GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "th wa"
    }
  }
}

Result is

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 4.558014,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "4",
        "_score": 4.558014,
        "_source": {
          "text": "Sports Report:\nCricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "3",
        "_score": 3.9283106,
        "_source": {
          "text": "2 #Quick Foxes lived died and resurrected their wys "
        }
      }
    ]
  }
}

Of course, the second hit is wrong, because the query was

th wa

so it should not return anything with

wy

right?

Sure.

This is what you want:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "simple"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
PUT test/doc/3
{
  "text": "2 #Quick Foxes lived died and resurrected their wys "
}
PUT test/doc/4
{
  "text": """Sports Report:
Cricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."""
}
POST test/_refresh
GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "th wa"
    }
  }
}

It gives:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.8853602,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "4",
        "_score": 1.8853602,
        "_source": {
          "text": "Sports Report:\nCricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."
        }
      }
    ]
  }
}
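
The only change from the previous attempt is `"search_analyzer": "simple"`. One way to see why it matters is to run the query text through both analyzers. This assumes the `test` index created just above.

```console
# What the query text becomes if the edge ngram analyzer is also used at search time
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "th wa"
}

# What it becomes with the simple analyzer
GET test/_analyze
{
  "analyzer": "simple",
  "text": "th wa"
}
```

The first call should return t, th, w and wa, so a phrase query built from it can be satisfied by t or th followed by w or wa, and "their wys" qualifies through th and w. The second call should return only th and wa, which then have to line up with the prefixes stored at index time.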

BTW, in general I'd not start the edge ngram at one letter; I'd use at least 2 letters.

Thanks @dadoonet. Working on it. Will let you know.

@dadoonet The solution you gave works well for applications like word completion!

But say, for example, I search for

rts Re

It should return

Sports Report

Any idea how I can do that?
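
A possible direction for matching inside words, offered as an untested sketch: swap the `edge_ngram` filter for an `ngram` filter, which indexes substrings from every offset rather than only prefixes, and keep the `simple` search analyzer. The index name `test_infix`, the analyzer and filter names, and the gram sizes are made up for illustration. The requests use the same typed 6.x style as the rest of the thread.

```console
# Sketch: test_infix, my_infix_analyzer and infix_ngram are hypothetical names
PUT test_infix
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_infix_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "infix_ngram"]
        }
      },
      "filter": {
        "infix_ngram": { "type": "ngram", "min_gram": 2, "max_gram": 3 }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_infix_analyzer",
          "search_analyzer": "simple"
        }
      }
    }
  }
}
PUT test_infix/doc/1
{
  "text": "Sports Report: Cricket"
}
POST test_infix/_refresh
GET test_infix/_search
{
  "query": {
    "match_phrase": {
      "text": "rts Re"
    }
  }
}
```

With grams of 2 to 3 characters, "rts" and "re" are both indexed, at the positions of "Sports" and "Report", so the phrase should match. Query fragments longer than `max_gram` would not match as written, and widening the range costs index size; newer Elasticsearch versions also cap the spread with the `index.max_ngram_diff` setting.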