Wildcard / regexp in a phrase that contains a space


(Anoop Valluthadam) #1

How do I match phrases using wildcards?
Example:
I want to search for clean* car* in

  1. Cleaning car and room
  2. Cleaning room and car

It should return only #1.

What I tried returns both documents, because it searches for clean* AND car* independently.

My mapping:

PUT my_test
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "text",
              "index": true
            },
            "raw2": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_test/my_type/1
{
"city": "Cleaning room and car"
}
PUT my_test/my_type/2
{
"city": "Cleaning car and room"
}


(David Pilato) #2

I'd probably use an edge ngram based analyzer and a phrase query. That should work, I guess; I never tried it, though.
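
What an edge ngram analyzer produces can be inspected with the `_analyze` API. A sketch, assuming a hypothetical index `my_test` that defines an edge_ngram-based analyzer named `my_analyzer` (neither exists yet at this point in the thread):

```
GET my_test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "cleaning car"
}
```

This should return the prefix grams of each word (e.g. cl, cle, clea, ... for cleaning), which is what lets a phrase query on bare prefixes like clean car match without any explicit wildcard.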


(Anoop Valluthadam) #3

Thanks. Let me try


(Anoop Valluthadam) #4

Does ES directly support this kind of wildcard?

None of the forums has a clear answer for this, hence the direct question. Does ES support this?


(Anoop Valluthadam) #5

David,

It is not working with ngram.

If I search for clean* car*, I want to match only "Cleaning car and room", even though the second option, "Cleaning room and car", also contains both words. Even with ngram, it matches both docs.


(David Pilato) #6

Could you provide a full recreation script as described in About the Elasticsearch category. It will help to better understand what you are doing. Please, try to keep the example as simple as possible.


(Anoop Valluthadam) #7

Create an index:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "keyword",
          "fields": {
            "raw": { 
              "type":  "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
      }
    }
  }
}


POST my_index/my_type/1
{
  "text": "2 #Quick Foxes lived and died"
}

POST my_index/my_type/2
{
  "text": "2 #Quick Foxes lived died"
}

Now when we search:

GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "f* d*",
      "fields": ["text.raw"]
    }
  }
}

Only ID 2 should be returned, but nothing comes back.

When I try this:

GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "f* d*",
      "fields": ["text"]
    }
  }
}

it returns both documents.


(David Pilato) #8

But you did not try a phrase query here.


(Anoop Valluthadam) #9

Are you talking about something like this?

GET my_index_1/my_type/_search
{
  "query": {
    "match_phrase": {
      "text.raw": "f* d*"
    }
  }
}

Anyway, the above query is not working.


(David Pilato) #10

Yes, but not with any wildcards.


(Anoop Valluthadam) #11

David,

I am looking for wildcard support. Is there any option that supports the kinds of wildcards I mentioned in the previous posts?

The requirement is something like this: we have a huge data set of movies, and customers will search it using wildcards. How do we support those kinds of searches? Examples are given above. Even when a phrase contains wildcards, I want to support it.
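
For the record, one option that does support wildcarded phrases directly (a sketch, not discussed in this thread, reusing the my_test index from the first post) is a span query: span_multi wraps a wildcard query so it can be used as a clause of span_near, which enforces order and adjacency:

```
GET my_test/my_type/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_multi": { "match": { "wildcard": { "city": "clean*" } } } },
        { "span_multi": { "match": { "wildcard": { "city": "car*" } } } }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}
```

With slop 0 and in_order true, this should match "Cleaning car and room" but not "Cleaning room and car". Be aware that span_multi with wildcards can get expensive on large indices.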


(David Pilato) #12

So, I don't know off the top of my head, and I'd probably need to test. Unfortunately I don't have a lot of time at the moment.

But let me share my thoughts about search engines. An end user should never, ever have to worry about wildcards. As an example, when I'm searching for a movie on Google, I never use wildcards; instead, the search engine helps me find what I'm looking for.

I'm not saying that you should not do it. Maybe you have a different use case than a regular search, but I just wanted to share my thoughts.


(Anoop Valluthadam) #13

Everything makes sense, but I still couldn't find any solution to my problem in any of the blogs or forums. It looks like either nobody has tried this (with such a big user base, I can't believe that nobody has) or Elasticsearch does not support it.


(David Pilato) #14

Here is what I meant:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "liv die"
    }
  }
}

It gives:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 3.231833,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 3.231833,
        "_source": {
          "text": "2 #Quick Foxes lived died"
        }
      }
    ]
  }
}

(Anoop Valluthadam) #15

Great. You mean to say that it will work without * and have the same effect as a wildcard. :slight_smile:


(Anoop Valluthadam) #16

It has issues.

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
PUT test/doc/3
{
  "text": "2 #Quick Foxes lived died and resurrected their wys "
}
PUT test/doc/4
{
  "text": """Sports Report:
Cricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."""
}

And the query is

GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "th wa"
    }
  }
}

Result is

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 4.558014,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "4",
        "_score": 4.558014,
        "_source": {
          "text": "Sports Report:\nCricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "3",
        "_score": 3.9283106,
        "_source": {
          "text": "2 #Quick Foxes lived died and resurrected their wys "
        }
      }
    ]
  }
}

Of course the second one is wrong, because the query was

th wa

so it should not return anything that only contains

wy

should it?
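
A plausible explanation (my reading of the mapping above, not stated in the thread): since no search_analyzer is set, the query string is run through my_analyzer as well, so th wa is expanded into edge grams including the single letters t and w, and those short grams are enough to match their wys. The query-side tokens can be inspected with _analyze:

```
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "th wa"
}
```

which should list the tokens t, th, w, wa.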


(David Pilato) #17

Sure.

This is what you want:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "simple"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
PUT test/doc/3
{
  "text": "2 #Quick Foxes lived died and resurrected their wys "
}
PUT test/doc/4
{
  "text": """Sports Report:
Cricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."""
}
POST test/_refresh
GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "th wa"
    }
  }
}

It gives:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.8853602,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "4",
        "_score": 1.8853602,
        "_source": {
          "text": "Sports Report:\nCricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."
        }
      }
    ]
  }
}
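
The key change here is "search_analyzer": "simple". The built-in simple analyzer lowercases and splits on non-letters but produces no ngrams, so the query th wa stays as the two whole tokens th and wa, and the phrase match requires documents whose indexed edge grams contain exactly those prefixes in consecutive positions. Compare:

```
GET test/_analyze
{
  "analyzer": "simple",
  "text": "th wa"
}
```

This should return just th and wa, with no shorter grams left to cause false matches.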

(David Pilato) #18

BTW, in general I'd not start with only one letter; I'd use at least 2 letters (min_gram: 2) in the edge ngram.


(Anoop Valluthadam) #19

Thanks @dadoonet. Working on it. Will let you know.


(Anoop Valluthadam) #20

@dadoonet The solution you have given is fine for applications like word completion!

But say, for example, if I am searching for

rts Re

it should return

Sports Report

Any idea how I can do that?
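
A possible direction for infix matches like this (a sketch only, untested; the index name test_infix and filter name infix_ngram are made up): replace the edge_ngram filter with a plain ngram filter, which indexes substrings starting anywhere in a word rather than only at its beginning, so rts (from Sports) and re (from Report) become indexed terms:

```
PUT test_infix
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "infix_ngram" ]
        }
      },
      "filter": {
        "infix_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "simple"
        }
      }
    }
  }
}
```

The trade-off is index size: plain ngrams generate far more terms than edge ngrams, and newer Elasticsearch versions also cap max_gram - min_gram via index.max_ngram_diff.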