Wildcard / regexp in a phrase that contains a space


(Anoop Valluthadam) #1

How do I match phrases using wildcards?
Example:
I want to search for clean* car* in

  1. Cleaning car and room
  2. Cleaning room and car

It should return only #1.

What I tried returns both documents, because it searches for clean* AND car* independently.

My mapping:

PUT my_test
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "text",
              "index": true
            },
            "raw2": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_test/my_type/1
{
"city": "Cleaning room and car"
}
PUT my_test/my_type/2
{
"city": "Cleaning car and room"
}


(David Pilato) #2

I'd probably use an edge ngram based analyzer and a phrase query. That should work, I guess; I never tried it, though.
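
What an edge ngram analyzer produces can be inspected with the `_analyze` API. A sketch, assuming a hypothetical index `my_test` that defines an edge_ngram-based analyzer named `my_analyzer` (neither exists yet at this point in the thread):

```
GET my_test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "cleaning car"
}
```

This should return the prefix grams of each word (e.g. cl, cle, clea, ... for cleaning), which is what lets a phrase query on bare prefixes like clean car match without any explicit wildcard.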


(Anoop Valluthadam) #3

Thanks. Let me try


(Anoop Valluthadam) #4

Does ES directly support this kind of wildcard?

None of the forums has a clear answer for this, hence the direct question. Does ES support this?


(Anoop Valluthadam) #5

David,

It is not working with ngram.

If I search for clean* car*, I want to match only "Cleaning car and room", even though the second option, "Cleaning room and car", also contains both words. Even with ngram, it matches both docs.


(David Pilato) #6

Could you provide a full recreation script as described in About the Elasticsearch category. It will help to better understand what you are doing. Please, try to keep the example as simple as possible.


(Anoop Valluthadam) #7

Create an index:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "keyword",
          "fields": {
            "raw": { 
              "type":  "text",
              "analyzer": "my_analyzer"
            }
          }
        }
      }
    }
      }
    }
  }
}


POST my_index/my_type/1
{
  "text": "2 #Quick Foxes lived and died"
}

POST my_index/my_type/2
{
  "text": "2 #Quick Foxes lived died"
}

Now when we search:

GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "f* d*",
      "fields": ["text.raw"]
    }
  }
}

Only ID 2 should be returned, but nothing comes back.

When I try this:

GET my_index/my_type/_search
{
  "query": {
    "query_string": {
      "default_operator": "AND",
      "query": "f* d*",
      "fields": ["text"]
    }
  }
}

it returns both documents.


(David Pilato) #8

But you did not try a phrase query here.


(Anoop Valluthadam) #9

Are you talking about something like this?

GET my_index_1/my_type/_search
{
  "query": {
    "match_phrase": {
      "text.raw": "f* d*"
    }
  }
}

Anyway, the above query is not working.


(David Pilato) #10

Yes, but not with any wildcards.


(Anoop Valluthadam) #11

David,

I am looking for wildcard support. Is there any option that supports the kinds of wildcards I mentioned in the previous posts?

The requirement is something like this: we have a huge data set of movies, and customers will search it using wildcards. How do we support those kinds of searches? Examples are given above. Even when a phrase contains wildcards, I want to support it.
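
For the record, one option that does support wildcarded phrases directly (a sketch, not discussed in this thread, reusing the my_test index from the first post) is a span query: span_multi wraps a wildcard query so it can be used as a clause of span_near, which enforces order and adjacency:

```
GET my_test/my_type/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_multi": { "match": { "wildcard": { "city": "clean*" } } } },
        { "span_multi": { "match": { "wildcard": { "city": "car*" } } } }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}
```

With slop 0 and in_order true, this should match "Cleaning car and room" but not "Cleaning room and car". Be aware that span_multi with wildcards can get expensive on large indices.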


(David Pilato) #12

So, I don't know off the top of my head, and I'd probably need to test. Unfortunately I don't have a lot of time at the moment.

But let me share my thoughts about search engines. An end user should never, ever have to worry about wildcards. As an example, when I'm searching for a movie on Google, I never use wildcards; instead, the search engine helps me find what I'm looking for.

I'm not saying that you should not do it. Maybe you have a different use case than a regular search, but I just wanted to share my thoughts.


(Anoop Valluthadam) #13

Everything makes sense, but I still couldn't find any solution to my problem in any of the blogs or forums. It looks like either nobody has tried this (with such a big user base, I can't believe that nobody has) or Elasticsearch does not support it.


(David Pilato) #14

Here is what I meant:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 5,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "liv die"
    }
  }
}

It gives:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 3.231833,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 3.231833,
        "_source": {
          "text": "2 #Quick Foxes lived died"
        }
      }
    ]
  }
}

(Anoop Valluthadam) #15

Great. You mean to say that it will work without * and have the same effect as a wildcard. :slight_smile:


(Anoop Valluthadam) #16

It has issues.

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
PUT test/doc/3
{
  "text": "2 #Quick Foxes lived died and resurrected their wys "
}
PUT test/doc/4
{
  "text": """Sports Report:
Cricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."""
}

And the query is

GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "th wa"
    }
  }
}

Result is

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 4.558014,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "4",
        "_score": 4.558014,
        "_source": {
          "text": "Sports Report:\nCricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "3",
        "_score": 3.9283106,
        "_source": {
          "text": "2 #Quick Foxes lived died and resurrected their wys "
        }
      }
    ]
  }
}

Of course the second one is wrong, because the query was

th wa

so it should not return anything that only contains

wy

should it?
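
A plausible explanation (my reading of the mapping above, not stated in the thread): since no search_analyzer is set, the query string is run through my_analyzer as well, so th wa is expanded into edge grams including the single letters t and w, and those short grams are enough to match their wys. The query-side tokens can be inspected with _analyze:

```
GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "th wa"
}
```

which should list the tokens t, th, w, wa.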


(David Pilato) #17

Sure.

This is what you want:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase", "ngram"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "simple"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #Quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
PUT test/doc/3
{
  "text": "2 #Quick Foxes lived died and resurrected their wys "
}
PUT test/doc/4
{
  "text": """Sports Report:
Cricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."""
}
POST test/_refresh
GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "th wa"
    }
  }
}

It gives:

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.8853602,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "4",
        "_score": 1.8853602,
        "_source": {
          "text": "Sports Report:\nCricket - The Adelaide Strikers have blasted their way back to the top of the Big Bash ladder, after beating the Melbourne Stars."
        }
      }
    ]
  }
}
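
The key change here is "search_analyzer": "simple". The built-in simple analyzer lowercases and splits on non-letters but produces no ngrams, so the query th wa stays as the two whole tokens th and wa, and the phrase match requires documents whose indexed edge grams contain exactly those prefixes in consecutive positions. Compare:

```
GET test/_analyze
{
  "analyzer": "simple",
  "text": "th wa"
}
```

This should return just th and wa, with no shorter grams left to cause false matches.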

(David Pilato) #18

BTW, in general I'd not start with only one letter; I'd use at least 2 letters (min_gram: 2) in the edge ngram.


(Anoop Valluthadam) #19

Thanks @dadoonet. Working on it. Will let you know.


(Anoop Valluthadam) #20

@dadoonet The solution you have given is fine for applications like word completion!

But say, for example, if I am searching for

rts Re

it should return

Sports Report

Any idea how I can do that?
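
A possible direction for infix matches like this (a sketch only, untested; the index name test_infix and filter name infix_ngram are made up): replace the edge_ngram filter with a plain ngram filter, which indexes substrings starting anywhere in a word rather than only at its beginning, so rts (from Sports) and re (from Report) become indexed terms:

```
PUT test_infix
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "infix_ngram" ]
        }
      },
      "filter": {
        "infix_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "simple"
        }
      }
    }
  }
}
```

The trade-off is index size: plain ngrams generate far more terms than edge ngrams, and newer Elasticsearch versions also cap max_gram - min_gram via index.max_ngram_diff.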