Wildcard search not working for "-" in the wildcard


(sripri) #1

I am trying to do a wildcard search on three fields, and it looks like a search string containing "-" combined with a wildcard does not work.

I am using a Java REST client implementation. The query is as follows:

    curl -H 'Content-Type: application/json' -XGET 'http://xyz:9200/index_name/_search' -d '
    {
      "size": 1000,
      "query": {
        "bool": {
          "should": [
            { "wildcard": { "field1":  { "wildcard": "*wo-*", "boost": 1.0 } } },
            { "wildcard": { "field2":  { "wildcard": "*wo-*", "boost": 1.0 } } },
            { "wildcard": { "oper_no": { "wildcard": "*wo-*", "boost": 1.0 } } }
          ],
          "adjust_pure_negative": true,
          "boost": 1.0
        }
      },
      "sort": [ { "id_time": { "order": "desc" } } ]
    }'

Any help would be great.


(David Pilato) #2

Probably a question of the analyzer used. Check what tokens your analyzer is producing with the _analyze API.
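
For example, with the default standard analyzer (the index name and sample text here are illustrative):

```json
POST index_name/_analyze
{
  "analyzer": "standard",
  "text": "wo-123"
}
```

The standard analyzer splits on "-" and lowercases, so this typically produces the tokens `wo` and `123`. If that is what your fields use, no indexed term contains a hyphen, which would explain why `*wo-*` matches nothing.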

Could you provide a full recreation script as described in About the Elasticsearch category? It will help us better understand what you are doing. Please try to keep the example as simple as possible.

BTW, `*wo-*` is one of the worst queries you can run on Elasticsearch, as the docs say:

Note that this query can be slow, as it needs to iterate over many terms. In order to prevent extremely slow wildcard queries, a wildcard term should not start with one of the wildcards * or ?.


(sripri) #4

Thanks for your feedback. I realize that search by wildcard is probably the worst choice. However, I checked, and Elasticsearch seems to support only prefix-based search as the closest alternative. Any suggestions for an alternative that can fetch results based upon the occurrence of one string in another?
I am confused as to what you need as a full recreation script. The curl command encompasses the action needed. If more detail is needed, please let me know and I will gladly provide it.


(David Pilato) #5

You can look at ngram-based analyzers to simulate the wildcard behavior.

A full script is not just a simple query. It also contains index creation, data creation... as shown in the example. That's the only way for someone to reproduce your problem and propose a script that fixes it.


(sripri) #6

Logstash.conf:

    input {
      jdbc {
        type => "xxxxxx"
        jdbc_driver_library => "/opt/logstash/driver/ojdbc8.jar"
        jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
        jdbc_connection_string => "jdbc:oracle:thin:@//yyyyyy/<db>"
        jdbc_user => "ddd"
        jdbc_password => "dddd"
        jdbc_fetch_size => 1000
        schedule => "*/5 * * * *"
        statement => "select distinct a.rowid || b.rowid || u.rowid  as rid_obj, a.order_id, a.order_no, a.oper_key, a.oper_no, b.updt_userid, cast(b.TIME_STAMP as TIMESTAMP WITH LOCAL TIME ZONE) as id_time from tbl1 a, tbl2 b, tbl3 u where a.order_id = b.order_id and a.oper_key = b.oper_key and u.userid = b.updt_userid and cast(b.TIME_STAMP as TIMESTAMP WITH LOCAL TIME ZONE) > :sql_last_value "
        clean_run => false
        record_last_run => true
        sql_log_level => debug
        last_run_metadata_path => "/opt/logstash/lastrun/.logstash_jdbc_last_run"
      }
    }

    output {
      elasticsearch {
        hosts => "<host>:9200"
        index => "<xxxxx-{now/M{YYYY.MM}}>"
        document_id => "%{rid_obj}"
      }
    }

I checked the link you sent me, and it does not give any more specifics on what's needed. I hope the ingestion script will do. I am not setting anything specific in Elasticsearch. If you need more info, let me know; I am glad to send it.


(David Pilato) #7

I have absolutely no way to reproduce the problem you are seeing. Please build a simple example from scratch, as explained here: About the Elasticsearch category

    DELETE index

    PUT index/doc/1
    {
      "foo": "bar"
    }

    GET index/_search
    {
      "query": {
        "match": {
          "foo": "bar"
        }
      }
    }

This is a typical script that can be used by anyone to reproduce your problem and help you fix it. The more work you do on your side, the faster we can help you to solve the problem.

Anyway, here is a similar discussion I had recently about wildcards. You'll see a full example:


(sripri) #8

Thanks for the feedback. I ran the scripts, and still no luck with the wildcard. My test script follows:

    PUT test/doc/5
    {
      "text": """Sports Report:
    N-WO-001 The Adelaide Strikers have blasted their way back to the top of the
    Big Bash ladder, after beating the Melbourne Stars."""
    }

    GET test/_search
    {
      "query": {
        "wildcard": {
          "text": "*wo*"
        }
      }
    }

-----> This works, as expected.

    GET test/_search
    {
      "query": {
        "wildcard": {
          "text": "*wo-*"
        }
      }
    }

-----> This does not work.

I came across a posting saying that the standard analyzer indexes "xxx-yyy" as two tokens, xxx and yyy, stripping out the "-". If this is the case, what should the analyzer setting be in order to support wildcard patterns like "xx-"?


(sripri) #9

Running the ngram tokenizer produced this. I think this is what you were referring to earlier:

    POST _analyze
    {
      "tokenizer": "ngram",
      "text": "N-WO-001"
    }

This produces:

    {
      "tokens": [
        { "token": "N",  "start_offset": 0, "end_offset": 1, "type": "word", "position": 0 },
        { "token": "N-", "start_offset": 0, "end_offset": 2, "type": "word", "position": 1 },
        { "token": "-",  "start_offset": 1, "end_offset": 2, "type": "word", "position": 2 },
        { "token": "-W", "start_offset": 1, "end_offset": 3, "type": "word", "position": 3 },
        { "token": "W",  "start_offset": 2, "end_offset": 3, "type": "word", "position": 4 },
        { "token": "WO", "start_offset": 2, "end_offset": 4, "type": "word", "position": 5 },
        { "token": "O",  "start_offset": 3, "end_offset": 4, "type": "word", "position": 6 },
        { "token": "O-", "start_offset": 3, "end_offset": 5, "type": "word", "position": 7 },
        { "token": "-",  "start_offset": 4, "end_offset": 5, "type": "word", "position": 8 },
        { "token": "-0", "start_offset": 4, "end_offset": 6, "type": "word", "position": 9 },
        { "token": "0",  "start_offset": 5, "end_offset": 6, "type": "word", "position": 10 },
        { "token": "00", "start_offset": 5, "end_offset": 7, "type": "word", "position": 11 },
        { "token": "0",  "start_offset": 6, "end_offset": 7, "type": "word", "position": 12 },
        { "token": "01", "start_offset": 6, "end_offset": 8, "type": "word", "position": 13 },
        { "token": "1",  "start_offset": 7, "end_offset": 8, "type": "word", "position": 14 }
      ]
    }

This is exactly what I want. Now I have to figure out how to code the analyzer to use this.
How performant is ngram-based tokenization?


(David Pilato) #10

Much more performant than wildcard.


(sripri) #11

How do you automate the analyzer implementation? Do I have to create the index manually and then run the ngram scripts? I have set the index to roll over every month. Do I have to rerun the analyzer every time, or does it automatically carry over? What's the best practice? Any input would be great.


(David Pilato) #12

You can use an index template to tell Elasticsearch what to do anytime a new index is created.
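
A minimal sketch, assuming the monthly indices share a common name prefix (the template name, index pattern, and analyzer names here are illustrative; on older versions the pattern field is `template` rather than `index_patterns`):

```json
PUT _template/my_ngram_template
{
  "index_patterns": ["my_index-*"],
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 2
        }
      }
    }
  }
}
```

With this in place, every index whose name matches the pattern is created with the analysis settings already applied, so nothing has to be rerun when the monthly index rolls over.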


(sripri) #13

Ok. Some progress but still not sure how to get this to work completely.
I tested the following:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_ngram_tokenizer"
            }
          },
          "tokenizer": {
            "my_ngram_tokenizer": {
              "type": "nGram",
              "min_gram": "1",
              "max_gram": "2",
              "token_chars": []
            },
            "my_white_tokenizer": {
              "type": "whitespace"
            }
          }
        }
      }
    }

I populated some test data as follows:

    POST my_index/_bulk
    {"index":{"_index":"my_index","_type":"doc"}}
    {"content":"We are looking for C++ and C# developers"}
    {"index":{"_index":"my_index","_type":"doc"}}
    {"content":"We are looking for WO-D-N001-L124 developers"}
    {"index":{"_index":"my_index","_type":"doc"}}
    {"content":"We are looking for WO-C-N001-L123 developers"}
    {"index":{"_index":"my_index","_type":"doc"}}
    {"content":"We are looking for project managers"}

I would like to achieve the following:

Search for "wo-": works!

Search for "wo-c": DOES NOT WORK! It fetches all occurrences of w, o, c and -:

    "content": "We are looking for WO-C-N001-L123 developers"
    "content": "We are looking for C++ and C# developers"          <-- How do I prevent this?
    "content": "We are looking for WO-D-N001-L124 developers"      <-- How do I prevent this?

Search for "wo-d": same as above. I am not sure if there is a solution. Increasing the ngram size would make the matches more specific.


(David Pilato) #14

Change the search analyzer in your mapping to simple.
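
A possible shape for this, building on the test index from the earlier posts (illustrative only; note that `search_analyzer` applies to analyzed queries such as `match`, not to `wildcard`):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "tokenizer": "my_ngram_tokenizer" }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": 1,
          "max_gram": 2
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "simple"
        }
      }
    }
  }
}
```

The idea is that the document text is ngrammed at index time, while the query text is not, so a search for `wo-c` is no longer expanded into every one- and two-character fragment.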


(sripri) #15

I was able to get it to work by doing a multi_match (since I have many fields) with type phrase_prefix, in addition to the analyzer with custom ngrams. The name "phrase_prefix" is a bit misleading: I thought that something like "no-2" would only pick up "test no-2-500" etc., but my observation is that it also picks up "test xyz-no-2-500".

Still, it feels like I am learning to use a Swiss Army knife.
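
The query shape described above might look something like this (field names are illustrative, borrowed from the original curl):

```json
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "wo-c",
      "type": "phrase_prefix",
      "fields": ["field1", "field2", "oper_no"]
    }
  }
}
```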


(system) #16

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.