ES 5.0 wildcard query, problem with CJK characters


(Hacksign) #1

Suppose ES contains a document like this:

{'realname':'XZY'}

Note: X, Z and Y are CJK characters, NOT English letters.

To retrieve the item above, I wrote the DSL below:

    {
        "size" : 10,
        "query" : {
            "wildcard" : {
                "realname" : "X*"
            }
        }
    }

This works fine, but if the DSL is like this:

    {
        "size" : 10,
        "query" : {
            "wildcard" : {
                "realname" : "X*Y"
            }
        }
    }

it does not find anything.

Is anything wrong, or did I misunderstand something in this document?


(David Pilato) #2

Try the _analyze API to see how your document is actually indexed.
Then remember that the wildcard string is not analyzed, so it is compared as-is against the terms in that output.

Finally: don't use wildcards!
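
For readers who want to script the _analyze check suggested above, here is a minimal sketch that only builds the request and sends nothing. The helper name `build_analyze_request` is hypothetical, and the host `http://localhost:9200`, the index `dbs` and the field `realname` are assumptions taken from elsewhere in this thread.

```python
import json
from urllib.parse import quote


def build_analyze_request(host, index, field, text):
    """Return (url, body) for an _analyze call. Nothing is sent over the network."""
    url = "{}/{}/_analyze?pretty".format(host.rstrip("/"), quote(index))
    # "field" tells Elasticsearch to use the analyzer configured for that field,
    # so the tokens shown are the ones the wildcard pattern is compared against.
    body = json.dumps({"field": field, "text": text}, ensure_ascii=False)
    return url, body


# X, Z and Y stand in for the CJK characters, as in the question.
url, body = build_analyze_request("http://localhost:9200", "dbs", "realname", "XZY")
print(url)
print(body)
```

Passing the field that is actually queried (here assumed to be `realname`) matters, because a different field may be mapped with a different analyzer.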


(Hacksign) #3

Thanks for the reply.
This is the output of _analyze:

(X, Z and Y are still CJK characters.)

[root@host ~]# curl http://localhost:9200/dbs/_analyze?pretty -d '{"field":"some_field", "text":"XZY"}'
{
  "tokens" : [
    {
      "token" : "X",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "Z",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "Y",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    }
  ]
}

What confuses me is that
under Elasticsearch 2.3, a wildcard search like this:

    {
        "size" : 10,
        "query" : {
            "wildcard" : {
                "realname" : "X*Y"
            }
        }
    }

will return results.

But after upgrading ES to 5.0, only queries like the one below return results:

    {
        "size" : 10,
        "query" : {
            "wildcard" : {
                "realname" : "X*"
            }
        }
    }

If this is a problem related to the mapping and tokenization, why does "X*" return results while "X*Y" does not?


(David Pilato) #4

I don't know how it worked previously in the 2.x series.
Maybe the analyzer you were using was producing [ "XZY" ] instead of [ "X", "Z", "Y" ]?
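
To illustrate the difference this guess is pointing at, the sketch below builds the same wildcard query against two fields. The sub-field name `realname.keyword` is hypothetical: it is assumed to be a not-analyzed (keyword) variant of the field, which this thread does not show and which would have to exist in the mapping for the second query to match anything.

```python
import json


def wildcard_query(field, pattern, size=10):
    """Build the request body for a wildcard query on one field."""
    return {"size": size, "query": {"wildcard": {field: pattern}}}


# Against the analyzed field, the pattern is tested against each single-character term.
analyzed = wildcard_query("realname", "X*Y")

# Against a hypothetical keyword sub-field, the whole value "XZY" is a single term,
# so the same pattern has something to match.
whole_value = wildcard_query("realname.keyword", "X*Y")

print(json.dumps(whole_value, indent=4, ensure_ascii=False))
```

Only the field name differs between the two bodies. Whether the second one returns results depends entirely on how the index is mapped.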


(Hacksign) #5

As the _analyze API shows, the CJK text is analyzed as ['X', 'Z', 'Y'], not ['XZY'].
This seems to be the default analyzer's behaviour (it splits CJK text into single-character tokens).

So I am still confused about why the wildcard query does not return the correct result for 'X*Y'.


(David Pilato) #6

So if you have in the inverted index:

  • X
  • Y
  • Z

X*Y won't match any of those, right?
X*, Y*, Z* will.
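
The point above can be reproduced outside Elasticsearch with a small simulation. This is only a sketch of term-level matching, not Elasticsearch's actual implementation: it assumes `*` matches any run of characters, `?` matches exactly one, and the pattern must cover a whole term. The helper names are made up for the example.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern ('*' = any run, '?' = one character) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def matching_terms(pattern, terms):
    """Return the indexed terms that the pattern matches in full."""
    rx = wildcard_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]


# X, Z and Y stand in for the CJK characters, as in the question.
analyzed_terms = ["X", "Z", "Y"]  # what _analyze reported for "XZY"
single_term = ["XZY"]             # what a not-analyzed field would hold

print(matching_terms("X*", analyzed_terms))   # ['X']
print(matching_terms("X*Y", analyzed_terms))  # []
print(matching_terms("X*Y", single_term))     # ['XZY']
```

No single-character term both starts with X and ends with Y, so `X*Y` can only match when the whole value is indexed as one term.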

