Hey,
thanks a bunch for the complete example, this makes things so easy to understand! Minor nit: Specifying the Elasticsearch version would help a lot.
So let's take this for a spin. Creating the index allows us to run the _analyze API to see what is stored in the inverted index:
GET test/_analyze
{
"text": [ "Wrocław", "Dolnośląskie", "53900" ],
"analyzer": "ascii_analyzer"
}
The response is:
{
"tokens" : [
{
"token" : "wroclaw",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "wrocław",
"start_offset" : 0,
"end_offset" : 7,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "dolnoslaskie",
"start_offset" : 8,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 101
},
{
"token" : "dolnośląskie",
"start_offset" : 8,
"end_offset" : 20,
"type" : "<ALPHANUM>",
"position" : 101
},
{
"token" : "53900",
"start_offset" : 21,
"end_offset" : 26,
"type" : "<NUM>",
"position" : 202
}
]
}
This looks good: it means that wroclaw and dolnoslaskie, without the special characters, are put in the inverted index alongside the original forms.
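To make the token output above concrete, here is a minimal Python sketch of what a lowercase plus ASCII-folding chain that keeps the original token produces. It is not the Elasticsearch implementation: the `FOLD` table and the `analyze` helper are hypothetical, cover only the characters in this example, and assume each value is a single word. The gap of 100 between array values is taken from the positions shown in the response.

```python
# Hypothetical folding table: only the characters needed for this example.
FOLD = str.maketrans({"ł": "l", "ś": "s", "ą": "a"})

# Assumed gap between array values, matching the positions 0, 101, 202 above.
POSITION_GAP = 100


def analyze(values):
    """Return (token, position) pairs for a list of single-word values."""
    tokens = []
    position = 0
    for i, value in enumerate(values):
        if i > 0:
            # Each new array value starts one step past the gap.
            position += POSITION_GAP + 1
        original = value.lower()
        folded = original.translate(FOLD)
        tokens.append((folded, position))
        if folded != original:
            # The original form is kept at the same position as the folded one.
            tokens.append((original, position))
    return tokens


tokens = analyze(["Wrocław", "Dolnośląskie", "53900"])
for token, position in tokens:
    print(token, position)
```

Because both forms sit at the same position, a search for either wroclaw or wrocław can hit the same document.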
So, maybe the query is the culprit? Let's use the explain API to find out more:
GET test/_explain/1
{
"query": {
"multi_match": {
"query": "wroclaw dolnoslaskie 53900",
"type": "cross_fields",
"operator": "and",
"fields": [
"City",
"County",
"PostCode"
]
}
}
}
returns
{
"_index" : "test",
"_type" : "_doc",
"_id" : "1",
"matched" : false,
"explanation" : {
"value" : 0.0,
"description" : "No matching clause",
"details" : [ ]
}
}
All right, so apparently no clause matches. Let's use the validate API to check which queries are created.
GET test/_validate/query?rewrite=true
{
"query": {
"multi_match": {
"query": "wroclaw dolnoslaskie 53900",
"type": "cross_fields",
"operator" : "and",
"fields": [
"City",
"County",
"PostCode"
]
}
}
}
returns
{
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"valid" : true,
"explanations" : [
{
"index" : "test",
"valid" : true,
"explanation" : "((+PostCode:wroclaw +PostCode:dolnoslaskie +PostCode:53900) | (+(City:wroclaw | County:wroclaw) +(City:dolnoslaskie | County:dolnoslaskie) +(City:53900 | County:53900)))"
}
]
}
For comparison, here is the same request without "operator": "and".
GET test/_validate/query?rewrite=true
{
"query": {
"multi_match": {
"query": "wroclaw dolnoslaskie 53900",
"type": "cross_fields",
"fields": [
"City",
"County",
"PostCode"
]
}
}
}
returns
{
"_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
},
"valid" : true,
"explanations" : [
{
"index" : "test",
"valid" : true,
"explanation" : "((PostCode:wroclaw PostCode:dolnoslaskie PostCode:53900) | ((City:wroclaw | County:wroclaw) (City:dolnoslaskie | County:dolnoslaskie) (City:53900 | County:53900)))"
}
]
}
OK, so this sheds some light on why the first query does not match. The query that gets constructed,
((+PostCode:wroclaw +PostCode:dolnoslaskie +PostCode:53900) | (+(City:wroclaw | County:wroclaw) +(City:dolnoslaskie | County:dolnoslaskie) +(City:53900 | County:53900)))
will not return any result: it requires all three terms to be found either in PostCode alone or across City and County, whereas the document presumably holds the postcode in one group and the names in the other. What I cannot tell you off the top of my head is why exactly the query is constructed this way.
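The two rewritten queries can be checked by hand with a small Python sketch. It is only a model of the boolean logic, not of Elasticsearch scoring, and the indexed terms in `DOC` are an assumption: they are what document 1 would contain if City, County and PostCode held the three values analyzed earlier. The names `DOC`, `GROUPS` and `matches` are made up for this illustration.

```python
# Assumed indexed terms for document 1 (hypothetical, based on the analyzed values).
DOC = {
    "City": {"wroclaw", "wrocław"},
    "County": {"dolnoslaskie", "dolnośląskie"},
    "PostCode": {"53900"},
}

TERMS = ["wroclaw", "dolnoslaskie", "53900"]

# Field groups exactly as they appear in the rewritten query.
GROUPS = [["PostCode"], ["City", "County"]]


def term_in_group(term, group):
    """A term matches a group if any field of that group contains it."""
    return any(term in DOC[field] for field in group)


def matches(require_all):
    """Evaluate the rewritten query: groups are alternatives ('|'),
    terms inside a group are all required ('+') or optional."""
    combine = all if require_all else any
    return any(
        combine(term_in_group(term, group) for term in TERMS)
        for group in GROUPS
    )


print(matches(require_all=True))   # operator "and"
print(matches(require_all=False))  # default operator
```

Under these assumptions the "and" variant fails in both groups: PostCode lacks the two names, and City/County lack 53900. The variant without the operator succeeds as soon as any single term is found.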