Word_delimiter and catenate_all don't work?


(Emil) #1

My index settings and mapping are shown below:

{
  "state": "open",
  "settings": {
    "index": {
      "creation_date": "1457004137507",
      "analysis": {
        "filter": {
          "my_word_delimiter": {
            "catenate_all": "true",
            "split_on_numerics": "true",
            "split_on_case_change": "true",
            "type": "word_delimiter",
            "preserve_original": "true"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "filter": [
              "standard",
              "lowercase",
              "my_word_delimiter"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          }
        }
      },
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "uuid": "zAT_MukSSTyIVKXQz-7YKw",
      "version": {
        "created": "2020099"
      }
    }
  },
  "mappings": {
    "Product": {
      "properties": {
        "ProductCode": {
          "index": "not_analyzed",
          "type": "string"
        },
        "id": {
          "index": "no",
          "store": true,
          "type": "integer"
        },
        "Name": {
          "store": true,
          "type": "string"
        },
        "ShortDescription": {
          "type": "string"
        }
      }
    }
  }
}

I have a product named "Brother TN-2000 Toner Black". When I run the following query with "tn 2000" or "tn-2000", the product appears in the search results, but when I use "tn2000", nothing is returned. I thought that word_delimiter with catenate_all would put "tn2000" into the inverted index. What am I doing wrong? Can you please help me?

{"query":{"bool":{"should":[{"multi_match":{"type":"best_fields","query":"tn 2000","fields":["Name^7","ShortDescription^6"]}}]}}}

When I check the analyzer with the following curl command:
curl -XGET "localhost:9200/myIndex/_analyze?analyzer=my_analyzer&pretty=true" -d 'Brother TN-2000 Toner Black'

it returns:

{
"tokens" : [ {
"token" : "'brother",
"start_offset" : 0,
"end_offset" : 8,
"type" : "word",
"position" : 0
}, {
"token" : "brother",
"start_offset" : 1,
"end_offset" : 8,
"type" : "word",
"position" : 0
} ]
}
curl: (6) Could not resolve host: TN-2000
curl: (6) Could not resolve host: Toner
curl: (6) Could not resolve host: Black'


(Jimferenczi) #2

I tried your example and it works for me:

curl -XGET "localhost:9200/test/_analyze?analyzer=my_analyzer&pretty=true" -d 'Brother TN-2000 Toner Black'
{
  "tokens" : [ {
    "token" : "brother",
    "start_offset" : 0,
    "end_offset" : 7,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "tn-2000",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "tn",
    "start_offset" : 8,
    "end_offset" : 10,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "tn2000",
    "start_offset" : 8,
    "end_offset" : 15,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2000",
    "start_offset" : 11,
    "end_offset" : 15,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "toner",
    "start_offset" : 16,
    "end_offset" : 21,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "black",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "word",
    "position" : 4
  } ]
}
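
As a rough illustration of why tn2000 appears in the token list above, here is a minimal Python sketch of what word_delimiter does to a single, already lowercased token when catenate_all and preserve_original are enabled. The function name and the regex-based splitting are assumptions made for illustration only. The real Lucene filter handles many more cases (case changes, non-ASCII letters, positions and offsets), and the order of the emitted terms is not modeled here.

```python
import re

def word_delimiter_sketch(token, catenate_all=True, preserve_original=True):
    """Simplified, hypothetical model of word_delimiter for one token."""
    # Split on non-alphanumeric characters and on letter/digit boundaries.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    terms = []
    if preserve_original:
        terms.append(token)
    terms.extend(parts)
    if catenate_all and len(parts) > 1:
        # catenate_all joins every sub-part back into one term.
        terms.append("".join(parts))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))

print(word_delimiter_sketch("tn-2000"))
# ['tn-2000', 'tn', '2000', 'tn2000']
```

A token with nothing to split, such as "brother", comes back unchanged as a single term.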

Are you sure you copied and pasted the exact curl command you executed?
The problem seems to be related to your curl command (could not resolve host...). Try using double quotes instead.
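
One way to sidestep shell quoting differences entirely is to send the same request from a script. The sketch below only builds the request with Python's standard library. The helper name, the default index name "test" and the host are assumptions for illustration, and actually sending it requires a running node.

```python
import urllib.parse
import urllib.request

def build_analyze_request(text, index="test", analyzer="my_analyzer",
                          host="http://localhost:9200"):
    """Build (but do not send) an _analyze request like the curl call above."""
    query = urllib.parse.urlencode({"analyzer": analyzer, "pretty": "true"})
    url = f"{host}/{index}/_analyze?{query}"
    # The text to analyze goes in the request body, as with curl -d.
    return urllib.request.Request(url, data=text.encode("utf-8"), method="GET")

req = build_analyze_request("Brother TN-2000 Toner Black")
print(req.full_url)

# To send it against a running node (not executed here):
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Because the text is passed as a single Python string, spaces in it cannot be split into separate arguments the way they were in the failing curl call.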

The field 'Name' is not using your custom analyzer. Is that intended?
Could you also indent your example correctly?


(Emil) #3

Yes, double quotes made the difference, and with curl I now get the same results as you, but it still didn't solve my main problem. So this curl query shows what is in my inverted index, and all searches are executed against the inverted index, right? If tn2000 is in the inverted index, my multi_match query above should find it, so why doesn't it work? Do you have any idea?

When I use the head plugin and, under the Actions menu -> Test Analyzer, type Brother TN-2000 Toner Black, it returns only brother, tn, 2000, toner and black, but not tn2000, as shown in the image. It looks like it uses the standard analyzer.


(Jimferenczi) #4

Yes, that's because your field is not mapped to your analyzer. You should define your field like this:

"Name": {
  "store": true,
  "type": "string",
  "index_analyzer": "my_analyzer"
},

You'll need to reindex your data to see the changes...


(Emil) #6

This is how I attempted it:

{
  "Product": {
    "_source": {
      "enabled": true
    },
    "properties": {
      "id": {
        "store": true,
        "index": "no",
        "type": "integer"
      },
      "Name": {
        "index_analyzer": "my_analyzer",
        "type": "string"
      },
      "ShortDescription": {
        "index_analyzer": "my_analyzer",
        "type": "string"
      }
    }
  }
}

but it returns HTTP 400 with the following message:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"Mapping definition for [Name] has unsupported parameters: [index_analyzer : my_analyzer]"}],"type":"mapper_parsing_exception","reason":"Mapping definition for [Name] has unsupported parameters: [index_analyzer : my_analyzer]"},"status":400}


(Emil) #7

Using analyzer instead of index_analyzer, as below, worked fine. What is the difference?

  "Name": {
    "analyzer": "my_analyzer",
    "type": "string"
  },

(Jimferenczi) #8

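Yes, sorry: as the error says, index_analyzer is not supported in your version. You need to define analyzer (applied at index time) together with search_analyzer (applied to the query text) instead: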

"Name": {
  "store": true,
  "type": "string",
  "analyzer": "my_analyzer",
  "search_analyzer": "standard"
}
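
To see why this combination finds the product, the following Python sketch compares the terms stored at index time with the terms each query would produce. The indexed term sets are copied from the _analyze output and the head plugin test earlier in this thread. The query-side token lists are written out by hand as assumptions about what the standard analyzer emits, and nothing here calls Elasticsearch.

```python
# Terms my_analyzer produced for "Brother TN-2000 Toner Black"
# (copied from the _analyze output earlier in the thread).
custom_indexed = {"brother", "tn-2000", "tn", "tn2000", "2000", "toner", "black"}

# Terms reported by the head plugin test, i.e. what the field held
# before it was mapped to my_analyzer.
standard_indexed = {"brother", "tn", "2000", "toner", "black"}

# Assumed query-side terms from the standard analyzer (hand-written, not computed).
query_terms = {
    "tn 2000": ["tn", "2000"],
    "tn-2000": ["tn", "2000"],
    "tn2000": ["tn2000"],
}

def matches(terms, index):
    """With the default OR operator, one indexed term is enough for a hit."""
    return any(term in index for term in terms)

for query, terms in query_terms.items():
    print(query, matches(terms, custom_indexed), matches(terms, standard_indexed))
# tn 2000 True True
# tn-2000 True True
# tn2000 True False
```

The last row reproduces the original symptom: "tn2000" only matches once the field is indexed with my_analyzer, because only then does the catenated term exist in the index.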

(Emil) #9

It worked like a charm. Thanks for your help.

