Searching for "foo" should also find occurrences of "foo.bar"


(Marian Steinbach) #1

We have Elasticsearch 1.5 set up with a very simple mapping to perform full-text search in our docs. When searching for "swarmvars" we get no hits, although "swarmvars.json" appears in documents.

The field "text" is used as a catch-all field for all searchable content (title, document body, keywords). Here is the mapping:

"properties": {
  ...,
  "text": {
    "type": "string",
    "store": true,
    "index": "analyzed",
    "term_vector": "with_positions_offsets",
    "analyzer": "english",
  }
}

When using the "english" analyzer on the text "Text containing swarmvars.json and more", the result is these tokens:

text
contain
swarmvars.json
more

Having the token "swarmvars.json" is fine. What I need are two additional tokens "swarmvars" and "json". How can I achieve that?

I was looking into creating a custom tokenizer, but I was unable to get it to work (I got errors when applying the settings), and I couldn't find a working example anywhere, no matter how I searched.
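For anyone reproducing this: a token list like the one above can be checked with the `_analyze` API (the index name "my_index" below is a placeholder):

```shell
# Inspect which tokens the "english" analyzer produces for the sample text
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=english&pretty' \
  -d 'Text containing swarmvars.json and more'
```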

Thanks!


(Jason Wee) #2

The simple analyzer should be able to split that into tokens. Underneath it uses the lowercase tokenizer: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lowercase-tokenizer.html

It divides text at non-letters and converts them to lower case.

$ curl -XGET 'localhost:9200/my_index/_analyze?analyzer=simple&pretty' -d 'Text containing swarmvars.json and more'
{
  "tokens" : [ {
    "token" : "text",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "containing",
    "start_offset" : 5,
    "end_offset" : 15,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "swarmvars",
    "start_offset" : 16,
    "end_offset" : 25,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "json",
    "start_offset" : 26,
    "end_offset" : 30,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "and",
    "start_offset" : 31,
    "end_offset" : 34,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "more",
    "start_offset" : 35,
    "end_offset" : 39,
    "type" : "word",
    "position" : 6
  } ]
}
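Note that the simple analyzer drops the combined token "swarmvars.json" entirely, while the original question asks to keep it alongside "swarmvars" and "json". One option (a sketch, not tested against 1.5) is a custom analyzer using the word_delimiter token filter with preserve_original set to true, which emits both the split parts and the unsplit token. The index, filter, and analyzer names below are placeholders:

```shell
# Sketch: custom analyzer that keeps "swarmvars.json" as well as its parts.
# "my_index", "keep_parts" and "text_with_parts" are placeholder names.
curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "keep_parts": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "text_with_parts": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["keep_parts", "lowercase"]
        }
      }
    }
  }
}'
```

Since word_delimiter splits on non-alphanumeric characters, "swarmvars.json" should produce "swarmvars.json", "swarmvars", and "json". One trade-off: this loses the stemming the "english" analyzer provides, so you may want to append a stemmer filter to the chain.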

hth

jason
