Elasticsearch custom analyzer not working


#1

I am using Elasticsearch as my search engine, and I am trying to create a custom analyzer that simply lowercases the field value. The following is my code:

Create index and mapping

Create an index with a custom analyzer named test_lowercase:

curl -XPUT 'localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_lowercase": {
          "type": "pattern",
          "pattern": "^.*$"
        }
      }
    }
  }
}'

Create a mapping that uses the test_lowercase analyzer for the address field:

curl -XPUT 'localhost:9200/test/_mapping/Users' -d '{
  "Users": {
    "properties": {
      "name": {
        "type": "string"
      },
      "address": {
        "type": "string",
        "analyzer": "test_lowercase"
      }
    }
  }
}'

To verify that the test_lowercase analyzer works:

curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d '
Beijing China
'
{
  "tokens" : [ {
    "token" : "\nbeijing china\n",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 0
  } ]
}

As we can see, the string 'Beijing China' is indexed as a single lowercased term 'beijing china', so the test_lowercase analyzer seems to work fine.

To verify that the 'address' field is using the test_lowercase analyzer:

curl -XGET 'http://localhost:9200/test/_analyze?field=address&pretty' -d '
Beijing China
'
{
  "tokens" : [ {
    "token" : "\nbeijing china\n",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 0
  } ]
}
curl -XGET 'http://localhost:9200/test/_analyze?field=name&pretty' -d '
Beijing China
'
{
  "tokens" : [ {
    "token" : "beijing",
    "start_offset" : 1,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "china",
    "start_offset" : 9,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

As we can see, for the same string 'Beijing China', analyzing with field=address produces a single term 'beijing china', while analyzing with field=name produces two terms, 'beijing' and 'china'. So it seems the address field is using my custom analyzer 'test_lowercase'.

Insert a document into the test index to see if the analyzer works for documents

curl -XPUT 'localhost:9200/test/Users/12345?pretty' -d '{"name": "Jinshui Tang",  "address": "Beijing China"}'

Unfortunately, although the document was inserted successfully, the address field has not been analyzed correctly. I can't find the document with the following wildcard query:

curl -XGET 'http://localhost:9200/test/Users/_search?pretty' -d '
{
  "query": {
    "wildcard": {
      "address": "*beijing ch*"
    }
  }
}'
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

List all terms analyzed for the document:

So I ran the following command to see all terms of the document, and found that 'Beijing China' is not in the term vectors at all.

curl -XGET 'http://localhost:9200/test/Users/12345/_termvector?fields=*&pretty'
{
  "_index" : "test",
  "_type" : "Users",
  "_id" : "12345",
  "_version" : 3,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "name" : {
      "field_statistics" : {
        "sum_doc_freq" : 2,
        "doc_count" : 1,
        "sum_ttf" : 2
      },
      "terms" : {
        "jinshui" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 0,
            "start_offset" : 0,
            "end_offset" : 7
          } ]
        },
        "tang" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 1,
            "start_offset" : 8,
            "end_offset" : 12
          } ]
        }
      }
    }
  }
}

We can see that the name is correctly analyzed into two terms, 'jinshui' and 'tang', but the address is missing.

Can anyone please help? Is there anything I am missing?

Thanks a lot!


(Christoph) #2

Hi,

Thanks for the question. I think your problem comes from a slight misunderstanding of the pattern analyzer you use in the example. Note that the pattern parameter specifies the regex used to split the text into tokens, not a regex that matches the tokens themselves (docs).

In your example you specify a pattern that matches a whole line. The address string in the indexed document is a single line with no line breaks, so the pattern matches the entire value, all of it is treated as a separator, and not even one token is produced.

Note that in your analyzer test example you have line breaks:

curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d '
Beijing China
'
{
  "tokens" : [ {
    "token" : "\nbeijing china\n",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 0
  } ]
}

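You can see here that the token contains the line breaks you entered in your request on the command line. With those line breaks in the input the pattern does not match at all (by default, . does not match a line break), so nothing is split off and the whole input comes back as one token.

If, however, you send the request all on one line, you can see that no tokens are produced (the pattern matches the whole string, so all of it is consumed as a separator):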

curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d 'Beijing China'
{
  "tokens" : [ ]
}

This, however, is what the analyzer sees when you index document fields: the bare field value, with no surrounding line breaks.
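To make the two cases concrete, here is a rough offline sketch in Python. It is not Elasticsearch code: the function name pattern_analyze is made up for illustration, and Python's re module only stands in for the Java regex engine the analyzer actually uses. The assumption is that the pattern analyzer splits on every match of the pattern, drops empty pieces, and lowercases what is left.

```python
import re

def pattern_analyze(text, pattern=r"^.*$", lowercase=True):
    """Hypothetical stand-in for the pattern analyzer (an assumption, not the real API).

    The pattern is treated as a token separator: the text is split on every
    match, empty pieces are dropped, and the remaining tokens are lowercased.
    """
    pieces = re.split(pattern, text)
    tokens = [piece for piece in pieces if piece]
    return [token.lower() for token in tokens] if lowercase else tokens

# Single-line value, as in the indexed document: the pattern matches the
# whole string, so everything is a separator and nothing is left over.
print(pattern_analyze("Beijing China"))      # []

# Value wrapped in line breaks, as in the _analyze test: "." does not match
# a line break, so the pattern never matches and the input stays in one piece.
print(pattern_analyze("\nBeijing China\n"))  # ['\nbeijing china\n']
```

Both results line up with the _analyze responses in this thread: an empty token list for the one-line request, and a single token that still contains the line breaks for the multi-line one.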

If, instead of the pattern analyzer, you use a custom analyzer with a keyword tokenizer followed by a lowercase filter, the example doc you indexed is returned by your search:

{
    "analysis": {
      "analyzer": {
        "test_lowercase": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
}
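As a rough illustration of why this analyzer makes the wildcard search match, the sketch below mimics it offline in Python. Everything here is an assumption for illustration only: keyword_lowercase and wildcard_hits are made-up helpers, and fnmatchcase is just an approximation of wildcard matching (it also understands [..] classes, which the wildcard query does not). The idea it shows is that the keyword tokenizer keeps the whole value as one term, the lowercase filter lowercases it, and a wildcard pattern is compared against whole indexed terms.

```python
from fnmatch import fnmatchcase

def keyword_lowercase(text):
    """Hypothetical stand-in for keyword tokenizer + lowercase filter:
    the whole value becomes one lowercased term."""
    return [text.lower()] if text else []

def wildcard_hits(terms, pattern):
    """Return the terms that match the wildcard pattern as a whole.
    The pattern itself is not analyzed; it is compared to each term as-is."""
    return [term for term in terms if fnmatchcase(term, pattern)]

terms = keyword_lowercase("Beijing China")
print(terms)                                  # ['beijing china']

# One term containing the space, so a pattern spanning both words can match.
print(wildcard_hits(terms, "*beijing ch*"))   # ['beijing china']

# With the value split into separate words, no single term matches.
print(wildcard_hits(["beijing", "china"], "*beijing ch*"))  # []
```

The last call shows why the same wildcard would not match a field analyzed into separate words, such as name in the original mapping.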

#3

Thanks a lot, you explained why my analyzer is not working. I have changed to use the keyword tokenizer and lowercase filter to solve my problem. Thanks again.


(Doug Turnbull) #4

Jinshui, shameless plug: we created a tool, elyzer, to help debug these sorts of problems with custom analyzers. It might be helpful to you in the future.

