Elasticsearch custom analyzer not working

jinshui.tang · October 14, 2015, 9:24am

I am using Elasticsearch as my search engine, and I am trying to create a custom analyzer that simply lowercases the field value. The following is my code:

Create index and mapping

Create an index with a custom analyzer named test_lowercase:

curl -XPUT 'localhost:9200/test/' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "test_lowercase": {
          "type": "pattern",
          "pattern": "^.*$"
        }
      }
    }
  }
}'

Create a mapping that uses the test_lowercase analyzer for the address field:

curl -XPUT 'localhost:9200/test/_mapping/Users' -d '{
  "Users": {
    "properties": {
      "name": {
        "type": "string"
      },
      "address": {
        "type": "string",
        "analyzer": "test_lowercase"
      }
    }
  }
}'

To verify that the test_lowercase analyzer works:

curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d '
Beijing China
'
{
  "tokens" : [ {
    "token" : "\nbeijing china\n",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 0
  } ]
}

As we can see, the string 'Beijing China' is indexed as a single lowercased term 'beijing china', so the test_lowercase analyzer appears to work fine.

To verify that the field 'address' is using the test_lowercase analyzer:

curl -XGET 'http://localhost:9200/test/_analyze?field=address&pretty' -d '
Beijing China
'
{
  "tokens" : [ {
    "token" : "\nbeijing china\n",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 0
  } ]
}

curl -XGET 'http://localhost:9200/test/_analyze?field=name&pretty' -d '
Beijing China
'
{
  "tokens" : [ {
    "token" : "beijing",
    "start_offset" : 1,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 0
  }, {
    "token" : "china",
    "start_offset" : 9,
    "end_offset" : 14,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

As we can see, for the same string 'Beijing China', analyzing with field=address produces a single token 'beijing china', while analyzing with field=name produces two tokens, 'beijing' and 'china'. So it seems the address field is using my custom analyzer 'test_lowercase'.

Insert a document into the test index to see if the analyzer works for documents

curl -XPUT 'localhost:9200/test/Users/12345?pretty' -d '{"name": "Jinshui Tang",  "address": "Beijing China"}'

Unfortunately, although the document was inserted successfully, the address field was not analyzed correctly. I can't find the document with the following wildcard query:

curl -XGET 'http://localhost:9200/test/Users/_search?pretty' -d '
{
  "query": {
    "wildcard": {
      "address": "*beijing ch*"
    }
  }
}'
{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

List all terms analyzed for the document:

So I ran the following command to see all terms of the document, and found that 'Beijing China' is not in the term vectors at all.

curl -XGET 'http://localhost:9200/test/Users/12345/_termvector?fields=*&pretty'
{
  "_index" : "test",
  "_type" : "Users",
  "_id" : "12345",
  "_version" : 3,
  "found" : true,
  "took" : 2,
  "term_vectors" : {
    "name" : {
      "field_statistics" : {
        "sum_doc_freq" : 2,
        "doc_count" : 1,
        "sum_ttf" : 2
      },
      "terms" : {
        "jinshui" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 0,
            "start_offset" : 0,
            "end_offset" : 7
          } ]
        },
        "tang" : {
          "term_freq" : 1,
          "tokens" : [ {
            "position" : 1,
            "start_offset" : 8,
            "end_offset" : 12
          } ]
        }
      }
    }
  }
}

We can see that the name is correctly analyzed into two terms, 'jinshui' and 'tang', but the address is missing.

Can anyone please help? Is there anything I am missing?

Thanks a lot!

cbuescher · October 15, 2015, 10:17am

Hi,

Thanks for the question. I think your problem comes from a slight misunderstanding of the pattern analyzer you use in the example. Note that the pattern parameter specifies the regex used for splitting the text into tokens, not a regex that matches the tokens themselves (docs).

In your example you specify a pattern that matches the whole input. Since the pattern marks what separates tokens, an address string in the indexed document (which has no line breaks) is matched completely and treated as one big separator, so not even one token is produced.

Note that in your analyzer test example you have line breaks:

curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d '
Beijing China
'
{
  "tokens" : [ {
    "token" : "\nbeijing china\n",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 0
  } ]
}

You can see here that the token contains the line breaks you entered in your request on the command line.

If, however, you send the request on a single line, you can see that no tokens are produced (without line breaks, the whole input matches your token separation pattern):

curl -XGET 'localhost:9200/test/_analyze?analyzer=test_lowercase&pretty' -d 'Beijing China'
{
  "tokens" : [ ]
}

This, however, is what the analyzer sees when you index document fields.
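
To make the splitting behaviour concrete, here is a rough Python sketch of what happens. It is not Elasticsearch code: Python's re module stands in for the Java regex engine, and pattern_analyze is a made-up helper that splits on the pattern, drops empty pieces and lowercases the rest. For this particular pattern I would expect both engines to behave the same way.

```python
import re


def pattern_analyze(text, pattern=r"^.*$"):
    """Hypothetical stand-in for a pattern analyzer.

    The pattern marks the separators; whatever is left between the
    separators becomes a lowercased token.
    """
    pieces = re.split(pattern, text)
    return [piece.lower() for piece in pieces if piece]


# No line breaks: the pattern matches the whole input, so everything is
# consumed as a separator and no token is left.
print(pattern_analyze("Beijing China"))      # → []

# With surrounding line breaks the pattern does not match at all,
# so nothing is split and the input comes back as one token.
print(pattern_analyze("\nBeijing China\n"))  # → ['\nbeijing china\n']
```

The second case only yields a token because `.` does not match a line break, so `^.*$` cannot span the input and nothing gets split off.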

If instead of using the pattern analyzer, you use a custom analyzer with a keyword tokenizer followed by a lowercase filter, the example doc you indexed gets returned by your search:

{
  "analysis": {
    "analyzer": {
      "test_lowercase": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": ["lowercase"]
      }
    }
  }
}
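
As a rough illustration of why this works with your wildcard query, here is a small Python sketch. Again, this is only a model and not Elasticsearch code: keyword_lowercase_analyze is a made-up helper, and fnmatchcase is just a stand-in for wildcard matching against the indexed term.

```python
from fnmatch import fnmatchcase


def keyword_lowercase_analyze(text):
    """Hypothetical model of the custom analyzer.

    The keyword tokenizer emits the whole input as a single token and the
    lowercase filter lowercases it.
    """
    return [text.lower()]


terms = keyword_lowercase_analyze("Beijing China")
print(terms)  # → ['beijing china']

# The wildcard pattern is compared against the indexed term as-is.
print(any(fnmatchcase(term, "*beijing ch*") for term in terms))  # → True
print(any(fnmatchcase(term, "*Beijing Ch*") for term in terms))  # → False
```

Keep in mind that wildcard queries are not analyzed, so the query text itself has to be lowercase to match the lowercased term.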

jinshui.tang · October 19, 2015, 1:40am

Thanks a lot, you explained why my analyzer was not working. I have switched to the keyword tokenizer and lowercase filter, which solved my problem. Thanks again.

softwaredoug · October 19, 2015, 2:00am

Jinshui, shameless plug: we created a tool, elyzer, to help debug these sorts of problems with custom analyzers. It might be helpful to you in the future.
