Search analyzer not working

Hi, I've set up a custom analyzer, but it doesn't seem to work when searching.

Elasticsearch version: 7.0.1

Create index:

PUT /analyze_email
{
    "settings": {
        "number_of_shards": "1",
        "number_of_replicas": "0",
        "analysis": {
            "filter": {
                "email": {
                    "type": "pattern_capture",
                    "preserve_original": "true",
                    "patterns": [
                        "([^@]+)",
                        "(\\p{L}+)",
                        "(\\d+)",
                        "@(.+)",
                        "(@)"
                    ]
                }
            },
            "analyzer": {
                "email_analyzer": {
                    "tokenizer": "uax_url_email",
                    "filter": [
                        "lowercase",
                        "email",
                        "unique"
                    ]
                }
            },
            "normalizer": {
                "lowercase_normalizer": {
                    "type": "custom",
                    "filter": [
                        "lowercase"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "email_address": {
                "type": "text",
                "analyzer": "email_analyzer",
                "norms": "false",
                "search_analyzer": "email_analyzer", 
                "fields": {
                    "raw": {
                        "type": "keyword",
                        "normalizer": "lowercase_normalizer"
                    }
                }
            }
        }
    }
}

Testing email_analyzer shows it works:

GET analyze_email/_analyze?pretty
{
  "field": "email_address",
  "text": "jinliantest@gmail.com"
}

Result:

{
  "tokens" : [
    {
      "token" : "jinliantest@gmail.com",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<EMAIL>",
      "position" : 0
    },
    {
      "token" : "jinliantest",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<EMAIL>",
      "position" : 0
    },
    {
      "token" : "@",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<EMAIL>",
      "position" : 0
    },
    {
      "token" : "gmail.com",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<EMAIL>",
      "position" : 0
    },
    {
      "token" : "gmail",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<EMAIL>",
      "position" : 0
    },
    {
      "token" : "com",
      "start_offset" : 0,
      "end_offset" : 21,
      "type" : "<EMAIL>",
      "position" : 0
    }
  ]
}

Then I index a document:

PUT analyze_email/_doc/1
{
  "email_address": "jinliantest@gmail.com"
}

But when I search for "@", it doesn't return anything. Shouldn't it return doc 1?

GET analyze_email/_search
{
  "query": {
    "match": {"email_address": "@"}
  }
}

Result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Thanks for trying to format your post, much appreciated.
Please use the code icon </> rather than the citation icon. I updated your post.

Just try:

GET analyze_email/_analyze
{
  "analyzer": "email_analyzer",
  "text": ["@"]
}

You will get:

{
  "tokens" : [ ]
}

That explains why you can't find anything.

Note that this works though:

GET analyze_email/_search
{
  "query": {
    "term": {
      "email_address": "@"
    }
  }
}
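That works because the term query skips analysis entirely, so the literal "@" term that the index-time analyzer produced matches. If you ever need exact-match lookups on the whole address, the raw keyword subfield from your mapping can be queried the same way; its lowercase_normalizer is applied to the query term as well, so case shouldn't matter (a sketch, untested against your cluster — the mixed-case value is just for illustration):

```
GET analyze_email/_search
{
  "query": {
    "term": {
      "email_address.raw": "JinlianTest@Gmail.com"
    }
  }
}
```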

Is that because the match query analyzes the search value into a list of tokens?
And will the analyzer work if email_address is a list?

Yes.

It does not change the behavior in any way.
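For illustration (the second address here is made up), indexing an array looks like this:

```
PUT analyze_email/_doc/2
{
  "email_address": ["jinliantest@gmail.com", "other@example.org"]
}
```

Each element of the array is run through email_analyzer independently, so a match query for e.g. "gmail" would still find the document.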

I tried this request; it doesn't produce an "@" token:

GET analyze_email/_analyze?pretty
{
  "field": "email_address",
  "text": "com@s.s"
}

Result:

{
  "tokens" : [
    {
      "token" : "com",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "s.s",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "s",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Is this because I didn't define the patterns correctly, or is it caused by 'uax_url_email'?
Expected tokens: com, s.s, s, @, com@s.s

I don't know; I'm not good at regex. Maybe it's the pattern_capture filter. Did you try to debug how the analyzer transforms your text?

See https://www.elastic.co/guide/en/elasticsearch/reference/current/_explain_analyze.html

Thanks.
I debugged the analyzer.
It's because the uax_url_email tokenizer drops some terms. I should try another way.

GET _analyze
{
  "tokenizer" : "uax_url_email",
  "text" : "com@s.s",
  "explain" : true
}

Response:

{
  "detail" : {
    "custom_analyzer" : true,
    "charfilters" : [ ],
    "tokenizer" : {
      "name" : "uax_url_email",
      "tokens" : [
        {
          "token" : "com",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "<ALPHANUM>",
          "position" : 0,
          "bytes" : "[63 6f 6d]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "s.s",
          "start_offset" : 4,
          "end_offset" : 7,
          "type" : "<ALPHANUM>",
          "position" : 1,
          "bytes" : "[73 2e 73]",
          "positionLength" : 1,
          "termFrequency" : 1
        }
      ]
    },
    "tokenfilters" : [ ]
  }
}
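One possible alternative, assuming each field value is a single address: use the keyword tokenizer so the pattern_capture filter sees the whole string, letting the "(@)" pattern emit its token. A sketch, untested — the index name is made up, and the filter is copied from the original settings:

```
PUT /analyze_email_v2
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": true,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "(@)"
          ]
        }
      },
      "analyzer": {
        "email_analyzer": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "email",
            "unique"
          ]
        }
      }
    }
  }
}
```

With the whole input "com@s.s" reaching the filter as one token, the capture groups should yield something close to the expected list above (com@s.s, com, s.s, s, @) after the unique filter deduplicates.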

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.