Tokenizer: whitespace not working with edge_ngram

anoopvalluthadam · February 5, 2018, 7:29am

I'm trying to include special characters in the tokens when using edge n-grams:

    DELETE test
    PUT test
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": [
                "lowercase", "ngram", "asciifolding", "stop"
              ]
            }
          },
          "filter": {
            "ngram": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 20,
              "token_chars": [
                "letter",
                "digit",
                "punctuation",
                "symbol"
              ]
            }
          }
        }
      },
      "mappings": {
        "doc": {
          "properties": {
            "text": {
              "type": "text",
              "analyzer": "my_analyzer",
              "search_analyzer": "simple"
            }
          }
        }
      }
    }
    PUT test/doc/1
    {
      "text": "2 #Quick Foxes lived and died"
    }
    PUT test/doc/2
    {
      "text": "2 #Quick Foxes lived died"
    }
    PUT test/doc/3
    {
      "text": "2 #Quick Foxes lived died and resurrected their wys "
    }
    
    PUT test/doc/6
    {
      "text": "$100 dollars manga #thenga @trump"
    }

When we run this query:

POST test/_refresh
GET test/_search
GET test/doc/_search
{
  "query": {
    "match_phrase": {
      "text": "#Qui"
    }
  }
}

Result is

    {
      "took": 0,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
      }
    }

But when we try this:

GET test/_search
{
  "query": {
    "match_phrase": {
      "text": "fo"
    }
  }
}

Result is

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.0247581,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 1.0247581,
        "_source": {
          "text": "2 #Quick Foxes lived died"
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 0.41531453,
        "_source": {
          "text": "2 #Quick Foxes lived and died"
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "3",
        "_score": 0.41030136,
        "_source": {
          "text": "2 #Quick Foxes lived died and resurrected their wys "
        }
      }
    ]
  }
}

Verifying the analyzer:

GET test/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 #Quick Foxes lived and died"
}

Result

{
  "tokens": [
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "#",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "#q",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "#qu",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "#qui",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "#quic",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "#quick",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    },
    {
      "token": "f",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "fo",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxe",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "foxes",
      "start_offset": 9,
      "end_offset": 14,
      "type": "word",
      "position": 2
    },
    {
      "token": "l",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 3
    },
    {
      "token": "li",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 3
    },
    {
      "token": "liv",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 3
    },
    {
      "token": "live",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 3
    },
    {
      "token": "lived",
      "start_offset": 15,
      "end_offset": 20,
      "type": "word",
      "position": 3
    },
    {
      "token": "d",
      "start_offset": 25,
      "end_offset": 29,
      "type": "word",
      "position": 5
    },
    {
      "token": "di",
      "start_offset": 25,
      "end_offset": 29,
      "type": "word",
      "position": 5
    },
    {
      "token": "die",
      "start_offset": 25,
      "end_offset": 29,
      "type": "word",
      "position": 5
    },
    {
      "token": "died",
      "start_offset": 25,
      "end_offset": 29,
      "type": "word",
      "position": 5
    }
  ]
}

How do I include special characters in the search?

anoopvalluthadam · February 5, 2018, 7:31am

@dadoonet any thoughts?

johtani · February 5, 2018, 8:23am

You specified "search_analyzer" in your mapping,
so the "simple" analyzer is applied to your query.

GET test/_analyze
{
  "analyzer": "simple",
  "text": "#Qui"
}

The index has "#qui", but the query uses "qui", because the "simple" analyzer splits on non-letter characters and drops the "#".
So you cannot get the result you expect.
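
The mismatch can be reproduced outside Elasticsearch with a rough Python sketch. It is only a simplified model, not Elasticsearch's implementation: `index_tokens` and `simple_tokens` are hypothetical helpers, and the "asciifolding" and "stop" filters from the original analyzer are left out.

```python
import re

def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of a token, similar to what an edge_ngram filter emits."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def index_tokens(text):
    """Simplified my_analyzer: whitespace tokenizer, lowercase, edge n-grams.
    Assumption: asciifolding and stop are omitted from this sketch."""
    tokens = set()
    for word in text.split():
        tokens.update(edge_ngrams(word.lower()))
    return tokens

def simple_tokens(text):
    """Simplified "simple" analyzer: keep runs of letters, then lowercase."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

indexed = index_tokens("2 #Quick Foxes lived and died")

print(simple_tokens("#Qui"))   # ['qui']  (the "#" is dropped at search time)
print("#qui" in indexed)       # True     (the index keeps the "#")
print("qui" in indexed)        # False    (so "#Qui" finds nothing)
print("fo" in indexed)         # True     (which is why "fo" does match)
```

The query "fo" works only because it contains letters alone, so both analyzers agree on it.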

anoopvalluthadam · February 5, 2018, 8:35am

I used my_analyzer as the search analyzer as well, but then I get extra results. Which analyzer should I use?

anoopvalluthadam · February 5, 2018, 8:36am

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase", "ngram", "asciifolding", "stop"
          ]
        }
      },
      "filter": {
        "ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "my_analyzer"
        }
      }
    }
  }
}
PUT test/doc/1
{
  "text": "2 #quick Foxes lived and died"
}
PUT test/doc/2
{
  "text": "2 #Quick Foxes lived died"
}
PUT test/doc/3
{
  "text": "2 #Quick Foxes lived died and resurrected their wys "
}
PUT test/doc/6
{
  "text": "$100 dollars manga #thenga @trump"
}
POST test/_refresh
GET test/_search
GET test/doc/_search
{
  "query": {
    "match_phrase": {
      "text": "#qu"
    }
  }
}

Result is

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.9152459,
    "hits": [
      {
        "_index": "test",
        "_type": "doc",
        "_id": "2",
        "_score": 0.9152459,
        "_source": {
          "text": "2 #Quick Foxes lived died"
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "6",
        "_score": 0.7442766,
        "_source": {
          "text": "$100 dollars manga #thenga @trump"
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "1",
        "_score": 0.5388059,
        "_source": {
          "text": "2 #quick Foxes lived and died"
        }
      },
      {
        "_index": "test",
        "_type": "doc",
        "_id": "3",
        "_score": 0.53597397,
        "_source": {
          "text": "2 #Quick Foxes lived died and resurrected their wys "
        }
      }
    ]
  }
}

In this result, the hit

{
  "_index": "test",
  "_type": "doc",
  "_id": "6",
  "_score": 0.7442766,
  "_source": {
    "text": "$100 dollars manga #thenga @trump"
  }
}

is wrong, isn't it?

johtani · February 5, 2018, 8:55am

That is because you use edge n-grams from 1 to 20 characters.

You can see what your query looks like with the "explain" parameter:

GET test/doc/_search?explain=true
{
  "query": {
    "match_phrase": {
      "text": "#qu"
    }
  }
}

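With "my_analyzer" applied to the query, your query becomes "#" or "#q" or "#qu", so any document with a token starting with "#" matches, including document 6 ("#thenga").
I'm not sure what you require from your query...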
How about using "whitespace" tokenizer + "lowercase" for search_analyzer?
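
A small Python sketch of both search analyzers may make the difference concrete. It is a simplified model under stated assumptions, not Elasticsearch itself: the helper names are hypothetical, "asciifolding" and "stop" are omitted, and a single-word query is treated as "any query token matches", following the "#" or "#q" or "#qu" description above.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes of a token, similar to what an edge_ngram filter emits."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def my_analyzer(text):
    """Simplified my_analyzer: whitespace + lowercase + edge n-grams."""
    return [g for word in text.split() for g in edge_ngrams(word.lower())]

def whitespace_lowercase(text):
    """Suggested search analyzer: whitespace tokenizer + lowercase only."""
    return [word.lower() for word in text.split()]

DOCS = {
    "1": "2 #quick Foxes lived and died",
    "2": "2 #Quick Foxes lived died",
    "3": "2 #Quick Foxes lived died and resurrected their wys ",
    "6": "$100 dollars manga #thenga @trump",
}
# Documents are always indexed with the edge n-gram analyzer.
INDEX = {doc_id: set(my_analyzer(text)) for doc_id, text in DOCS.items()}

def search(query, search_analyzer):
    """IDs of documents whose indexed tokens contain any query token."""
    terms = set(search_analyzer(query))
    return sorted(doc_id for doc_id, tokens in INDEX.items() if terms & tokens)

print(my_analyzer("#qu"))                   # ['#', '#q', '#qu']
print(search("#qu", my_analyzer))           # ['1', '2', '3', '6']
print(search("#qu", whitespace_lowercase))  # ['1', '2', '3']
print(search("$10", whitespace_lowercase))  # ['6']
```

With n-grams applied only at index time, the query stays a single term, so document 6 no longer matches "#qu" through the bare "#" token.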

anoopvalluthadam · February 5, 2018, 9:01am

My requirement is something like this. The text will be

$100 dollars manga #thenga
2 #Quick Foxes lived died

When I search for $10, the result should be

$100 dollars manga #thenga

and when I search for ied, the result should be

2 #Quick Foxes lived died

You can treat it as a replacement for a wildcard query.

johtani · February 5, 2018, 9:20am

You should read https://www.elastic.co/guide/en/elasticsearch/guide/current/full-text-search.html first.

Also, ied does not work with edge_ngram, because edge_ngram only emits prefixes of each token.
You will not see ied in the result of the following request:

GET test/_analyze
{
  "field": "text",
  "text": "2 #Quick Foxes lived died"
}

Also, wildcard queries are not a good fit for Elasticsearch, especially patterns that start with a wildcard:
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/_wildcard_and_regexp_queries.html
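
Why ied is missing can be seen with a short Python sketch that compares prefix-only grams with all substrings. Both helpers are hypothetical illustrations rather than Elasticsearch code, and the second one only approximates what an "ngram" filter would emit for a single token.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes only, similar to an edge_ngram filter."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def ngrams(token, min_gram=1, max_gram=20):
    """Every substring whose length is within [min_gram, max_gram]."""
    return [
        token[i:i + n]
        for i in range(len(token))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(token)
    ]

print(edge_ngrams("died"))           # ['d', 'di', 'die', 'died']
print("ied" in edge_ngrams("died"))  # False
print("ied" in ngrams("died"))       # True
print(len(ngrams("died")))           # 10
```

Matching inside a word needs the full set of substrings, which grows quickly with token length (10 grams for a 4-letter word here, against 4 prefixes), so the index size trade-off is worth checking before relying on it.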

anoopvalluthadam · February 5, 2018, 9:20am

Yeah

How about using "whitespace" tokenizer + "lowercase" for search_analyzer?

is the solution.

I am implementing the * (wildcard) concept using ngram and edge_ngram.

