Keyword analyzer but allow redundant white spaces

Youxu · December 6, 2017, 1:07am

My requirement is as follows. For example, given this document:

POST /index/type/1
{
"title": "Hello Elasticsearch"
}

I want to return the doc only when the title matches exactly, while allowing redundant whitespace between words.
For example,
searching "Hello[space]elasticsearch" and "Hello[space][space][space]Elasticsearch" should return the document, but "Hello" or "Elasticsearch" alone should not.

Any suggestions for which analyzer/tokenizer/filter I should use?

cbuescher · December 6, 2017, 10:32am

Hi,

you can use a combination of the "pattern_replace" and "trim" token filters to remove redundant whitespace in your input fields and also in the query. Unfortunately you cannot use those filters as "normalizers" for "keyword" fields, but you can define the field as a "text" field and use the "keyword" tokenizer to get almost the same effect. Here is what I mean:

PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [],
          "filter": [
            "whitespace_normalization",
            "trim"
          ]
        }
      },
      "filter": {
        "whitespace_normalization": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

The pattern_replace filter should replace each run of whitespace characters with a single space, and the trim filter then removes any leading or trailing space that is left. This happens at both index and query time for this field. So if you index:

PUT /index/type/1
{
  "title" : "Elasticsearch  In  Action   "
}

You should be able to query it with a different whitespace distribution as well:

POST /index/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch    In    Action  "
    }
  }
}

You can check that the token produced by the analysis contains only a single space between words:

GET /index/_analyze
{
  "analyzer": "my_analyzer", 
  "text" : "   Elasticsearch    In    Action  "
}
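
To make the effect of the filter chain easier to see without a running cluster, here is a rough Python sketch of what "my_analyzer" does to a string. It is only an approximation and not Elasticsearch code: the function names analyze_title and matches are made up for illustration, and Python's regex and strip() may treat some unusual whitespace characters differently from the Java-based filters.

```python
import re

# Stand-in for the "whitespace_normalization" pattern_replace filter.
# re.ASCII limits \s to ASCII whitespace (space, tab, newline, ...).
_WHITESPACE_RUN = re.compile(r"\s+", re.ASCII)


def analyze_title(text):
    """Approximate my_analyzer: keyword tokenizer, pattern_replace, trim.

    Hypothetical helper for illustration only.
    """
    token = text                             # keyword tokenizer: whole input is one token
    token = _WHITESPACE_RUN.sub(" ", token)  # collapse each whitespace run to one space
    token = token.strip()                    # trim: drop leading/trailing whitespace
    return [token]


def matches(indexed_title, query_text):
    """A match query on this field hits only if both sides yield the same single token."""
    return analyze_title(indexed_title) == analyze_title(query_text)


if __name__ == "__main__":
    print(analyze_title("   Elasticsearch    In    Action  "))
    print(matches("Hello Elasticsearch", "Hello   Elasticsearch"))
    print(matches("Hello Elasticsearch", "Hello"))
```

Because the whole title becomes a single token, a query for only "Hello" or only "Elasticsearch" produces a different token and does not match. Note that the analyzer above has no lowercase filter, so the comparison stays case-sensitive.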

Hope this helps.

Youxu · December 18, 2017, 4:34pm

Thanks! It works as I expected!

system · January 15, 2018, 4:35pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
