Keyword analyzer but allow redundant white spaces


(Xudong You) #1

My requirement is as follows. Given a document such as:

POST /index/type/1
{
"title": "Hello Elasticsearch"
}

I want to return the doc only when the title matches exactly, while allowing redundant white space between words.
For example,
Searching "Hello[space]elasticsearch" and "Hello[space][space][space]Elasticsearch" should return the document, but "Hello" or "Elasticsearch" alone should not.

Any suggestions on which analyzer, tokenizer, or filter I should use?


(Christoph) #2

Hi,

you can use a combination of the "pattern_replace" and "trim" token filters to remove redundant whitespace both in your input fields and in the query. Unfortunately you cannot use those filters as "normalizers" for "keyword" fields, but you can define the field as a "text" field and use the "keyword" tokenizer, which emits the whole input as a single token, to get almost the same effect. Here is what I mean:

PUT index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "char_filter": [],
          "filter": [
            "whitespace_normalization",
            "trim"
          ]
        }
      },
      "filter": {
        "whitespace_normalization": {
          "type": "pattern_replace",
          "pattern": "\\s+",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

The pattern_replace filter should replace each run of whitespace characters with a single space, and the trim filter then removes any leading or trailing whitespace. The analyzer does this at both index and query time for this field. So if you index:

PUT /index/type/1
{
  "title" : "Elasticsearch  In  Action   "
}

You should be able to query it with a different whitespace distribution as well:

POST /index/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch    In    Action  "
    }
  }
}

You can check that the single token produced by the analysis contains only one space between words and no leading or trailing whitespace:

GET /index/_analyze
{
  "analyzer": "my_analyzer", 
  "text" : "   Elasticsearch    In    Action  "
}
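
To see why differently spaced inputs end up as the same token, the two filter steps can be approximated outside Elasticsearch. The following is only a rough sketch in Python, not what Elasticsearch runs internally: the function name normalize_title is made up for illustration, and Python's regex and strip() may treat some unusual Unicode whitespace characters differently from the Java-based filters.

```python
import re


def normalize_title(text: str) -> str:
    """Approximate my_analyzer (hypothetical helper, for illustration only).

    The keyword tokenizer keeps the whole input as one token, so the
    two token filters effectively operate on the full string.
    """
    # Step 1: mimic the pattern_replace filter ("\\s+" -> " ").
    collapsed = re.sub(r"\s+", " ", text)
    # Step 2: mimic the trim filter (drop leading/trailing whitespace).
    return collapsed.strip()


indexed = normalize_title("Elasticsearch  In  Action   ")
queried = normalize_title("Elasticsearch    In    Action  ")

# Both sides reduce to the same single token, so the match query hits.
print(indexed == queried)

# A partial title yields a different token, so it does not match.
print(normalize_title("Elasticsearch") == indexed)
```

Note that nothing in this analyzer changes letter case, so in this sketch, as in the analyzer above, "Elasticsearch In Action" and "elasticsearch in action" remain different tokens.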

Hope this helps.


(Xudong You) #3

Thanks! It works as I expected!

