Whitespace tokenizer doesn't work as expected

Marc_Muster · November 21, 2018, 10:13am

Hi there,

I have a problem using the whitespace tokenizer in a custom analyzer.
Here are some example documents, together with the highlights showing what matched.

Document 1:

"_source": {
  "categoryId": 11972638,
  "categoryNames": [
    "DVD-Koffer",
    "CD-Koffer",
    "CD-Aufbewahrung",
    "DVD-Aufbwahrung",
    "DVD-Ordner",
    "EDV-DVD-Aufbewahrung",
    "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung",
    "Multimediabox"
  ],
  "lvl3Id": 11972638
},
"highlight": {
  "categoryNames": [
    "<em>DVD</em>-Koffer",
    "<em>DVD</em>-Aufbwahrung",
    "<em>DVD</em>-<em>Ordner</em>",
    "EDV-<em>DVD</em>-Aufbewahrung",
    "CD&<em>DVD</em> Aufbewahrung"
  ]
}

Document 2:

"_source": {
  "categoryId": 1170664,
  "categoryNames": [
    "CD-Ordner",
    "CD&DVD Ordner"
  ],
  "lvl3Id": 11972638
},
"highlight": {
  "categoryNames": [
    "CD-<em>Ordner</em>",
    "CD&<em>DVD</em> <em>Ordner</em>"
  ]
}

What I search for is "DVD-Ordner":

{
  "query": {
    "match": {
      "categoryNames": "DVD-Ordner"
    }
  },
  "highlight": {
    "fields": {
      "categoryNames": {}
    }
  }
}

What I want to find is document 1, because it has exactly "DVD-Ordner" in its names. What I find is document 2. The search should always be case-insensitive.

The standard analyzer ignores characters like "-", so I used a custom analyzer with the whitespace tokenizer, which (according to the documentation) does exactly what I am looking for: it splits the text into terms on whitespace characters only, not on any other characters.
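To make the expected difference concrete, here is a minimal Python sketch. It is not how Elasticsearch or Lucene implements tokenization; the two helper functions are hypothetical stand-ins that only approximate "split on whitespace, then lowercase" and "split on any non-word character, then lowercase". Stemming is left out on purpose.

```python
import re


def whitespace_tokens(text):
    """Rough stand-in for a whitespace tokenizer plus a lowercase filter."""
    return [tok.lower() for tok in text.split()]


def word_tokens(text):
    """Rough stand-in for an analyzer that also splits on punctuation."""
    return [tok.lower() for tok in re.findall(r"\w+", text)]


for text in ["DVD-Ordner", "DVD Ordner", "CD&DVD Ordner"]:
    # Print both tokenizations side by side for comparison.
    print(text, whitespace_tokens(text), word_tokens(text))
```

Under these assumptions, "DVD-Ordner" stays a single token when splitting on whitespace only, while "DVD Ordner" becomes two tokens, which is the distinction the search is supposed to make.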

The analyzer for the index is the following:

"analysis": {
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": ["lowercase", "my_german_stemmer"]
    }
  },
  "filter": {
    "my_german_stemmer": {
      "type": "stemmer",
      "name": "german"
    }
  }
}
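One way to check what an analyzer like this produces is the `_analyze` API, which accepts either an `analyzer` name or a `field` name together with some `text`. The sketch below only builds the two request bodies in Python and prints them; it sends nothing. The index name `my_index` is a placeholder assumption (the real index name is not given in this post), and `analyze_requests` is a hypothetical helper.

```python
import json


def analyze_requests(index, analyzer, field, text):
    """Build two _analyze requests: one by analyzer name, one by field name."""
    endpoint = f"GET /{index}/_analyze"
    return [
        # Shows the tokens produced by the named custom analyzer.
        (endpoint, {"analyzer": analyzer, "text": text}),
        # Shows the tokens produced by whatever analyzer the field is mapped to.
        (endpoint, {"field": field, "text": text}),
    ]


# "my_index" is a placeholder; substitute the real index name.
for line, body in analyze_requests("my_index", "my_analyzer", "categoryNames", "DVD-Ordner"):
    print(line)
    print(json.dumps(body, indent=2))
```

If the two requests return different tokens, that would suggest the field is not actually analyzed with `my_analyzer`.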

But the tokens generated by the analyzer are not what I expected.
GET .../_termvectors for document 1:

"terms": {
        "aufbewahrung": {
          "term_freq": 4,
          "tokens": [
            ...
          ]
        },
        "aufbwahrung": {
          "term_freq": 1,
          "tokens": [
            ...
          ]
        },
        "cd": {
          "term_freq": 4,
          "tokens": [
            ...
          ]
        },
        "dvd": {
          "term_freq": 5,
          "tokens": [
            ...
          ]
        },
        "edv": {
          "term_freq": 2,
          "tokens": [
            {
              ...
            },
            {
              ...
            }
          ]
        },
        "koffer": {
          "term_freq": 2,
          "tokens": [
            {
             ...
            },
            {
              ...
            }
          ]
        },
        "multimediabox": {
          "term_freq": 1,
          "tokens": [
            {
              ...
            }
          ]
        },
        "ordner": {
          "term_freq": 1,
          "tokens": [
            {
              ...
            }
          ]
        }
      }

First, you can see that "dvd-ordner" is not a single term (which is what I expected); it is split into "dvd" and "ordner". So the "-" sign is ignored, just as with the standard analyzer.

I can't figure out what I'm doing wrong.

I just want a "simple" search where "DVD-Ordner" is a different search than "DVD Ordner".
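As a rough cross-check of the term vectors above, the following Python sketch recomputes term frequencies for document 1's `categoryNames` in two ways and compares them with the `term_freq` values shown in the output. The splitting functions are hypothetical approximations, not the actual Lucene tokenizers, and stemming is not modelled.

```python
import re
from collections import Counter

# categoryNames of document 1, copied from the _source above.
CATEGORY_NAMES = [
    "DVD-Koffer", "CD-Koffer", "CD-Aufbewahrung", "DVD-Aufbwahrung",
    "DVD-Ordner", "EDV-DVD-Aufbewahrung", "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung", "Multimediabox",
]

# term_freq values copied from the _termvectors output above.
OBSERVED_TERM_FREQ = {
    "aufbewahrung": 4, "aufbwahrung": 1, "cd": 4, "dvd": 5,
    "edv": 2, "koffer": 2, "multimediabox": 1, "ordner": 1,
}


def split_on_whitespace(names):
    """Count lowercase terms when splitting on whitespace only."""
    return Counter(tok for name in names for tok in name.lower().split())


def split_on_non_word(names):
    """Count lowercase terms when splitting on every non-word character."""
    return Counter(tok for name in names for tok in re.findall(r"\w+", name.lower()))


print("dvd-ordner" in split_on_whitespace(CATEGORY_NAMES))
print(dict(split_on_non_word(CATEGORY_NAMES)) == OBSERVED_TERM_FREQ)
```

Under these assumptions, the observed frequencies line up with splitting on punctuation rather than on whitespace only. That hints (without proving it) that the indexed terms may not have been produced by the whitespace-based analyzer at all.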

Marc_Muster · November 21, 2018, 12:14pm

Solved by using the default analyzer and only lowercase.

system · December 19, 2018, 12:14pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
