Whitespace tokenizer doesn't work as expected


(Marc Muster) #1

Hi there,

I have a problem using the whitespace tokenizer in a custom analyzer.
Here are two example documents, each with the highlights showing what matched.

Let's say I have document 1:

"_source": {
  "categoryId": 11972638,
  "categoryNames": [
    "DVD-Koffer",
    "CD-Koffer",
    "CD-Aufbewahrung",
    "DVD-Aufbwahrung",
    "DVD-Ordner",
    "EDV-DVD-Aufbewahrung",
    "EDV-CD-Aufbewahrung",
    "CD&DVD Aufbewahrung",
    "Multimediabox"
  ],
  "lvl3Id": 11972638
},
"highlight": {
  "categoryNames": [
    "<em>DVD</em>-Koffer",
    "<em>DVD</em>-Aufbwahrung",
    "<em>DVD</em>-<em>Ordner</em>",
    "EDV-<em>DVD</em>-Aufbewahrung",
    "CD&<em>DVD</em> Aufbewahrung"
  ]
}

and document 2:

"_source": {
  "categoryId": 1170664,
  "categoryNames": [
    "CD-Ordner",
    "CD&DVD Ordner"
  ],
  "lvl3Id": 11972638
},
"highlight": {
  "categoryNames": [
    "CD-<em>Ordner</em>",
    "CD&<em>DVD</em> <em>Ordner</em>"
  ]
}

What I search for is: "DVD-Ordner"

"query": {
  "match": {
    "categoryNames": "DVD-Ordner"
  }
},
"highlight": {
  "fields": {"categoryNames": {}}
}

I want to find document 1, because it contains exactly "DVD-Ordner" in its names. What I get is document 2. The search should always be case-insensitive.

The standard analyzer ignores characters like "-", so I used a custom analyzer with the whitespace tokenizer, which (according to the documentation) does exactly what I am looking for: it splits the text into terms on whitespace characters only, not on any other characters.
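
To make that expectation concrete, here is a rough Python sketch of the two behaviours. It is not Lucene code: whitespace_tokens, word_tokens and matches are hypothetical helpers, word_tokens only approximates the standard analyzer by treating every run of word characters as a token, the German stemmer is left out, and the matching rule assumes the match query's default OR operator.

```python
import re

def whitespace_tokens(text):
    # Approximates tokenizer "whitespace" + filter "lowercase":
    # split on whitespace only, so "-" and "&" stay inside tokens.
    return [t.lower() for t in text.split()]

def word_tokens(text):
    # Rough stand-in for the standard analyzer (an assumption, not
    # Lucene's actual algorithm): each run of word characters is a token.
    return [t.lower() for t in re.findall(r"\w+", text)]

def matches(query, values, tokenize):
    # match query with the default OR operator: one shared term is enough.
    query_terms = set(tokenize(query))
    field_terms = {t for v in values for t in tokenize(v)}
    return bool(query_terms & field_terms)

# Shortened categoryNames lists taken from the two documents above.
doc1 = ["DVD-Koffer", "DVD-Ordner", "CD&DVD Aufbewahrung"]
doc2 = ["CD-Ordner", "CD&DVD Ordner"]

print(whitespace_tokens("DVD-Ordner"))                 # ['dvd-ordner']
print(word_tokens("DVD-Ordner"))                       # ['dvd', 'ordner']
print(matches("DVD-Ordner", doc2, word_tokens))        # True
print(matches("DVD-Ordner", doc2, whitespace_tokens))  # False
print(matches("DVD-Ordner", doc1, whitespace_tokens))  # True
```

Under the splitting behaviour document 2 matches because it shares "dvd" and "ordner" with the query; under whitespace-only splitting it shares nothing with "dvd-ordner".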

The analyzer for the index is the following:

"analysis": {
  "analyzer": {
    "my_analyzer": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": ["lowercase", "my_german_stemmer"]
    }
  },
  "filter": {
    "my_german_stemmer": {
      "type": "stemmer",
      "name": "german"
    }
  }
}
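
These settings only define the analyzer. A field uses it only if the mapping names it, or if it is registered as the index default. The post does not show the mapping, so the following is a sketch under assumptions: the mappings section is hypothetical, and it uses the layout of newer Elasticsearch versions ("text" type, no mapping type level), while older versions use "string" and an extra type name. It is written as a Python dict so the body can be inspected before it is sent.

```python
import json

# Hypothetical index-creation body: the analysis section is copied from
# the post, the mappings section is an assumption about the field setup.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["lowercase", "my_german_stemmer"],
                }
            },
            "filter": {
                "my_german_stemmer": {"type": "stemmer", "name": "german"}
            },
        }
    },
    "mappings": {
        "properties": {
            # Without this "analyzer" entry the field falls back to the
            # index default (the standard analyzer unless an analyzer
            # named "default" is defined).
            "categoryNames": {"type": "text", "analyzer": "my_analyzer"}
        }
    },
}

print(json.dumps(index_body, indent=2))
```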

But the tokens generated by the analyzer are not what I expected.
GET .../_termvectors for document 1:

"terms": {
        "aufbewahrung": {
          "term_freq": 4,
          "tokens": [
            ...
          ]
        },
        "aufbwahrung": {
          "term_freq": 1,
          "tokens": [
            ...
          ]
        },
        "cd": {
          "term_freq": 4,
          "tokens": [
            ...
          ]
        },
        "dvd": {
          "term_freq": 5,
          "tokens": [
            ...
          ]
        },
        "edv": {
          "term_freq": 2,
          "tokens": [
            {
              ...
            },
            {
              ...
            }
          ]
        },
        "koffer": {
          "term_freq": 2,
          "tokens": [
            {
             ...
            },
            {
              ...
            }
          ]
        },
        "multimediabox": {
          "term_freq": 1,
          "tokens": [
            {
              ...
            }
          ]
        },
        "ordner": {
          "term_freq": 1,
          "tokens": [
            {
              ...
            }
          ]
        }
      }

First, you can see that "dvd-ordner" is not a single term (which is what I expected); it is split into "dvd" and "ordner". So the "-" sign is ignored, just as with the standard analyzer.
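
One way to narrow this down is the _analyze API, which accepts either an analyzer name or a field name. The sketch below only builds the two request bodies in Python; the index path is left as a placeholder, and nothing is sent. If the first request returns one hyphenated token and the second returns two tokens, the field is probably not using my_analyzer at all.

```python
import json

# Body for GET /<index>/_analyze that names the analyzer directly.
by_analyzer = {"analyzer": "my_analyzer", "text": "DVD-Ordner"}

# Body for the same endpoint that uses whatever analyzer the mapping
# assigns to the field.
by_field = {"field": "categoryNames", "text": "DVD-Ordner"}

for body in (by_analyzer, by_field):
    print(json.dumps(body))
```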

I can't figure out what I'm doing wrong.

I just want a "simple" search where "DVD-Ordner" is a different search than "DVD Ordner".


(Marc Muster) #2

Solved by using the default analyzer and only lowercase.
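
The final settings are not shown here. One possible reading, stated purely as an assumption, is that the analyzer was registered under the reserved name "default", which Elasticsearch applies to fields that do not specify an analyzer in the mapping, with the whitespace tokenizer and only the lowercase filter. A sketch of such settings as a Python dict:

```python
import json

# Assumed reconstruction, not the author's actual settings.
settings = {
    "analysis": {
        "analyzer": {
            # "default" is the name Elasticsearch looks up for fields
            # that do not specify an analyzer in the mapping.
            "default": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase"],
            }
        }
    }
}

print(json.dumps(settings, indent=2))
```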