Question on _analyze handler defaults


#1

Something puzzles me about how _analyze actually works.

Here's an example of a request I'm running:

GET /_analyze?char_filter=html_strip
{"I really can swim with my gill at the CafÃ!   "}

What puzzles me in the output is that the ! character is missing from the result, even though ! is not an HTML character. What's more, I'm seeing tokens there, which suggests a tokenizer is being used.

My question: what are the default filters and tokenizer for the _analyze handler, and how can I turn them off?

Full output here:

{
   "tokens": [
      {
         "token": "i",
         "start_offset": 2,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "really",
         "start_offset": 4,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "can",
         "start_offset": 11,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "swim",
         "start_offset": 15,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "with",
         "start_offset": 20,
         "end_offset": 24,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "my",
         "start_offset": 25,
         "end_offset": 27,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "gill",
         "start_offset": 28,
         "end_offset": 32,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "at",
         "start_offset": 33,
         "end_offset": 35,
         "type": "<ALPHANUM>",
         "position": 8
      },
      {
         "token": "the",
         "start_offset": 36,
         "end_offset": 39,
         "type": "<ALPHANUM>",
         "position": 9
      },
      {
         "token": "cafã",
         "start_offset": 40,
         "end_offset": 44,
         "type": "<ALPHANUM>",
         "position": 10
      }
   ]
}

(Colin Goodheart-Smithe) #2

You cannot use a token filter without a tokenizer. The input to a token filter is a stream of tokens rather than a string, so you need a tokenizer to create the stream of tokens that is passed to your token filter. The default tokenizer is the standard tokenizer, which splits the text into tokens on whitespace and punctuation and discards the whitespace and punctuation. This is why you are seeing the ! disappear in your example above.
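To make the order of the stages concrete, here is a toy Python sketch of the pipeline described above: char filters work on the raw string, the tokenizer turns the string into tokens, and token filters work on the token list. This is not Elasticsearch's implementation. Every function below is a simplified stand-in written for illustration, and the regex-based word splitting only roughly approximates what the standard tokenizer does.

```python
import re

def html_strip(text):
    # Toy char filter: string in, string out. Removes anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", text)

def keyword_tokenizer(text):
    # Toy keyword tokenizer: emits the whole input as a single, unmodified token.
    return [text]

def simple_word_tokenizer(text):
    # Rough stand-in for a word tokenizer: keeps runs of word characters,
    # so whitespace and punctuation such as "!" are dropped.
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    # Toy token filter: token list in, token list out.
    return [t.lower() for t in tokens]

def analyze(text, char_filters, tokenizer, token_filters):
    # Stage 1: char filters transform the raw string.
    for char_filter in char_filters:
        text = char_filter(text)
    # Stage 2: exactly one tokenizer turns the string into tokens.
    tokens = tokenizer(text)
    # Stage 3: token filters transform the token stream.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

sample = "<b>Swim</b> at the Cafe!"
words = analyze(sample, [html_strip], simple_word_tokenizer, [lowercase_filter])
whole = analyze(sample, [html_strip], keyword_tokenizer, [])
print(words)
print(whole)
```

With the word tokenizer the ! is gone from the tokens, while the keyword-style tokenizer keeps it because the text is never split.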

If you want to test the html_strip char filter in isolation, I would suggest adding tokenizer=keyword to your URL query parameters. The keyword tokenizer takes the input string and outputs a single token containing the full, unmodified text of the input string. For other tokenizers you might want to use instead, see here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
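As a rough sketch, the same request can be expressed as a JSON body, which is the form more recent Elasticsearch versions accept for _analyze (the thread itself uses URL query parameters, so check which form your version supports). The helper name build_analyze_body and the sample text are made up for illustration, and the snippet only builds and prints the body rather than sending it.

```python
import json

def build_analyze_body(text, tokenizer="keyword", char_filters=("html_strip",)):
    # Hypothetical helper: assembles a JSON body for POST /_analyze.
    # "tokenizer", "char_filter" and "text" are the body fields used by
    # recent Elasticsearch versions; older versions take query parameters.
    return {
        "tokenizer": tokenizer,
        "char_filter": list(char_filters),
        "text": text,
    }

body = build_analyze_body("I really can swim with my gill at the <b>Cafe</b>!   ")
print("POST /_analyze")
print(json.dumps(body, indent=2))
```

Because the keyword tokenizer does not split the text, the response should contain a single token, which makes it easier to see what the char filter alone changed.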

Hope that helps


(Magnus Bäck) #3

By default it'll use the standard analyzer, but you can override that by explicitly setting another analyzer (perhaps whitespace is what you're looking for?) and/or by overriding which tokenizer and filters to use.


#4

@colings86 thank you - that's what I was looking for. And just in case somebody is already doing something similar - I'm writing a simple piece of code that runs text against permutations of a set of filters/char_filters/tokenizers to help me write custom analyzers.
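A minimal sketch of that idea, assuming the JSON body form of _analyze used by recent Elasticsearch versions. The component lists and the helper names (subsets, analyze_bodies) are illustrative choices, not the actual tool. The sketch enumerates every tokenizer with every subset of the char filters and token filters in a fixed order; if filter ordering matters, itertools.permutations could replace combinations. It only generates request bodies and does not send them.

```python
from itertools import combinations, product

# Illustrative component names; swap in whichever ones you want to compare.
TOKENIZERS = ["standard", "keyword", "whitespace"]
CHAR_FILTERS = ["html_strip"]
TOKEN_FILTERS = ["lowercase", "asciifolding"]

def subsets(items):
    # Yield every subset of items (including the empty one), keeping list order.
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield list(combo)

def analyze_bodies(text):
    # Yield one _analyze request body per tokenizer / char filter / token filter combination.
    for tokenizer, char_filters, token_filters in product(
        TOKENIZERS, subsets(CHAR_FILTERS), subsets(TOKEN_FILTERS)
    ):
        yield {
            "tokenizer": tokenizer,
            "char_filter": char_filters,
            "filter": token_filters,
            "text": text,
        }

bodies = list(analyze_bodies("I really can swim!"))
print(len(bodies))
print(bodies[0])
```

Each body could then be posted to /_analyze and the resulting token lists compared side by side.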

