Question on _analyze handler defaults

mdomans · September 8, 2015, 6:56am

Something puzzles me about how _analyze actually works.

Here's an example of a request I'm running:

GET /_analyze?char_filter=html_strip
{"I really can swim with my gill at the CafÃ!   "}

What puzzles me in the output is that the ! character is missing from the result, even though ! is not an HTML character. What's more, I'm actually seeing tokens there, which suggests a tokenizer is being used.

My question: what are the default filters and tokenizer for the _analyze handler, and how can I turn them off?

Full output here:

{
   "tokens": [
      {
         "token": "i",
         "start_offset": 2,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "really",
         "start_offset": 4,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "can",
         "start_offset": 11,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "swim",
         "start_offset": 15,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "with",
         "start_offset": 20,
         "end_offset": 24,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "my",
         "start_offset": 25,
         "end_offset": 27,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "gill",
         "start_offset": 28,
         "end_offset": 32,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "at",
         "start_offset": 33,
         "end_offset": 35,
         "type": "<ALPHANUM>",
         "position": 8
      },
      {
         "token": "the",
         "start_offset": 36,
         "end_offset": 39,
         "type": "<ALPHANUM>",
         "position": 9
      },
      {
         "token": "cafã",
         "start_offset": 40,
         "end_offset": 44,
         "type": "<ALPHANUM>",
         "position": 10
      }
   ]
}

colings86 · September 8, 2015, 7:08am

You cannot use a token filter without a tokenizer. The input to a token filter is a stream of tokens rather than a string, so you need the tokenizer to create this stream of tokens to pass to your token filter. (Strictly speaking, html_strip is a char filter, which runs on the raw text before tokenization, but the analysis chain still needs a tokenizer to produce tokens.) The default tokenizer is the standard tokenizer, which splits text on whitespace and most punctuation and discards both. This is why you are seeing the ! disappear in your example above.
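
To make the point about the tokenizer concrete, here is a toy sketch in Python. It is not Elasticsearch code: the function names are made up for illustration, and the regex is only a rough stand-in for the standard tokenizer's real Unicode segmentation rules. It shows why punctuation survives a keyword-style tokenizer but not a standard-style one, and why the tokens in the output above are lowercase.

```python
import re


def toy_standard_tokenizer(text):
    # Rough approximation (assumption): keep runs of word characters,
    # dropping whitespace and punctuation such as "!".
    return re.findall(r"\w+", text)


def toy_keyword_tokenizer(text):
    # Emits the whole input as a single, unmodified token.
    return [text]


def toy_lowercase_filter(tokens):
    # Token filters consume a list of tokens, not a raw string.
    return [token.lower() for token in tokens]


text = "I really can swim!"
standard_like = toy_lowercase_filter(toy_standard_tokenizer(text))
keyword_like = toy_keyword_tokenizer(text)

print(standard_like)  # ['i', 'really', 'can', 'swim']
print(keyword_like)   # ['I really can swim!']
```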

If you want to test the html_strip char filter in isolation, I would suggest adding tokenizer=keyword to your URL query parameters. The keyword tokenizer takes the input string and outputs a single token containing the full, unmodified text of the input string. For other tokenizers you might want to use instead, see here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
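
As a small sketch of that suggestion, the snippet below only builds the request URL and does not contact a cluster. The base address http://localhost:9200 and the helper name analyze_url are assumptions for illustration, and the query-string style matches the request shown in this thread; newer Elasticsearch versions may expect these settings in a JSON request body instead.

```python
from urllib.parse import urlencode


def analyze_url(base="http://localhost:9200", **params):
    # base is a hypothetical local node address; params become the
    # query string of the _analyze request.
    return f"{base}/_analyze?{urlencode(params)}"


url = analyze_url(tokenizer="keyword", char_filter="html_strip")
print(url)
# http://localhost:9200/_analyze?tokenizer=keyword&char_filter=html_strip
```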

Hope that helps

magnusbaeck · September 8, 2015, 7:09am

By default it'll use the standard analyzer but you can override that by explicitly setting another analyzer (perhaps whitespace is what you're looking for?) and/or overriding which tokenizers and filters to use.

mdomans · September 8, 2015, 7:23am

@colings86 thank you - that's what I was looking for. And just in case somebody is already doing something similar - I'm writing a simple piece of code to run text against permutations of a set of filters/char_filters/tokenizers to help me with writing custom analyzers.
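
One way such a tool might enumerate its test cases is sketched below. This is not the poster's code: the component lists, the helper name analyze_queries, and the filters parameter name are assumptions, and for simplicity it picks one component of each kind per request rather than true permutations of several filters. It only generates query strings; sending them to _analyze is left out.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical candidate components to try against the same text.
char_filters = ["html_strip"]
tokenizers = ["standard", "keyword", "whitespace"]
token_filters = ["lowercase", "asciifolding"]


def analyze_queries(char_filters, tokenizers, token_filters):
    # Yield one query string per (char filter, tokenizer, token filter)
    # combination; "filters" as a parameter name is an assumption.
    for cf, tok, tf in product(char_filters, tokenizers, token_filters):
        yield urlencode({"char_filter": cf, "tokenizer": tok, "filters": tf})


queries = list(analyze_queries(char_filters, tokenizers, token_filters))
for query in queries:
    print("/_analyze?" + query)
```

With the lists above this produces six requests, one per combination.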
