What is the difference between using "keyword" tokenizer and "not_analyzed"

photonic_world_2 · May 24, 2016, 5:31pm

Hi,

What is the difference between using "keyword" tokenizer and "not_analyzed" on the fields?

Does either provide better aggregation performance than the other if the field size is the same (e.g., if I am aggregating on email addresses for a company with 1 million employees)?

Thanks!

nik9000 · May 24, 2016, 6:08pm

not_analyzed is slightly faster at index time.

The keyword tokenizer allows you to use token filters like lowercase.

It shouldn't matter either way for aggregations once you've paid the (comparatively low) price to build the query for any filtering you do before the aggregation.
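To make the two options concrete, here is a minimal sketch of an index-creation body that defines both kinds of field side by side. It assumes Elasticsearch 2.x-style `string` mappings; the type name `employee`, the field names `email_exact` and `email_lower`, and the analyzer name `lowercase_keyword` are made up for illustration.

```python
import json

# Custom analyzer: the keyword tokenizer emits the whole input as one token,
# and the lowercase token filter then normalizes its case.
# "lowercase_keyword" is a hypothetical analyzer name.
index_settings = {
    "analysis": {
        "analyzer": {
            "lowercase_keyword": {
                "type": "custom",
                "tokenizer": "keyword",
                "filter": ["lowercase"],
            }
        }
    }
}

# Two hypothetical fields holding the same kind of value:
# - email_exact is indexed verbatim (no analysis chain at all)
# - email_lower goes through the custom analyzer above
mappings = {
    "employee": {
        "properties": {
            "email_exact": {"type": "string", "index": "not_analyzed"},
            "email_lower": {"type": "string", "analyzer": "lowercase_keyword"},
        }
    }
}

# Body you would send when creating the index.
create_index_body = {"settings": index_settings, "mappings": mappings}
print(json.dumps(create_index_body, indent=2))
```

Both fields produce a single term per value; the only difference in the indexed term is that `email_lower` is case-folded.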

photonic_world_2 · May 24, 2016, 6:35pm

Thanks @nik9000

We are using the keyword tokenizer just so that we can use the lowercase filter.

Follow-up questions I have been wondering about:

Since an aggregation loads all values of the aggregated field into doc_values or field_data:
1-> Does filtering with a query reduce the memory footprint and the time taken to fetch aggregations, apart from the effect of having a large number of buckets?
2-> Is there any difference between using a filter aggregation and a query filter in an aggregation request? Does either change aggregation performance?

photonic_world_2 · May 26, 2016, 4:10pm

I have two mappings: one whose fields are analyzed with 'analyzer: keyword' and another whose fields are 'not_analyzed'. Why, in some instances, does an aggregation on the index with 'index: not_analyzed' fields take less time and less field data space?

I am confused about what the real difference between these is at search time, @nik9000.

nik9000 · May 26, 2016, 7:24pm

Using a query to filter is going to be faster because you never have to load anything from doc values.
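For reference, the two ways of filtering asked about above look roughly like this as search request bodies. This is only a sketch in Elasticsearch 2.x query DSL; the field names `company` and `email`, the value `acme`, and the aggregation names are hypothetical.

```python
import json

# Hypothetical filter and aggregation shared by both variants.
company_filter = {"term": {"company": "acme"}}
emails_agg = {"terms": {"field": "email", "size": 10}}

# Variant 1: filter in the query. Only documents matching the query
# ever reach the aggregation.
query_filtered = {
    "size": 0,
    "query": {"bool": {"filter": company_filter}},
    "aggs": {"emails": emails_agg},
}

# Variant 2: filter aggregation. With no query given, every document is
# matched, and the filter is applied inside the aggregation tree instead.
agg_filtered = {
    "size": 0,
    "aggs": {
        "acme_only": {
            "filter": company_filter,
            "aggs": {"emails": emails_agg},
        }
    },
}

print(json.dumps(query_filtered, indent=2))
print(json.dumps(agg_filtered, indent=2))
```

The filter aggregation is mainly useful when several aggregations in one request need different subsets of the same result set; when there is a single subset, the query form is the one recommended here.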

Ah ha! A thing I forgot! not_analyzed supports doc_values which will use way, way less memory. As soon as you use an analyzer you get field data only. Elasticsearch 5.0 is coming with a thing that lets you use an analyzer, but only one that emits only a single token, and still use doc values.

So my suggestion is not_analyzed all the time.
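The 5.x feature described above is not named in this thread. Assuming it refers to what Elasticsearch 5.x exposes as a `normalizer` on `keyword` fields, a mapping that lowercases values while keeping doc values might look like the sketch below. The names `lowercase_normalizer`, `employee`, and `email` are hypothetical.

```python
import json

# Assumption: the "analyzer that emits only a single token" corresponds to
# a custom normalizer, which accepts token filters but no tokenizer.
index_settings = {
    "analysis": {
        "normalizer": {
            "lowercase_normalizer": {
                "type": "custom",
                "filter": ["lowercase"],
            }
        }
    }
}

# keyword fields use doc values by default, so no extra flag is set here.
mappings = {
    "employee": {
        "properties": {
            "email": {"type": "keyword", "normalizer": "lowercase_normalizer"}
        }
    }
}

create_index_body = {"settings": index_settings, "mappings": mappings}
print(json.dumps(create_index_body, indent=2))
```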
