What is the difference between using "keyword" tokenizer and "not_analyzed"


What is the difference between using "keyword" tokenizer and "not_analyzed" on the fields?

Does either provide better aggregation performance than the other if the field size is the same (e.g., if I am aggregating on email addresses at a company with 1 million employees)?


not_analyzed is slightly faster at index time.

The keyword tokenizer allows you to use token filters like lowercase.
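To make the two options concrete, here is a minimal sketch of what the index bodies could look like, written as Python dicts. The type name `employee`, the field name `email`, and the analyzer name `lowercase_keyword` are made up for illustration, and the mapping syntax assumes the pre-5.0 `string` field type discussed in this thread.

```python
import json


def build_index_body(use_lowercase: bool) -> dict:
    """Return an index-creation body for a hypothetical 'email' field.

    use_lowercase=False -> plain not_analyzed field (stored as one exact term).
    use_lowercase=True  -> custom analyzer: keyword tokenizer + lowercase filter
                           (still one term per value, but lowercased).
    """
    if not use_lowercase:
        return {
            "mappings": {
                "employee": {  # hypothetical type name
                    "properties": {
                        "email": {"type": "string", "index": "not_analyzed"}
                    }
                }
            }
        }
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "lowercase_keyword": {  # hypothetical analyzer name
                        "type": "custom",
                        "tokenizer": "keyword",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "employee": {
                "properties": {
                    "email": {"type": "string", "analyzer": "lowercase_keyword"}
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(True), indent=2))
```

Either dict would be sent as the body of an index-creation request; the only functional difference is whether a token filter can run on the single emitted token.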

It shouldn't matter either way for aggregations once you've paid the (comparatively low) price to build the query for any filtering you do before the aggregation.

Thanks @nik9000

We are using the keyword tokenizer just so we can use the lowercase filter.

Follow-up question which I have been wondering about:

An aggregation loads all values of the aggregated field into doc_values or field_data, so:
1-> Does query filtering improve the memory footprint and the time taken to fetch aggregations, apart from the effect of having a large number of buckets?
2-> Is there any difference between using a filter aggregation and a query filter in an aggregation query? Will either change the aggregation performance?

I have two mappings: one with fields analyzed using 'analyzer: keyword' and the other with 'not_analyzed' fields. In some instances, why does an aggregation on the index with 'index: not_analyzed' fields take less time and less field data space?

I am confused about what the real difference between these is at search time, @nik9000?

Using a query to filter is going to be faster because you never have to load anything from doc values.
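For reference, the two request shapes being compared might look roughly like this. It is only a sketch: the helper names, the aggregation names (`by_value`, `only_matching`), and the field names in the usage line are hypothetical.

```python
def query_filtered_agg(filter_field: str, filter_value: str, agg_field: str) -> dict:
    """Filter in the query: only matching documents reach the aggregation."""
    return {
        "size": 0,  # no hits needed, only aggregation results
        "query": {"bool": {"filter": {"term": {filter_field: filter_value}}}},
        "aggs": {"by_value": {"terms": {"field": agg_field}}},
    }


def filter_agg(filter_field: str, filter_value: str, agg_field: str) -> dict:
    """Filter inside the aggregation tree: the query itself matches everything."""
    return {
        "size": 0,
        "aggs": {
            "only_matching": {
                "filter": {"term": {filter_field: filter_value}},
                "aggs": {"by_value": {"terms": {"field": agg_field}}},
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    print(query_filtered_agg("department", "sales", "email"))
    print(filter_agg("department", "sales", "email"))
```

In the first body the document set is narrowed before any aggregation runs; in the second, the filter is just one bucket in the aggregation tree, which is mainly useful when other aggregations in the same request need the unfiltered set.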

Ah ha! A thing I forgot! not_analyzed supports doc_values which will use way, way less memory. As soon as you use an analyzer you get field data only. Elasticsearch 5.0 is coming with a thing that lets you use an analyzer, but only one that emits only a single token, and still use doc values.
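As a toy model of the rule described above (not Elasticsearch code), the following sketch classifies a pre-5.0 `string` field mapping by the structure an aggregation would read from. The function name and the sample field mappings are hypothetical, and it assumes `doc_values` is switched on explicitly in the mapping.

```python
def aggregation_backing(field_mapping: dict) -> str:
    """Return 'doc_values' or 'fielddata' for a pre-5.0 string field mapping.

    Encodes the rule from this thread: only not_analyzed string fields can
    use doc values; any field that goes through an analyzer falls back to
    in-heap field data.
    """
    is_not_analyzed = field_mapping.get("index") == "not_analyzed"
    has_doc_values = field_mapping.get("doc_values") is True
    return "doc_values" if (is_not_analyzed and has_doc_values) else "fielddata"


# Hypothetical mappings for the same email field.
exact_email = {"type": "string", "index": "not_analyzed", "doc_values": True}
lowercased_email = {"type": "string", "analyzer": "lowercase_keyword"}

if __name__ == "__main__":
    print(aggregation_backing(exact_email))
    print(aggregation_backing(lowercased_email))
```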

So my suggestion is not_analyzed all the time.