Index size of string fields, analyzed vs. not_analyzed


(calc) #1

Hi there, I originally indexed documents of about 50 fields (where 20 of
them are string types) with default mapping settings, so those string
fields are all analyzed. Recently I realized that I don't actually need to
do free-text search on 17 of those string fields; I'd most likely just
do term filters on them, so analyzing those 17 fields does no good. So I
set them to "not_analyzed" and created a new index.
But now I observe that, after indexing a few million documents, the new
index size is about 2.4KB per document on average, a little more than the
previous index (2.26KB per document).

I thought setting "not_analyzed" on that many string fields would reduce the
index size. Was that a wrong assumption? Thanks for any enlightenment.
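
For readers following along, the change described above might look roughly like the sketch below. The type name `doc` and the field names `user_hash` and `title` are hypothetical, since the post does not name the actual fields; the `"index": "not_analyzed"` setting and the `term` filter are the features the post refers to.

```python
import json

# Hypothetical type and field names; the post does not give the real ones.
mapping = {
    "doc": {
        "properties": {
            # Indexed as one exact term, no analysis applied.
            "user_hash": {"type": "string", "index": "not_analyzed"},
            # Left with the default (analyzed) behaviour for free-text search.
            "title": {"type": "string"},
        }
    }
}

# A term filter compares against the indexed term exactly as given.
term_filter = {"term": {"user_hash": "1234567890123456789"}}

print(json.dumps(mapping, indent=2))
print(json.dumps(term_filter))
```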


(Daniel Schnell) #2

The standard analyzer removes stop words, among other things. Depending on the data you are using, this can be a considerable amount.
'Not analyzed' does not touch the content at all, so you'll end up with all the data of the fields you marked that way in your index.

In reply to calc's post of 16.06.2012, 18:55.


(calc) #3

Hmm. Thanks for the reply. Most of the string fields I changed to
"not_analyzed" are actually 8-byte hash-ids. I stored them as strings
(printed as decimal numbers) only because ES (written in Java) can't handle
8-byte unsigned integers. So there are no stopwords to remove in those
fields if they are analyzed. A couple of the other string fields may contain
valid English words and occasionally stopwords, but that is likely a
very minor factor.
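
A rough way to check this reasoning is sketched below. The `analyze` function is only a simplified stand-in for the standard analyzer (lowercase, split on non-alphanumeric characters, drop a few stopwords), and both the stopword list and the hash-id value are made up for illustration.

```python
import re

# Tiny illustrative stopword list; an assumption, not the analyzer's real list.
STOPWORDS = {"a", "an", "and", "of", "the", "to"}

def analyze(text):
    """Rough stand-in for an analyzer: lowercase, tokenize, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def not_analyzed(text):
    """Index the whole value as a single, untouched term."""
    return [text]

hash_id = "12345678901234567890"  # made-up id printed as a decimal string

# A digit-only string yields the same single term either way.
print(analyze(hash_id) == not_analyzed(hash_id))  # True

# Free text is where analysis changes what gets indexed.
print(analyze("The Iron Lady"))  # ['iron', 'lady']
```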

Now that I think about it again, I guess in this case the index size is
affected by a tradeoff between the number of entries (i.e. the number of
distinct tokens) in the index and the number of documents each token
has to point to. E.g., assuming we have the following 3 documents with
document id and string values:

101, "iron"
102, "lady"
103, "iron lady"

If the string field is analyzed, we'd have 2 entries in the index, each
entry pointing to 2 documents:

"iron" => 101, 103
"lady" => 102, 103

But if the string field is not_analyzed, we'd have 3 entries in the index,
each entry pointing to 1 document:

"iron" => 101
"lady" => 102
"iron lady" => 103

So it is not obvious which way gives the larger index. It all
depends on the data.
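
The three-document example can be reproduced with a toy inverted index, as in the sketch below. It only counts distinct terms and postings; it assumes whitespace splitting as the "analysis" step and says nothing about how the index is actually laid out on disk.

```python
from collections import defaultdict

docs = {101: "iron", 102: "lady", 103: "iron lady"}

def build_index(docs, analyzed):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # Analyzed: one term per word. Not analyzed: the whole value is one term.
        terms = value.split() if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def summarize(index):
    """Return (number of distinct terms, total number of postings)."""
    return len(index), sum(len(ids) for ids in index.values())

analyzed_index = build_index(docs, analyzed=True)
exact_index = build_index(docs, analyzed=False)

print(analyzed_index, summarize(analyzed_index))  # 2 terms, 4 postings
print(exact_index, summarize(exact_index))        # 3 terms, 3 postings
```

Fewer terms with longer postings lists on one side, more terms with shorter lists on the other: which total is smaller depends on how often values share words.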

The main reason I changed those string fields to not_analyzed is so that I
can use exact-match term filters on them. I was hoping as a side-effect
it'd reduce the index size. Turns out I hoped wrong. :)

In reply to Daniel Schnell's post of Sunday, June 17, 2012, 8:05:58 AM UTC-4.

