How does Elasticsearch store repeated values across documents?

Say I have a document

{"name":"first","class": "Type-A"}

Then another document

{"name":"second","class": "Type-A"}

Does it store Type-A twice, or does it reference repeated values? Let's say I have a million docs like this referring to Type-A. Does ES optimize this somehow?

Thanks,

Jonathan

Two things. Elasticsearch builds an inverted index from those documents; schematically, it will contain something like:

Type-A: 1, 2

Where 1 and 2 are the document IDs. This is schematic, as other information is stored alongside.
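To make the schematic above concrete, here is a minimal Python sketch of the idea, not Elasticsearch's actual implementation. It assumes the "class" value is indexed as a single untokenized term (as a keyword field would be), and the function name build_inverted_index is made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs, field):
    """Map each distinct value of `field` to the list of doc ids containing it."""
    index = defaultdict(list)
    # Visit documents in id order so each postings list ends up sorted.
    for doc_id, doc in sorted(docs.items()):
        index[doc[field]].append(doc_id)
    return dict(index)

docs = {
    1: {"name": "first", "class": "Type-A"},
    2: {"name": "second", "class": "Type-A"},
}

postings = build_inverted_index(docs, "class")
print(postings)  # {'Type-A': [1, 2]}
```

The term "Type-A" appears once as a key, however many documents contain it; only the list of document IDs grows.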

But Elasticsearch will also add a stored field named _source, which will contain:

{"name":"first","class": "Type-A"}

And

{"name":"second","class": "Type-A"}

These are stored as is, but compressed by default.
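As a rough illustration of why repeated values in _source cost little once compressed, here is a small Python sketch. It is only a stand-in: Elasticsearch compresses stored fields in blocks with LZ4 by default (DEFLATE with best_compression), whereas this uses Python's zlib (DEFLATE) over one big buffer, so the numbers will not match a real index. The helper name source_sizes and the generated documents are assumptions for the example.

```python
import json
import zlib

def source_sizes(n_docs):
    """Return (raw_bytes, compressed_bytes) for n_docs JSON docs sharing one class value."""
    docs = [json.dumps({"name": f"doc-{i}", "class": "Type-A"}) for i in range(n_docs)]
    raw = "\n".join(docs).encode("utf-8")
    # zlib stands in for the stored-fields codec; real ES compresses per block.
    packed = zlib.compress(raw)
    return len(raw), len(packed)

raw_size, packed_size = source_sizes(10_000)
print(f"raw: {raw_size} bytes, compressed: {packed_size} bytes")
```

The repeated "class": "Type-A" text is still present in every document, but the compressor encodes each repetition as a short back-reference instead of the full string.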

Normally, you don't really have to think about all this.

Does ES optimize this somehow?

Ah yes, ES optimizes all of that as much as possible.

Thanks for the response. My team is facing space issues, and we want to add another attribute to our documents to query by, but that involves updating billions of documents. So I'm trying to estimate the space cost. It seems the inverted index space would be trivial. Is there any way to figure out how Elastic compresses things?

My team is facing space issues

Well, very often the cost of added complexity and its tradeoffs is much bigger than the cost of buying new hardware (disks). But I can't decide that for you.

Any way to figure out how Elastic compresses things?

More about compression here: Index Modules | Elasticsearch Reference [6.2] | Elastic
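For reference, the setting that page describes is index.codec, which can be set to best_compression to trade slower stored-field reads for a smaller index. Here is a hedged sketch of the request body you might send when creating an index. The helper name create_index_body is made up, and you would still need to send the body with your own client. Note that index.codec is a static setting, so it applies at index creation or on a closed index.

```python
import json

def create_index_body(codec="best_compression"):
    """Build a create-index body that selects the stored-fields codec."""
    # "default" uses LZ4; "best_compression" uses DEFLATE for a higher ratio.
    return {"settings": {"index": {"codec": codec}}}

body = json.dumps(create_index_body())
print(body)
```

Comparing the on-disk size of a test index with and without this setting, loaded with a sample of your real documents, is probably the most reliable way to estimate the space cost.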

You can also read this section, which gives a lot of advice:
