Index size is 4 times the document size?

With the settings and mapping below, I indexed a 1.3MB text file, and the index under data/clusters_name/nodes/ is now almost 4 times that size. What is happening? Which settings am I missing?

    {
      "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
      },
      "mappings" : {
        "attachment" : {
          "properties" : {
            "file" : {
              "type" : "attachment",
              "fields" : {
                "title" : { "store" : "yes" },
                "file" : { "term_vector" : "with_positions_offsets", "store" : "yes" }
              }
            }
          }
        }
      }
    }

Please format your code with the </> icon.

You have to understand what Elasticsearch does in that case:

  • extracts the text with Tika and adds it to the field file.file
  • stores this extracted text (you explicitly asked for it)
  • indexes this content
  • copies the content of every field into the _all field (I did not check, but the BASE64 content may also be added to this field) and indexes that
  • stores the JSON doc in _source, including your BASE64 content.

That's why I'm not surprised by your numbers.
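As a rough illustration of one item in the list above: base64 encoding alone inflates data by a factor of 4/3, so the copy of the file kept in _source starts out about a third larger than the original, before any compression. The sketch below only computes that encoded size. Reading "1.3MB" as 1.3 million bytes is an assumption, and the numbers say nothing about how Lucene compresses stored fields or how large the inverted index and term vectors end up.

```python
import base64
import math


def base64_length(raw_bytes: int) -> int:
    """Number of characters in the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


# Assumption: "1.3MB" is taken to mean 1.3 million bytes.
raw_size = 1_300_000
encoded_size = base64_length(raw_size)

print("raw bytes:     ", raw_size)
print("base64 chars:  ", encoded_size)
print("overhead ratio:", round(encoded_size / raw_size, 3))

# Cross-check the formula against the real encoder on a small input.
assert base64_length(10) == len(base64.b64encode(b"x" * 10))
```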

I'd at least disable the _all field.

Also, note that with 5.0, this plugin becomes an ingest node plugin, which will definitely give better results!
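A minimal sketch of what that could look like, assuming the pre-5.0 mapping syntax used in the question. It builds the index creation body as a Python dict and prints the JSON. The only change from the original mapping is the `_all` entry, and nothing is sent to a cluster here.

```python
import json

# Mapping for the "attachment" type from the question, with _all disabled.
mapping = {
    "attachment": {
        "_all": {"enabled": False},
        "properties": {
            "file": {
                "type": "attachment",
                "fields": {
                    "title": {"store": "yes"},
                    "file": {
                        "term_vector": "with_positions_offsets",
                        "store": "yes",
                    },
                },
            }
        },
    }
}

# Body for the index creation request, same settings as in the question.
index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": mapping,
}

print(json.dumps(index_body, indent=2))
```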


Okay, I am a bit confused about how mapper attachments work when indexing a file.

PUT /test
PUT /test/person/_mapping

    {
      "person" : {
        "properties" : {
          "my_attachment" : { "type" : "attachment" }
        }
      }
    }

PUT /test/person/1

    {
      "my_attachment" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
    }

This gives an index size of 9.5 KB, while the base64 content is just 48 bytes. Of course, _all is enabled by default, and I am not sure whether the extracted content, or even the base64 content, is stored.

I just want to index a file from the local file system, with its title and content, and with highlighting enabled. Can you please point out how to do this with an index size factor of 1 or less?

Well, you also have some metadata, so with content this small you cannot really measure anything.
Also, keep in mind that stored fields are compressed, so here again it depends on the effective compression ratio.
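To put the sample in perspective, the payload can be measured locally. The short check below decodes the base64 string from the example request: it is 64 characters long and decodes to a 47-byte RTF snippet, so fixed per-index and per-segment overhead will dominate anything you measure on disk.

```python
import base64

# The base64 payload from the example document above.
sample = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
decoded = base64.b64decode(sample)

print(len(sample))   # characters sent in the JSON document
print(len(decoded))  # bytes of the original RTF content
print(decoded)       # the raw RTF, including the "Lorem ipsum" text
```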

What I recommend is to extract the text on your end, or to use Elasticsearch 5.0 and the ingest-attachment plugin (not suitable for production yet).

I'll try out the ingest-attachment plugin.

So, will Elasticsearch 5 (with the ingest node / ingest-attachment plugin) become a valid and robust choice as a primary repository (for blob content)?
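For the first option, a minimal sketch, assuming the file is already plain text (as in the original question) and that pre-5.0 `string` fields are used. The type name `document`, the field names `title` and `content`, and the helper `build_document` are all hypothetical. The code only builds the mapping and the JSON body; sending them to Elasticsearch is left out.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical mapping: plain string fields instead of the attachment type.
# Term vectors with positions and offsets are kept on "content" for highlighting.
mapping = {
    "document": {
        "_all": {"enabled": False},
        "properties": {
            "title": {"type": "string"},
            "content": {
                "type": "string",
                "term_vector": "with_positions_offsets",
            },
        },
    }
}


def build_document(path: Path) -> dict:
    """Read a plain-text file and return the JSON body to index."""
    text = path.read_text(encoding="utf-8", errors="replace")
    return {"title": path.name, "content": text}


# Small demonstration with a temporary file standing in for the real one.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "notes.txt"
    sample_path.write_text("Lorem ipsum dolor sit amet", encoding="utf-8")
    doc = build_document(sample_path)

print(json.dumps(mapping, indent=2))
print(json.dumps(doc))
```

With this approach no base64 copy of the file ends up in _source; only the extracted text does.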

Not related.

Ingest node does not mean that we will store the "blob" in Elasticsearch. It's quite the opposite: we will extract only the needed text from the binary and index only that text.
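As a sketch of what such a pipeline might look like with the ingest-attachment plugin: the `attachment` processor extracts text from a base64 field, and a `remove` processor then drops the binary so it is not kept in _source. The pipeline id `attachment` and the field name `data` are illustrative assumptions, and whether you want the `remove` step depends on whether something else holds the original file.

```python
import json

# Hypothetical pipeline definition; "data" is an assumed field name
# holding the base64-encoded file in the incoming document.
pipeline = {
    "description": "Extract text from a base64 field, then drop the binary",
    "processors": [
        {"attachment": {"field": "data"}},
        {"remove": {"field": "data"}},
    ],
}

# Request that would register the pipeline (printed, not sent).
print("PUT _ingest/pipeline/attachment")
print(json.dumps(pipeline, indent=2))
```

Documents would then be indexed with the `pipeline=attachment` query parameter so that they pass through these processors first.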

Thanks @dadoonet for your answer.
Currently we are using the mapper-attachment plugin in a big environment, so the binary is in the _source and Elasticsearch is the primary and only repository.
In this new scenario, will the blob continue to be stored in the same way?
Is the text extraction process in the ingest node asynchronous to the post?