Index size is 4 times the document size?

vasa47 · April 13, 2016, 3:42pm

With the settings and mapping below, I indexed a 1.3 MB text file, and the index under data/clusters_name/nodes/ is now almost 4 times that size. What is happening? What settings am I missing?

    {
      "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
      },
      "mappings" : {
        "attachment" : {
          "properties" : {
            "file" : {
              "type" : "attachment",
              "fields" : {
                "title" : { "store" : "yes" },
                "file" : { "term_vector" : "with_positions_offsets", "store" : "yes" }
              }
            }
          }
        }
      }
    }

dadoonet · April 13, 2016, 5:13pm

Please format your code with the </> icon.

You have to understand what Elasticsearch does in that case:

- extracts the text with Tika and adds it to the field file.file
- stores this extracted text (you explicitly asked for it)
- indexes this content
- copies every field's content into the _all field (I did not check, but the BASE64 content may also be added to this field) and indexes that
- stores the JSON doc in _source, including your BASE64 content

That's why I'm not surprised by your numbers.

I'd at least disable the _all field.
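
As a rough sketch of that suggestion: in the 2.x mapping syntax, _all can be switched off per type with `"_all": {"enabled": false}`. The snippet below rebuilds the request body from the first post with that one addition and prints it as JSON. The variable name `index_body` is only for illustration, and building the body in Python is a convenience, not a requirement.

```python
import json

# Index creation body from the first post, with one addition:
# the _all field is disabled for the "attachment" type (2.x mapping syntax).
index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "attachment": {
            "_all": {"enabled": False},
            "properties": {
                "file": {
                    "type": "attachment",
                    "fields": {
                        "title": {"store": "yes"},
                        "file": {
                            "term_vector": "with_positions_offsets",
                            "store": "yes",
                        },
                    },
                }
            },
        }
    },
}

# Print the JSON that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```

This only removes the extra _all copy; the stored extracted text and the BASE64 content in _source are unaffected.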

Also note that in 5.0 this plugin becomes an ingest node plugin, which will definitely give better results!

HTH

vasa47 · April 14, 2016, 10:22am

Okay, I am a bit confused about how mapper attachments work when indexing a file.

PUT /test
PUT /test/person/_mapping

{
    "person" : {
        "properties" : {
             "my_attachment" : { "type" : "attachment" }
        }
    }
}

PUT /test/person/1

{
    "my_attachment" : "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

This gives an index size of 9.5 KB, of which the base64 content is just 48 bytes. Of course, _all is enabled by default, but I am not sure whether the extracted content is stored, or whether the base64 content is stored as well.

I just want to index a file from the local file system, with its title and content, and with highlighting enabled. Can you please point out how to do this with an index size factor of 1 or less?

dadoonet · April 14, 2016, 10:43am

Well, you also have some metadata, so with content this small you cannot really measure anything.
Also, you should know that stored fields are compressed. Here again, it depends on the effective compression ratio.
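
To put a number on how small that sample is, the base64 string from the PUT /test/person/1 request can be decoded locally. This is only a quick sanity check, not a measurement of what actually gets written to disk.

```python
import base64

# Base64 payload copied from the PUT /test/person/1 request.
payload = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

raw = base64.b64decode(payload)  # the tiny RTF document that was sent

print(len(payload))  # characters of base64 in the request body
print(len(raw))      # bytes of the decoded RTF file
print(raw)           # the RTF markup wrapping one short sentence
```

The decoded file is a few dozen bytes of RTF around a single 26-character sentence, so any fixed overhead easily outweighs the content itself.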

What I recommend is to extract the text on your end, or to use Elasticsearch 5.0 and the ingest-attachment plugin (not suitable for production yet).
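
A minimal sketch of the first option, assuming the file is already plain text (as in the first post), so no Tika step is needed. The index then holds only a title and the text as ordinary 2.x `string` fields, with term vectors on the content for highlighting. The field names `title` and `content`, the type name `doc`, and the helper `build_document` are made up for this example, and sending the requests to Elasticsearch is left out.

```python
import json
import os
import tempfile

# Mapping with plain string fields only: no attachment type, no BASE64.
# _all is disabled, and term vectors on "content" support highlighting.
mapping = {
    "mappings": {
        "doc": {
            "_all": {"enabled": False},
            "properties": {
                "title": {"type": "string"},
                "content": {
                    "type": "string",
                    "term_vector": "with_positions_offsets",
                },
            },
        }
    }
}


def build_document(path):
    """Read a local text file and return the JSON body to index."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    return {"title": os.path.basename(path), "content": text}


# Demo with a throwaway file standing in for the real text file.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("Lorem ipsum dolor sit amet")
    document = build_document(sample_path)

print(json.dumps(mapping, indent=2))
print(json.dumps(document, indent=2))
```

Even with this layout the text is kept in _source and again in the index and term vectors, so a size factor below 1 is not guaranteed; it depends on the content and on how well it compresses.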

vasa47 · April 14, 2016, 10:57am

I'll try out the ingest-attachment plugin.

virtualmagister · April 14, 2016, 4:26pm

@dadoonet
So, will Elasticsearch 5 (with the ingest node / ingest-attachment plugin) become a valid and robust choice as a primary repository (for blob content)?
thanks
Gianni

dadoonet · April 14, 2016, 4:49pm

Not related.

Ingest node does not mean that we will store the "blob" in Elasticsearch. It's quite the opposite: we will extract only the needed text from the binary and index only that text.
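
For illustration, a sketch of what such a pipeline definition might look like with the 5.x ingest API: an `attachment` processor extracts text from a base64 field, and a `remove` processor then drops the binary so that only the text is kept. The field name `data` and the description text are assumptions for this example, and the body is built in Python only so that it can be printed as JSON.

```python
import json

# Sketch of an ingest pipeline body (sent with PUT _ingest/pipeline/<name>).
# "data" is an assumed name for the field carrying the BASE64 content.
pipeline = {
    "description": "Extract text from a binary, then drop the binary",
    "processors": [
        # Provided by the ingest-attachment plugin: extracts text and
        # metadata from the base64 field (into "attachment" by default).
        {"attachment": {"field": "data"}},
        # Built-in processor: removes the original base64 field so that
        # it does not end up in _source.
        {"remove": {"field": "data"}},
    ],
}

print(json.dumps(pipeline, indent=2))
```

Without the `remove` step the base64 field would still be part of the indexed document, so whether the blob is kept is a choice made in the pipeline.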

virtualmagister · April 19, 2016, 9:42am

Thanks @dadoonet for your answer.
Currently we are using the mapper-attachment plugin in a big environment, so the binary is in the _source and Elasticsearch is our primary and only repository.
In this new scenario, will the blob continue to be stored in the same way?
Is the text extraction process in the ingest node asynchronous to the post?

Thanks,
Gianni
