Index size for files much bigger with ES5 compared to ES2

I used the mapper-attachments plugin to index files. The request adds the file to the field "file", and the extracted content is copied to the other fields via copy_to. The following mapping definition was used with ES2:

{
    "files": {
        "properties": {
            "startDate": {
                "type": "date", "index": "not_analyzed", "store": false, 
                "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"

            },
            "mimetype": {
                "type": "integer", "index": "not_analyzed", "store": false
            },
            "file": { 
                "type": "attachment", 
                "fields": {
                    "title": { "store": false },
                    "content_type": { "store": false, "index": "no" },
                    "content": { "store": false, "term_vector": "with_positions_offsets", "type": "string", "copy_to": ["fileGrams", "fileEn", "fileLang1", "fileLang2"] },
                    "date": { "store": false },
                    "author": { "store": false },
                    "keywords": { "store": false },
                    "content_type" : { "store": false },
                    "language": { "store": false }
                }
            },
            "fileGrams": { 
                "type": "string", "index": "analyzed", "analyzer": "angram"
            },
            "fileEn": { 
                "type": "string", "index": "analyzed", "analyzer": "alangen"
            },
            "fileLang1": { 
                "type": "string", "index": "analyzed", "analyzer": "alang1"
            },
            "fileLang2": { 
                "type": "string", "index": "analyzed", "analyzer": "alang2"
            }
        }
    }
}

With ES5 I switched to the ingest-attachment plugin. Because of changes in ES5 (the string type no longer exists, and I now use multi-fields), the updated mapping looks like this:

{
    "files": {
        "properties": {
            "startDate": {
                "type": "date", "index": true, "store": false, 
                "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
            },
            "mimetype": {
                "type": "integer", "index": true, "store": false
            },
            "file": { 
                "type": "text", "index": true,
                "fields": {
                    "fileGrams": { 
                        "type": "text", "index": true, "analyzer": "angram"
                    },
                    "fileEn": { 
                        "type": "text", "index": true, "analyzer": "alangen"
                    },
                    "fileLang1": { 
                        "type": "text", "index": true, "analyzer": "alang1"
                    },
                    "fileLang2": { 
                        "type": "text", "index": true, "analyzer": "alang2"
                    }
                }
            }
        }
    }
}

My problem now is that, for the same fixed data set (around 120,000 documents: emails, PDFs, XML files, etc.), the index is much larger with ES5:
ES2: 350MB
ES5: 1800MB

When I remove the sub-fields of the "file" field (no multi-fields):
ES5: ~600MB

Any ideas or explanations as to what the reason(s) might be?

Hi David

That's super interesting. I would not expect such a difference.

Can you share the exact mapping you have in 2.x and 5.x by running:

GET index/files/_mapping

Also can you give an example of a JSON document (_source field) in ES 2.x and in ES 5.x (please use the same document so I can really compare)?

And finally, can you run:

GET /yourindexname/_stats?include_segment_file_sizes&human

See below for the exact mappings.

Providing the source fields will take until Monday (for ES2); I'll reply with both once I have them.

The statistics are quite large. Is there any particular part I should provide?

ES2:

"mappings":{  
    "files":{  
        "properties":{  
            "startDate":{  
                "format":"yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss",
                "type":"date"
            },
            "mimetype":{  
                "type":"integer"
            },
            "fileGrams":{  
                "analyzer":"angram",
                "type":"string"
            },
            "fileEn":{  
                "analyzer":"alangen",
                "type":"string"
            },
            "file":{  
                "type":"attachment",
                "fields":{  
                    "date":{  
                        "type":"string"
                    },
                    "keywords":{  
                        "type":"string"
                    },
                    "content_type":{  
                        "type":"string"
                    },
                    "author":{  
                        "type":"string"
                    },
                    "name":{  
                        "type":"string"
                    },
                    "language":{  
                        "type":"string"
                    },
                    "title":{  
                        "type":"string"
                    },
                    "content":{  
                        "copy_to":[  
                            "fileGrams",
                            "fileEn",
                            "fileLang1",
                            "fileLang2"
                        ],
                        "term_vector":"with_positions_offsets",
                        "type":"string"
                    },
                    "content_length":{  
                        "type":"integer"
                    }
                }
            },
            "fileLang1":{  
                "analyzer":"alang1",
                "type":"string"
            },
            "fileLang2":{  
                "analyzer":"alang2",
                "type":"string"
            }
        }
    }
}

ES5:

"mappings": {
  "files": {
    "properties": {
      "attachment": {
        "properties": {
          "content_length": {
            "type": "long"
          },
          "content_type": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "language": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      },
      "error": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "file": {
        "type": "text"
      },
      "mimetype": {
        "type": "integer"
      },
      "startDate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

Can you change the content_type, language, and title fields to be text only (no keyword sub-field), and remove the error field on the ingest side?

BTW, I wonder what kind of values you get in the error field.
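
A sketch of what the first part of this suggestion could look like as an explicit mapping. It is derived from the ES5 mapping shown above with the keyword sub-fields dropped, and it assumes the attachment object is kept in the index at all; nothing here was confirmed in the thread.

```json
{
  "files": {
    "properties": {
      "attachment": {
        "properties": {
          "content_length": { "type": "long" },
          "content_type": { "type": "text" },
          "language": { "type": "text" },
          "title": { "type": "text" }
        }
      }
    }
  }
}
```

Defining these fields explicitly before indexing should keep dynamic mapping from adding the default keyword sub-field to each of them.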

The error field comes from my pipeline definition:

    // Processor 1: extract text and metadata from the base64 "data" field into "attachment"
    cb.startObject()
            .startObject("attachment")
            .field("field", "data")
            .field("target_field", "attachment")
            .field("indexed_chars", "-1") // -1 = no limit on extracted characters
    // on_failure handler that used to write the "error" field (now disabled)
    //        .startArray("on_failure")
    //        .startObject()
    //        .startObject("set")
    //        .field("field", "error")
    //        .field("value", "{{ _ingest.on_failure_message }} , {{ _ingest.on_failure_processor_type }}")
    //        .endObject()
    //        .endObject()
    //        .endArray()
            .endObject()
            .endObject();
    // Processor 2: copy the extracted content into the "file" field
    cb.startObject()
            .startObject("set")
            .field("field", "file")
            .field("value", "{{ attachment.content }}")
            .field("ignore_failure", true)
            .endObject()
            .endObject();
    // Processor 3: drop the extracted metadata so it is not indexed
    cb.startObject()
            .startObject("remove")
            .field("field", "attachment")
            .field("ignore_failure", true)
            .endObject()
            .endObject();
    // Processor 4: drop the base64-encoded source data
    cb.startObject()
            .startObject("remove")
            .field("field", "data")
            .field("ignore_failure", true)
            .endObject()
            .endObject();

As you can see, I removed the error field and also everything within attachment, which is where the ingest plugin writes the extracted data. The data field, which contains the base64-encoded data, is removed as well.
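
For readers who prefer the REST form, the builder calls above should correspond roughly to the following pipeline body. This is a sketch: the enclosing "processors" array is an assumption, since the post only shows the individual processor objects, and the pipeline name used when registering it is not given in the thread.

```json
{
  "processors": [
    {
      "attachment": {
        "field": "data",
        "target_field": "attachment",
        "indexed_chars": "-1"
      }
    },
    {
      "set": {
        "field": "file",
        "value": "{{ attachment.content }}",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": "attachment",
        "ignore_failure": true
      }
    },
    {
      "remove": {
        "field": "data",
        "ignore_failure": true
      }
    }
  ]
}
```

With this processor order, only the extracted text in "file" (plus whatever other fields the document already had) reaches the index; the attachment metadata and the base64 source are dropped before indexing.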

The size seems almost the same now with my data.

But at the moment I do not have any multi-fields on the content that define different analyzers.

So I'm wondering whether I did it right with ES2 and the copy_to instruction, which should have copied the extracted content to the other fields I defined. To me it seems that my last ES2 config was wrong or did not work. I would understand the index needing x times more space when it has x times more fields holding the same content with different analyzers.

Any ideas on that?
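
One way to check whether copy_to actually populated a target field is to count the documents that have a value in it. The following is a minimal sketch of a search body that could be sent to `GET index/files/_search`, where `index` is the same placeholder index name used earlier in the thread and `fileEn` is one of the target fields from the ES2 mapping. It is not the check the poster actually ran.

```json
{
  "size": 0,
  "query": {
    "exists": { "field": "fileEn" }
  }
}
```

A `hits.total` of 0 in the response would suggest that nothing was indexed into that field, while a value close to the document count would suggest copy_to is working. Repeating the request for fileGrams, fileLang1, and fileLang2 covers the remaining targets.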

I just verified that copy_to did not work as I expected. Could you tell me what I was doing wrong?
Here's the relevant mapping definition for ES 2.3.1:

        "file": { 
            "type": "attachment", 
            "fields": {
                "title": { "store": false },
                "content_type": { "store": false, "index": "no" },
                "content": { "store": false, "term_vector": "with_positions_offsets", "type": "string", "copy_to": ["fileGrams", "fileEn", "fileLang1", "fileLang2"] },
                "date": { "store": false },
                "author": { "store": false },
                "keywords": { "store": false },
                "content_type" : { "store": false },
                "language": { "store": false }
            }
        },
        "fileGrams": { 
            "type": "string", "index": "analyzed", "analyzer": "angram"
        },
        "fileEn": { 
            "type": "string", "index": "analyzed", "analyzer": "alangen"
        },
        "fileLang1": { 
            "type": "string", "index": "analyzed", "analyzer": "alang1"
        },
        "fileLang2": { 
            "type": "string", "index": "analyzed", "analyzer": "alang2"
        }

I'm just curious and would love to know why it didn't work.

Anyhow, my original question is obsolete. Sorry for the confusion!
