Index size for files much bigger with ES5 compared to ES2


(David Pocivalnik) #1

I used mapper-attachments to index files. The request adds the file to the "file" field, and its content is copied to the other fields. The following mapping definition was used with ES2:

{
    "files": {
        "properties": {
            "startDate": {
                "type": "date", "index": "not_analyzed", "store": false, 
                "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"

            },
            "mimetype": {
                "type": "integer", "index": "not_analyzed", "store": false
            },
            "file": { 
                "type": "attachment", 
                "fields": {
                    "title": { "store": false },
                    "content_type": { "store": false, "index": "no" },
                    "content": { "store": false, "term_vector": "with_positions_offsets", "type": "string", "copy_to": ["fileGrams", "fileEn", "fileLang1", "fileLang2"] },
                    "date": { "store": false },
                    "author": { "store": false },
                    "keywords": { "store": false },
                    "content_type" : { "store": false },
                    "language": { "store": false }
                }
            },
            "fileGrams": { 
                "type": "string", "index": "analyzed", "analyzer": "angram"
            },
            "fileEn": { 
                "type": "string", "index": "analyzed", "analyzer": "alangen"
            },
            "fileLang1": { 
                "type": "string", "index": "analyzed", "analyzer": "alang1"
            },
            "fileLang2": { 
                "type": "string", "index": "analyzed", "analyzer": "alang2"
            }
        }
    }
}

With ES5 I switched to ingest-attachment. Because of some changes in ES5 (the string type no longer exists, and I now use multi-fields), the updated mapping looks like this:

{
    "files": {
        "properties": {
            "startDate": {
                "type": "date", "index": true, "store": false, 
                "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
            },
            "mimetype": {
                "type": "integer", "index": true, "store": false
            },
            "file": { 
                "type": "text", "index": true,
                "fields": {
                    "fileGrams": { 
                        "type": "text", "index": true, "analyzer": "angram"
                    },
                    "fileEn": { 
                        "type": "text", "index": true, "analyzer": "alangen"
                    },
                    "fileLang1": { 
                        "type": "text", "index": true, "analyzer": "alang1"
                    },
                    "fileLang2": { 
                        "type": "text", "index": true, "analyzer": "alang2"
                    }
                }
            }
        }
    }
}

I now face the problem that, for a defined set of data (around 120,000 documents: emails, PDFs, XML files, etc.), the index is much larger:
ES2: 350MB
ES5: 1800MB

When I remove the fields within the "file" field (no multi-fields):
ES5: ~600MB

Any ideas/explanations what the reason(s) might be?


(David Pilato) #2

Hi David

That's super interesting. I would not expect such a difference.

Can you share the exact mapping you have in 2.x and 5.x by running:

GET index/files/_mapping

Also can you give an example of a JSON document (_source field) in ES 2.x and in ES 5.x (please use the same document so I can really compare)?

And finally, can you run:

GET /yourindexname/_stats?include_segment_file_sizes&human

(David Pocivalnik) #3

See below for the exact mappings.

Providing the _source fields will take until Monday (for ES2); I'll reply with both once I have them.

The statistics are quite large. Is there any particular part I should provide?

ES2:

"mappings":{  
    "files":{  
        "properties":{  
            "startDate":{  
                "format":"yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss",
                "type":"date"
            },
            "mimetype":{  
                "type":"integer"
            },
            "fileGrams":{  
                "analyzer":"angram",
                "type":"string"
            },
            "fileEn":{  
                "analyzer":"alangen",
                "type":"string"
            },
            "file":{  
                "type":"attachment",
                "fields":{  
                    "date":{  
                        "type":"string"
                    },
                    "keywords":{  
                        "type":"string"
                    },
                    "content_type":{  
                        "type":"string"
                    },
                    "author":{  
                        "type":"string"
                    },
                    "name":{  
                        "type":"string"
                    },
                    "language":{  
                        "type":"string"
                    },
                    "title":{  
                        "type":"string"
                    },
                    "content":{  
                        "copy_to":[  
                            "fileGrams",
                            "fileEn",
                            "fileLang1",
                            "fileLang2"
                        ],
                        "term_vector":"with_positions_offsets",
                        "type":"string"
                    },
                    "content_length":{  
                        "type":"integer"
                    }
                }
            },
            "fileLang1":{  
                "analyzer":"alang1",
                "type":"string"
            },
            "fileLang2":{  
                "analyzer":"alang2",
                "type":"string"
            }
        }
    }
}

ES5:

"mappings": {
  "files": {
    "properties": {
      "attachment": {
        "properties": {
          "content_length": {
            "type": "long"
          },
          "content_type": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "language": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "title": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      },
      "error": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "file": {
        "type": "text"
      },
      "mimetype": {
        "type": "integer"
      },
      "startDate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss:SSS||yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}

(David Pilato) #4

Can you change the content_type, language, and title fields to be text only and, in the ingest plugin, remove the error field?
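
As a rough sketch of what that change could look like (an assumption derived from the ES5 mapping posted above, not a tested configuration), the attachment sub-fields would simply drop the keyword multi-field. Existing field mappings cannot be changed in place, so this would have to be applied to a fresh index before reindexing:

```json
{
  "files": {
    "properties": {
      "attachment": {
        "properties": {
          "content_type": { "type": "text" },
          "language": { "type": "text" },
          "title": { "type": "text" }
        }
      }
    }
  }
}
```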

I wonder what kind of value you can have in the error field, BTW.


(David Pocivalnik) #5

The error field comes from my pipeline definition:

    cb.startObject()
            .startObject("attachment")
            .field("field", "data")
            .field("target_field", "attachment")
            .field("indexed_chars", "-1")
    //        .startArray("on_failure")
    //        .startObject()
    //        .startObject("set")
    //        .field("field", "error")
    //        .field("value", "{{ _ingest.on_failure_message }} , {{ _ingest.on_failure_processor_type }}")
    //        .endObject()
    //        .endObject()
    //        .endArray()
            .endObject()
            .endObject();
    cb.startObject()
            .startObject("set")
            .field("field", "file")
            .field("value", "{{ attachment.content }}")
            .field("ignore_failure", true)
            .endObject()
            .endObject();
    cb.startObject()
            .startObject("remove")
            .field("field", "attachment")
            .field("ignore_failure", true)
            .endObject()
            .endObject();
    cb.startObject()
            .startObject("remove")
            .field("field", "data")
            .field("ignore_failure", true)
            .endObject()
            .endObject();

As you can see, I removed the error field and also everything within attachment, where the ingest plugin writes the extracted data. The data field (containing the base64-encoded data) is removed as well.
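
For readers more used to the REST form, the builder calls above would correspond roughly to the following pipeline body. This is a sketch under assumptions: the surrounding "processors" array and the description are not shown in the snippet and are filled in here for illustration, and indexed_chars is written as a number rather than the string "-1". The body would be sent with PUT to _ingest/pipeline/ followed by a pipeline name of your choice.

```json
{
  "description": "Sketch: extract attachment text into the file field",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "target_field": "attachment",
        "indexed_chars": -1
      }
    },
    {
      "set": {
        "field": "file",
        "value": "{{ attachment.content }}",
        "ignore_failure": true
      }
    },
    {
      "remove": { "field": "attachment", "ignore_failure": true }
    },
    {
      "remove": { "field": "data", "ignore_failure": true }
    }
  ]
}
```

With this order, only the extracted text survives in file; the metadata under attachment and the base64 payload in data are dropped before the document is indexed.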

The index size now seems almost the same as with ES2 for my data.

But at the moment I do not have any multi-fields with different analyzers for the content.

So I'm wondering whether I did it right with ES2 and the copy_to instruction, which should have copied the extracted content to the other fields I defined. To me it seems that my last config (with ES2) was wrong or did not work. I could understand the index needing x times more space when there are x times more fields with the same content but different analyzers.

Any ideas on that?


(David Pocivalnik) #6

I just verified that the copy_to did not work as I expected. Could you tell me what I was doing wrong?
Here's the relevant mapping definition for ES 2.3.1:

        "file": { 
            "type": "attachment", 
            "fields": {
                "title": { "store": false },
                "content_type": { "store": false, "index": "no" },
                "content": { "store": false, "term_vector": "with_positions_offsets", "type": "string", "copy_to": ["fileGrams", "fileEn", "fileLang1", "fileLang2"] },
                "date": { "store": false },
                "author": { "store": false },
                "keywords": { "store": false },
                "content_type" : { "store": false },
                "language": { "store": false }
            }
        },
        "fileGrams": { 
            "type": "string", "index": "analyzed", "analyzer": "angram"
        },
        "fileEn": { 
            "type": "string", "index": "analyzed", "analyzer": "alangen"
        },
        "fileLang1": { 
            "type": "string", "index": "analyzed", "analyzer": "alang1"
        },
        "fileLang2": { 
            "type": "string", "index": "analyzed", "analyzer": "alang2"
        }

I'm just curious and would love to know why it didn't work.
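
One way such a check could be done, as a sketch only (the thread does not show how the verification was actually performed), is to count the documents that have any indexed terms in one of the copy_to targets. The search body below, sent to the index's _search endpoint, uses an exists query on fileEn; a total of 0 hits would suggest that nothing was copied into that field.

```json
{
  "size": 0,
  "query": {
    "exists": { "field": "fileEn" }
  }
}
```

The same body can be repeated for fileGrams, fileLang1, and fileLang2.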

Anyhow, my original question is obsolete. Sorry for any confusion!

