System.OutOfMemoryException thrown while indexing files as an attachment

Ok, I tried my best to keep it simple and understandable.
And one more doubt, David: why does the size of the files increase when they are indexed?

The total size of the folder from which the documents are indexed is 30 MB, but the head plugin is showing 127 MB for the same set of files (indexed from that same folder).

It depends.

But basically here are my thoughts.

When you index a doc (using the defaults), Elasticsearch is:

  • indexing all the fields one by one
  • storing the JSON in _source
  • creating the doc values data structures (column-oriented data structures for aggregations)
  • creating on the fly a flat version of your JSON doc in the _all field and indexing it

Then add analyzers on top of that. If you are using multi-fields, it can get even worse.
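
For instance (index and field names here are just made up for illustration), a multi-field like this indexes the same text twice, once analyzed and once verbatim:

PUT test
{
  "mappings": {
    "document": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}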

One piece of advice: control your mapping and adjust it wisely.

  • You may not need to index everything.
  • You may not want to store the binary BASE64 content in your _source field: that's basically why we created the ingest-attachment plugin in 5.0.
  • You may want to disable the _all field... (see the sketch below)
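
Something along these lines (just a sketch, assuming 2.x with the mapper-attachments plugin; index and field names are placeholders) would disable _all and keep the BASE64 out of _source:

PUT docs
{
  "mappings": {
    "document": {
      "_all": {
        "enabled": false
      },
      "_source": {
        "excludes": [ "file" ]
      },
      "properties": {
        "file": {
          "type": "attachment"
        }
      }
    }
  }
}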

HTH

But I read that if we don't store the binary BASE64 content in the _source field, we cannot use the highlight option.

This is my mapping:

{
  "mappings": {
    "document": {
      "properties": {
        "title": {
          "type": "string"
        },
        "file": {
          "type": "attachment",
          "fields": {
            "content": {
              "type": "string",
              "store": true,
              "term_vector": "with_positions_offsets_payloads",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

I haven't used the _all field or the _source field anywhere.

The only way ATM to remove the BASE64 from the document is to use a _source exclude.

Note that you are highlighting on the extracted content, file.content, which is generated at index time; you are not highlighting on file, which holds the BASE64 text.

So you are also storing file.content, which obviously requires more space, even if it's compressed on disk.
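
Because file.content is stored (and has term vectors), you can still highlight on it even without the BASE64 in _source; something along these lines (index name and query term are just examples):

GET yourindex/document/_search
{
  "query": {
    "match": {
      "file.content": "something"
    }
  },
  "highlight": {
    "fields": {
      "file.content": {}
    }
  }
}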

_all is enabled by default. https://www.elastic.co/guide/en/elasticsearch/reference/2.3/mapping-all-field.html

Oh ok. Will make the changes in my mapping.
David, did you look at GitHub for fscrawler? I'm still facing that issue.

Nope. Lots of stuff on my plate.

Oh, ok. No problem. Thanks for all the help this week.

Hi David,
I observed this while indexing the documents.

(I don't know if I'm correct with this.) When I indexed the documents using a manual id, the size was around 36 MB, but when I remove the Id field and index (auto-generating the id), it takes much longer to index, the size is larger, and the search function is not working properly. Does it depend on how the file is indexed?

Do you have an example?

Hi David,
Here is the code showing how I'm setting manual ids for the indexed documents. When I comment out the Id field in all its occurrences in the snippets below, the id is supposed to be autogenerated, and it is.

class Document
    {
        public int Id { get; set; } 

        public string Title { get; set; } 

        public string FilePath { get; set; }

        public Attachment File { get; set; }

    }

Mappings:

var createIndexResponse = client.CreateIndex(defaultIndex, c => c
    .Mappings(m => m
        .Map<Document>(mp => mp
            .Properties(ps => ps
                .Number(n => n.Name(e => e.Id))
                .String(s => s.Name(e => e.Title))
                // ... remaining property mappings omitted from the snippet
            ))));

While indexing (where counter = 1 at the start):

foreach (string file in filesList)
{
    Attachment attach = new Attachment
    {
        Name = Path.GetFileNameWithoutExtension(file),
        Content = Convert.ToBase64String(File.ReadAllBytes(file)),
        ContentType = Path.GetExtension(file)
    };

    var doc = new Document()
    {
        Id = counter,
        Title = Path.GetFileNameWithoutExtension(file),
        FilePath = Path.GetFullPath(file), // added to get the path of the file
        File = attach
    };

    list.Add(doc);

    counter++;
}

var response = client.IndexMany(list, "indexname");

I don't know what all that code is doing when calling Elasticsearch.

Is this id the same as _id we have in Elasticsearch as a metadata field?
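
For instance, you could check it directly in Sense; the index and type names below are taken from your snippet, so adjust them if they differ:

GET indexname/document/1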

Once I made a search, this is what I saw as a result in the Sense plugin. The _id is the same as the document Id I'm setting in the above code snippet.

Could you reproduce this?

Create one doc with an id and another one without an id, without any attachment, just one field.

And report here?

I'm surprised that automatic ids are taking so much space/time.

I tried it. As there are no attachments, the docs got indexed in the same time.
And the size differs only by 6 KB (custom id: 36 KB, whereas autogenerated id: 30 KB, for a total of 50 docs with fields like title and filePath).

Can you reproduce it with a shell script?

I don't know how to do it using a shell script. In the above code snippets I just commented out the unnecessary fields, indexed the documents, and this is what I observed.

I tried to index docs using the Sense plugin.

With a user-defined id of 1:

PUT abc/docs/1
{
   "title":"trial1"
}

Without a user-defined id:

PUT abc/docs/
{
   "title":"trial1"
}

Then it threw this error:
No handler found for uri [/abc/docs/] and method [PUT]

For the latter you have to use the POST verb.
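
That is, the same request with POST, so Elasticsearch generates the id itself:

POST abc/docs
{
   "title":"trial1"
}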

Ok David. Ran it with POST.

Sorry if I was unclear, but this is not what I was expecting from you.

You said that you have a problem when you index data containing attachments:
when you define the _id, it takes far less space on disk than when you don't define it.

I asked to reproduce this with a full script.

Which basically means create a script like:

DELETE index
PUT index
{
  "mappings": ... // Your mapping here
}
PUT index/doc/1
{
  "attachment": "BASE64 content"
}
GET _cat/indices/index?v

Attach the output of the last command.

Then run:

DELETE index
PUT index
{
  "mappings": ... // Your mapping here
}
POST index/doc
{
  "attachment": "BASE64 content"
}
GET _cat/indices/index?v

Attach the output of the last command.

This is why I asked for a reproduction.

I do not believe _id can explain that.

I think you are indexing documents with different content, which would explain this.