Out of memory error and duplicate rows

I'm getting this error when trying to index using Logstash:

warning: thread "[main]>worker1" terminated with exception (report_on_exception is true):
java.lang.OutOfMemoryError: UTF16 String size is 1371255266, should be less than 1073741823

Also, customer_id is unique in the customer table; however, Logstash is repeating each record more than 500 times for the same customer_id.


Heap size is 32 GB; the server has 64 GB of memory.

We are attempting to index 1 terabyte of data, which is about 500,000 records.

Our Logstash script is below. Any help with the java.lang.OutOfMemoryError and the repeating rows would be greatly appreciated.

input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://SERVERA-DB:1433;databaseName=db_1;integratedsecurity=true;"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "bob_hanson"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "5000"
    statement => "SELECT customer_id, first_name, middle_initial, last_name, address1, address2, city, zip_code
                  FROM [customers]"
    tags => ["customer_search_tags"]
  }
}

filter {
  jdbc_streaming {
    jdbc_connection_string => "jdbc:sqlserver://SERVERA-DB:1433;databaseName=db_1;integratedsecurity=true;"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "bob_hanson"
    statement => "SELECT document_id, document, document_content_type, document_name
                  FROM Customer_Documents cd
                  WHERE document IS NOT NULL
                    AND document_content_type IN ('application/pdf', 'application/msword', 'text/plain', 'application/atom+xml', 'application/msaccess', 'application/msexcel', 'application/vnd.ms-excel', 'application/vnd.ms-officetheme', 'application/vnd.ms-outlook', 'message/rfc822', 'text/css', 'text/html', 'text/xml')
                    AND cd.customer_id = :cc_customer_id"
    parameters => { "cc_customer_id" => "customer_id" }
    target => "attachments"
  }
}

output {
  elasticsearch {
    hosts => ["https://XXXXXXXXXXXX:9200"]
    cacert => "E:\logstash\config\blog_cert.pem"
    ssl => true
    ssl_certificate_verification => false
    pipeline => "attachments"
    index => "customer_search_entries"
    user => "elastic"
    password => "XXXXXXXX=="
  }
}

Is it possible that one of your documents is over 1 GB? A few years ago the maximum string size in Java was effectively reduced from 2 GB (each UTF-16 character uses 2 bytes, with a 32-bit length field) to 1 GB, a side effect of compact strings, which are backed by byte arrays rather than char arrays. Note that the 1073741823 in your error is 2^30 − 1, i.e. that 1 GB limit, and 1371255266 bytes is about 1.28 GB.

Is the pipeline getting restarted over and over again due to this exception? If so, that could explain the repeated rows. Perhaps use the customer_id as the document_id in Elasticsearch, so that each retry just overwrites the same document instead of creating duplicates.
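A minimal sketch of that idea in the elasticsearch output block, using Logstash's sprintf syntax to pull the field from each event (the other settings are the ones from the original config; adjust to taste):

```
output {
  elasticsearch {
    hosts => ["https://XXXXXXXXXXXX:9200"]
    index => "customer_search_entries"
    # Use the unique customer_id as the Elasticsearch _id so a replayed
    # batch updates the existing document instead of inserting a new one.
    document_id => "%{customer_id}"
  }
}
```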

Yes, there are definitely some documents over 1 GB in size. Is there a way to index them?

I'm not sure, but I doubt it. Even if you could cast it to some other type that Logstash could process, I suspect you would then hit the same issue on the Elasticsearch side. The default limit on document size in Elasticsearch is 100 MB; Lucene supports up to 2 GB, but Elastic recommend against going anywhere near that.
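For reference, that 100 MB default is Elasticsearch's `http.max_content_length` HTTP request size limit. It can be raised in `elasticsearch.yml`, though this is a sketch of the knob rather than a recommendation — it still cannot get anywhere close to a 1 GB document, and large documents carry heavy heap and indexing costs:

```
# elasticsearch.yml (assumed/illustrative value)
# Raises the maximum HTTP request body size from the 100mb default.
http.max_content_length: 200mb
```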

thanks for the helpful feedback!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.