java.lang.OutOfMemoryError: Java heap space while indexing an XML file

Hi,

I'm indexing an XML file with Elasticsearch as the output, but after approx. 2 hours I got an OOM error like this:

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid6940.hprof ...
Heap dump file created [9277681492 bytes in 117.530 secs]
[2017-08-04T23:02:13,255][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>java.lang.OutOfMemoryError: Java heap space, "backtrace"=>[]}
[2017-08-04T23:02:15,181][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. 
Error: Your application used more memory than the safety cap of 6G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

Here is my configuration file:

input {
  file {
    path => "path/drug.xml"
    type => "drugbank"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^<\?drugbank.*\>"
      negate => true
      what => "previous"
      max_bytes => "400 MiB"
      max_lines => 5000000000
    }
  }
}

filter {
  xml {
    source => "message"
    target => "xmldata"
    store_xml => false
    xpath => [ "/drugbank/drug", "drug" ]
  }

  mutate {
    remove_field => [ "message", "inxml", "xmldata" ]
  }

  split {
    field => "[drug]"
  }

  xml {
    source => "drug"
    store_xml => false
    xpath => [ "/drug/drugbank-id/text()", "Drug ID" ]
    xpath => [ "/drug/name/text()", "Drug name" ]
    xpath => [ "/drug/targets/target/polypeptide/gene-name/text()", "Gene" ]
  }

  mutate {
    replace => {
      "Drug ID" => "%{[Drug ID][0]}"
      "Drug name" => "%{[Drug name][0]}"
      "Gene" => "%{[Gene][0]}"
    }
  }

  mutate {
    remove_field => [ "drug" ]
  }
}
output {
  elasticsearch {
    codec => json
    hosts => ["10.xx.xx.xx2", "10.xx.xx.xx4"]
    index => "test_index"
  }
}
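
One alternative I have been considering (a sketch only, not tested; it assumes every <drug> element in the export starts on its own line, which I have not confirmed) is to split at the codec level, so the whole file never has to be buffered as one giant event and the first xml/split pass becomes unnecessary:

input {
  file {
    path => "path/drug.xml"
    type => "drugbank"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    # Hypothetical per-record grouping: start a new event whenever a line
    # begins a <drug> element. Lines before the first <drug> (the XML
    # declaration and the <drugbank> root tag) would end up in a junk event
    # that has to be dropped in the filter section.
    codec => multiline {
      pattern => "^\s*<drug[ >]"
      negate => true
      what => "previous"
      # auto_flush_interval (seconds) may also be needed so the final
      # record is emitted once the end of the file is reached.
      auto_flush_interval => 5
    }
  }
}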

Any suggestions much appreciated.

Thanks

How big is the file? Have you verified that the multiline codec is doing the right thing?

The file is 397 MB and contains approx. 10 million lines. Yes, I have verified that the multiline codec works fine with a small sample file with 7-8 records of about 5,000 lines.

Initially, records were getting truncated because of the codec's default settings (max_lines => 500 and max_bytes => "10 MiB"), so I raised them to 5000000000 and 400 MiB respectively; after that my config started working fine.
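
For reference, here is the codec block after that change, with the documented defaults noted in comments:

codec => multiline {
  pattern => "^<\?drugbank.*\>"
  negate => true
  what => "previous"
  max_bytes => "400 MiB"   # default is "10 MiB"
  max_lines => 5000000000  # default is 500
}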

I have an ELK cluster on 3 machines (coordinator, master, data node). I have also set LS_HEAP_SIZE=12g and ES_HEAP_SIZE=12g, but the JVM max size is 6g.
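
If it matters: my understanding (an assumption on my part, based on this being Logstash 5.x) is that LS_HEAP_SIZE is no longer read at startup in 5.x and the heap actually comes from config/jvm.options instead, which would explain why the cap is 6g despite the 12g environment variable:

# config/jvm.options (Logstash), presumably where the 6G safety cap comes from
-Xms6g
-Xmx6g   # raising this (e.g. to 12g) should raise the cap accordingly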

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.