java.lang.OutOfMemoryError: Java heap space while indexing xml file


(Aashish Chauhan) #1

Hi,

I'm indexing an XML file with Elasticsearch as the output, but after approximately 2 hours I get an OOM error like this:

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid6940.hprof ...
Heap dump file created [9277681492 bytes in 117.530 secs]
[2017-08-04T23:02:13,255][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>java.lang.OutOfMemoryError: Java heap space, "backtrace"=>[]}
[2017-08-04T23:02:15,181][ERROR][logstash.pipeline        ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. 
Error: Your application used more memory than the safety cap of 6G.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace

Here is my configuration file:

input {
  file {
    path => "path/drug.xml"
    type => "drugbank"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^<\?drugbank.*\>"
      negate => true
      what => "previous"
      max_bytes => "400 MiB"
      max_lines => 5000000000
    }
  }
}

filter {
  xml {
    source => "message"
    target => "xmldata"
    store_xml => false
    xpath => [ "/drugbank/drug", "drug" ]
  }

  mutate {
    remove_field => [ "message", "inxml", "xmldata" ]
  }

  split {
    field => "[drug]"
  }

  xml {
    source => "drug"
    store_xml => false
    xpath => [ "/drug/drugbank-id/text()", "Drug ID" ]
    xpath => [ "/drug/name/text()", "Drug name" ]
    xpath => [ "/drug/targets/target/polypeptide/gene-name/text()", "Gene" ]
  }

  mutate {
    replace => {
      "Drug ID" => "%{[Drug ID][0]}"
      "Drug name" => "%{[Drug name][0]}"
      "Gene" => "%{[Gene][0]}"
    }
  }

  mutate {
    remove_field => [ "drug" ]
  }
}

output {
  elasticsearch {
    codec => json
    hosts => ["10.xx.xx.xx2", "10.xx.xx.xx4"]
    index => "test_index"
  }
}

Any suggestions are much appreciated.

Thanks


(Magnus Bäck) #2

How big is the file? Have you verified that the multiline codec is doing the right thing?


(Aashish Chauhan) #3

The file is 397 MB and contains approximately 10 million lines. Yes, I have verified that the multiline codec works fine with a small sample file of 7-8 records spanning about 5 thousand lines.

Initially, records were being truncated because of the codec's default settings (a max_lines of 500 and a max_bytes of 10 MiB), so I raised them to 5000000000 and 400 MiB respectively, after which my config started working fine.
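For reference, overriding those defaults only needs the two settings inside the codec block; this is a sketch matching the values from my config above:

    codec => multiline {
      pattern => "^<\?drugbank.*\>"
      negate => true
      what => "previous"
      max_bytes => "400 MiB"     # default is 10 MiB; an event hitting this limit is flushed
      max_lines => 5000000000    # default is 500 lines
    }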

I have an ELK cluster on 3 machines (coordinator, master, data node). I have also set LS_HEAP_SIZE=12g and ES_HEAP_SIZE=12g, and the JVM max size is 6g.
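Note that on Logstash 5.x the LS_HEAP_SIZE variable may be ignored; the heap is usually set in config/jvm.options or via LS_JAVA_OPTS instead. A sketch of what I mean, assuming a 12 GB heap is intended (which would explain why the process was still capped at 6G):

    # config/jvm.options
    -Xms12g
    -Xmx12g

    # or via environment variable before starting Logstash
    export LS_JAVA_OPTS="-Xms12g -Xmx12g"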


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.