Parse JSON file with 100k / 1 million records?

I am trying to parse a big JSON file. Here is a sample of 5 records from the file; the original file contains 100k records:

{"dgdocs": [{"doc id": "0", "docowner": "0", "documentType": "cte1_0", "docver": "1", "ogsys": "fc", "docstat1": "smartsys", "claimId": ""}, {"doc id": "1", "docowner": "1", "documentType": "cte1_0", "docver": "1", "ogsys": "fc", "docstat1": "smartsys", "claimId": ""}, {"doc id": "2", "docowner": "2", "documentType": "cte1_0", "docver": "1", "ogsys": "fc", "docstat1": "smartsys", "claimId": ""}, {"doc id": "3", "docowner": "3", "documentType": "cte1_0", "docver": "1", "ogsys": "fc", "docstat1": "smartsys", "claimId": ""}, {"doc id": "4", "docowner": "4", "documentType": "cte1_0", "docver": "1", "ogsys": "fc", "docstat1": "smartsys", "claimId": ""}], "totalCount": 5}

Here is my config script:

input {
  file {
    path => "${LS_XML_PATH}/*.json"
    start_position => "beginning"
    sincedb_path => "NUL"
    exclude => "*.gz"
    mode => "read"

    #codec => json_lines   # tried this while commenting out the other codecs
    #codec => json         # tried this while commenting out the other codecs
    codec => multiline {   # tried this while commenting out the other codecs
      pattern => '^\{'
      #pattern => "^}"
      #pattern => "dgdocs"
      negate => true
      what => "previous"
      max_lines => 2000000
      auto_flush_interval => 3
    }
  }
}

filter {
  json {
    source => "message"
  }

  if [dgdocs] {
    split {
      id => "cte1main_split_field"
      field => "[dgdocs]"
    }
  }

  mutate { gsub => ["message", "^(.*)dgdocs: ", ""] }

  mutate {
    add_field => {
      "creationDate"    => "%{[@metadata][creationDate]}"
      "docId"           => "%{[dgdocs][docId]}"
      "docOwner"        => "%{[dgdocs][docOwner]}"
      "docType"         => "%{[dgdocs][docType]}"
      "docVersion"      => "%{[dgdocs][docVer]}"
      "ogsys"           => "%{[dgdocs][ogsys]}"
      "docStatus1"      => "%{[dgdocs][docStat1]}"
      "event1Timestamp" => "%{[@metadata][creationDate]}"
      "claimId"         => "%{[dgdocs][claimId]}"
    }
  }

  mutate {
    remove_field => ["message", "tags", "@timestamp", "host", "@version", "path", "dgdocs"]
  }
}

output {
  # for debugging purposes
  stdout { codec => rubydebug }
}

Do you have a question?

@Badger,

I have updated my script, which works up to 1,000 records, but when I try it with 5,000 records it fails with the error below:

buffer_extract: a delimiter can't be found in current chunk, maybe there are no more delimiters or the delimiter is incorrect or the text before the delimiter, a 'line', is very large, if this message is logged often try increasing the `file_chunk_size` setting. {"delimiter"=>"\n", "read_position"=>1015808, "bytes_read_count"=>12009, "last_known_file_size"=>1027817, 
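
That warning names `file_chunk_size`, which is a setting on the file input itself; a minimal sketch of raising it, with an illustrative value only (not a tuned recommendation):

input {
  file {
    path => "${LS_XML_PATH}/*.json"
    mode => "read"
    sincedb_path => "NUL"
    start_position => "beginning"
    # default is 32768 bytes; the larger chunk here is only an example
    file_chunk_size => 1048576
  }
}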

and it also gives an OOM:

 java.lang.OutOfMemoryError: Java heap space  
Dumping heap to java_pid12264.hprof ...   
Heap dump file created [2224914329 bytes in 16.233 secs ]
warning: thread "[main]>worker0" terminated with exception (report_on_exception is 
true):
java.lang.OutOfMemoryError: Java heap space    
  warning: thread "[main]<file" terminated with exception (report_on_exception is   true):
java.lang.OutOfMemoryError: Java heap space 
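
For the heap error, the Logstash heap size is configured in config/jvm.options; the sizes below are assumptions for illustration, not tuned values:

# config/jvm.options (excerpt) -- illustrative sizes only
-Xms4g
-Xmx4g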

Could you please help?

Hi @Badger,

We have modified the JSON into the format below:

{"tc":"2","dg":[
{"docId": "8956","docOwner": "R909168111","docPath": "","docType":"clm"},
{"docId": "5623","docOwner": "R909168112","docPath": "","docType": "clm"},
{"docId": "4152","docOwner": "R909168113","docPath": "","docType": "clm"}
]}

For that I am using the script below:

input {
  file {
    path => "/path/to/json"
    codec => plain
  }
}

filter {
  json {
    source => "message"
  }

  if [tc] {
    ruby {
      code => '
        tc = event.get("tc")
        logger.info("the total count:", "tc" => tc)
      '
    }
    drop {}
  }
}

output {
  stdout { codec => rubydebug }

  elasticsearch {
    hosts => ["localhost:9200"]
    index => "test-q4-2019"
  }
}

There will be a million records like this, and I see errors like these:

Error parsing json {:source=>"message", :raw=>"]}", :exception=>#<LogStash::Json::ParserError: Unexpected close marker ']': expected '}' (for root starting at [Source: (byte)"]}"; line: 1, column: 0])

:exception=>#<LogStash::Json::ParserError: Unexpected character (',' (code 44)): expected a value
at [Source: (byte)"

Please help.

If your JSON is spread across multiple lines then you will need to use a multiline codec to combine them.
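
For reference, a minimal sketch of such a codec for the reformatted file above, assuming every document starts with `{"tc"` as in the posted sample (the pattern and limits are illustrative):

input {
  file {
    path => "/path/to/json"
    mode => "read"
    sincedb_path => "NUL"
    start_position => "beginning"
    codec => multiline {
      # append every line that does not start a new document to the previous event
      pattern => '^\{"tc"'
      negate => true
      what => "previous"
      max_lines => 2000000
      auto_flush_interval => 2
    }
  }
}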

Hi @Badger,

We have solved that issue, but now I am facing a different issue, which is similar to the one in the link below.

Some fields are added to the end of the JSON, like this:

{"docId": "201","docOwner": "R909168111","docType":"delay","event1Timestamp":"2019-08-20T04:28:40.889Z", "addinfo":[{"key": "clid","value": "p111"}]
}

filter {
  mutate {
    strip => ["message"]
  }

  if [message] == ']}' {
    drop {}
  }

  mutate { gsub => ["message", ",$", ""] }

  json {
    source => "message"
  }
}

But how do I extract the nested array of inner objects and insert it as-is into Elasticsearch?

I get the output in the Logstash debug log as:

"addInfo" => [
[0] {
"key" => "clid",
"value" => "p111"
},
[1] {
"key" => "clid1",
"value" => "p222"
}
],

But in Elasticsearch the output is:

"addInfo" : [
{
"key" : "clid",
"value" : "p111"
},
{
"key" : "clid1",
"value" : "p222"
}
],

Is that fine?

Yes, rubydebug and elasticsearch have different ways of displaying arrays.


Thanks @badger
