Convert json to ndjson


#1

What's the best way to convert json to ndjson?


(Shane Connelly) #2

Can you describe a bit more about what you're trying to do?


#3

I'm trying to index a json document that's 287k lines. Need to format it with new lines so Elastic will index it...


#4

Would minifying my json file work the same way? Would that give it the new lines needed for bulk indexing?


(Shane Connelly) #5

Presumably, that 287k document is actually a bunch of smaller documents with some sort of delimiting (or structured as an array of json documents). Is that correct?


#6

I guess, yes. Here's an example snippet...

[ { title: [ 'A', [length]: 1 ],
mainTerm:
[ { title: [ 'Aarskog's syndrome', [length]: 1 ],
code: [ 'Q87.1', [length]: 1 ] },
{ title: [ 'Abandonment', [length]: 1 ],
see: [ 'Maltreatment', [length]: 1 ] },
{ title:
[ { nemod: [ '(-astasia) (hysterical)', [length]: 1 ],
_: 'Abasia' },
[length]: 1 ],
code: [ 'F44.4', [length]: 1 ] },
{ title:
[ { nemod: [ '(cystinosis)', [length]: 1 ],
_: 'Abderhalden-Kaufmann-Lignac syndrome' },
[length]: 1 ],
code: [ 'E72.04', [length]: 1 ] },
{ title: [ 'Abdomen, abdominal', [length]: 1 ],
term:
[ { title: [ 'acute', [length]: 1 ],
'$': { level: '1' },
code: [ 'R10.0', [length]: 1 ] },
{ title: [ 'angina', [length]: 1 ],
'$': { level: '1' },
code: [ 'K55.1', [length]: 1 ] },
{ title: [ 'muscle deficiency syndrome', [length]: 1 ],
'$': { level: '1' },
code: [ 'Q79.4', [length]: 1 ] },
[length]: 3 ],
seeAlso: [ 'condition', [length]: 1 ] },


(Shane Connelly) #7

If that's what the actual data looks like, the first step is going to be to get it into actual JSON. It's going to need double quotes around those field names and string values. Elasticsearch used to allow a lot of funky not-quite-JSON stuff including field names without quotes around them, but it broke stuff (e.g. https://github.com/elastic/elasticsearch/issues/9800). Also, Elasticsearch doesn't support arrays with a mixture of datatypes (https://www.elastic.co/guide/en/elasticsearch/reference/5.3/array.html), so you'll need a bit of manipulation there as well. It looks like the [length] piece is just describing the length of the array, and if that's so, it can just go away.

If you have to manipulate the data anyway, you may want to just produce an output in the bulk format when you do.
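
As a rough illustration of the kind of array manipulation mentioned above, here is a minimal Python sketch. It assumes the data has already been turned into real JSON and parsed, and that the goal is for every "title" entry to have the same shape (an object with the text under "_", as in the snippet). The function names and the choice of target shape are assumptions for illustration, not something the original data or Elasticsearch requires.

```python
import json

def normalize_title(entry):
    """Return a title entry as a dict so every title has the same shape."""
    # Plain strings become {"_": text}; dict entries pass through unchanged.
    if isinstance(entry, str):
        return {"_": entry}
    return entry

def normalize(node):
    """Recursively normalize every "title" list found in a parsed document."""
    if isinstance(node, list):
        return [normalize(item) for item in node]
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "title" and isinstance(value, list):
                out[key] = [normalize_title(normalize(v)) for v in value]
            else:
                out[key] = normalize(value)
        return out
    return node

# Hypothetical sample modelled on the snippet in this thread.
sample = {
    "title": ["Abdomen, abdominal"],
    "term": [{"title": [{"nemod": ["(cystinosis)"], "_": "Example"}],
              "code": ["E72.04"]}],
}
print(json.dumps(normalize(sample)))
```

Whether to wrap strings into objects or flatten objects into strings is a mapping decision; the point is only that each field ends up with one consistent type.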


#8

Good catches. I converted it from xml. I'm replacing the single quotes with doubles and then I'll remove the [length] as well...

I'm not following what you mean on producing output in the bulk format... That's ultimately what I'm trying to get to..


(Shane Connelly) #9

What I mean is that right now -- or soon anyway -- you're going from XML to JSON, and then you're looking to go from JSON to the bulk format. Since you have to reprocess the XML, you could just skip the JSON middleman and go straight from XML to the bulk format. Or, honestly, a few hundred thousand items is probably going to be pretty fast to index anyway so you could just post each entity directly to Elasticsearch without going through the bulk endpoint, assuming that's all you're trying to do.

But to answer your question directly, if you've got a json file with an array of objects, I've used jq (https://stedolan.github.io/jq/) before to convert the array into a bulk format. It can also give you an opportunity to manipulate the data a little bit or drop entities along the way. As an example, I recently had a file that contained an array of roads with speed limits that I wanted to convert into a bulk format. The following did the conversion, adding the index commands on every other line:

jq -c '.[] | select(.speed_limit != "0" and .speed_limit != "99")' tmp/roads.json | sed -e 's/^/{ "index" : { "_index" : "speedlimits", "_type" : "limits" } }\
/' > tmp/roads_bulk.json
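
For anyone without jq to hand, the same idea as the jq and sed pipeline above can be sketched in Python. This is a minimal sketch, not a drop-in replacement: the road records are made-up sample data, and the helper name `to_bulk_lines` is hypothetical. The index and type names mirror the command above.

```python
import json

def to_bulk_lines(docs, index, doc_type, keep=lambda doc: True):
    """Yield action/source line pairs for the Elasticsearch bulk API."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    for doc in docs:
        if keep(doc):
            yield action           # the index command line
            yield json.dumps(doc)  # the document itself, on one line

# Made-up sample data standing in for the parsed contents of roads.json.
roads = [
    {"name": "Main St", "speed_limit": "30"},
    {"name": "Unknown Rd", "speed_limit": "0"},
    {"name": "Hill Rd", "speed_limit": "99"},
    {"name": "Lake Ave", "speed_limit": "45"},
]

lines = list(to_bulk_lines(
    roads, "speedlimits", "limits",
    keep=lambda d: d.get("speed_limit") not in ("0", "99"),
))
# The bulk body is newline-delimited and must end with a newline.
bulk_body = "\n".join(lines) + "\n"
print(bulk_body, end="")
```

For a real file, `roads` would come from `json.load()` on the source file, and `bulk_body` would be written out or posted to the `_bulk` endpoint.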

#10

It would be great to skip the json middleman. How the heck do I do that?


(Shane Connelly) #11

That may depend on what you're using when you say "I converted it from xml."

If you've got a custom script/program doing the XML conversion, you could just edit that custom program to output into the bulk format. You could also just set up Logstash with a file input (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html), an XML filter (https://www.elastic.co/guide/en/logstash/current/plugins-filters-xml.html), and an Elasticsearch output (https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html) and let it deal with all the conversion/bulk for you.
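
If the conversion is a custom script, going straight from XML to the bulk format might look something like the Python sketch below. The XML layout (`<letter>` containing `<mainTerm>` elements with `<title>`, `<code>` and `<see>` children) is a guess based on the snippet earlier in this thread, and the index and type names `icd_index` and `terms` are hypothetical, so treat every tag and name here as an assumption to adjust against the real file.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML shape; adjust the tag names to match the real file.
SAMPLE_XML = """
<index>
  <letter>
    <title>A</title>
    <mainTerm><title>Abandonment</title><see>Maltreatment</see></mainTerm>
    <mainTerm><title>Abasia</title><code>F44.4</code></mainTerm>
  </letter>
</index>
"""

def main_terms_to_bulk(xml_text, index, doc_type):
    """Build a bulk request body with one document per <mainTerm> element."""
    root = ET.fromstring(xml_text.strip())
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    lines = []
    for term in root.iter("mainTerm"):
        # One flat document per <mainTerm>: child tag -> text content.
        doc = {child.tag: (child.text or "").strip() for child in term}
        lines.append(action)
        lines.append(json.dumps(doc))
    # The bulk body is newline-delimited and must end with a newline.
    return "\n".join(lines) + "\n"

bulk_body = main_terms_to_bulk(SAMPLE_XML, "icd_index", "terms")
print(bulk_body, end="")
```

This sketch flattens only the direct children of each `<mainTerm>`; nested terms and repeated child tags would need extra handling, which is part of what the Logstash XML filter route takes care of.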


#12

Fantastic! Thank you very much for this. I'll go through the documentation and install it and give it a try. It would be fantastic to be able to just load the xml into this and let it deal...

Glad I reached out...

Thanks again Shane!

