Convert json to ndjson

httpex · April 21, 2017, 9:43pm

What's the best way to convert json to ndjson?

shanec · April 21, 2017, 9:49pm

Can you describe a bit more about what you're trying to do?

httpex · April 21, 2017, 10:02pm

I'm trying to index a json document that's 287k lines. Need to format it with new lines so Elastic will index it...

httpex · April 21, 2017, 10:10pm

Would minifying my json file work the same way? Would that give it the new lines needed for bulk indexing?

shanec · April 21, 2017, 10:13pm

Presumably, that 287k document is actually a bunch of smaller documents with some sort of delimiting (or structured as an array of json documents). Is that correct?
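
On the minifying question above, here is a minimal sketch, assuming the file really is a JSON array of objects. Minifying collapses the whole array onto a single line, whereas NDJSON puts each element on its own newline-terminated line, so minifying alone would not be enough. The sample records are made up for illustration.

```python
import json

# Hypothetical sample data standing in for the real file's contents.
docs = [
    {"title": "Abandonment", "see": "Maltreatment"},
    {"title": "Abasia", "code": "F44.4"},
]

# Minified JSON: the entire array on a single line.
minified = json.dumps(docs, separators=(",", ":"))

# NDJSON: one compact JSON object per line, each terminated by a newline.
ndjson = "".join(json.dumps(doc, separators=(",", ":")) + "\n" for doc in docs)

print(minified)
print(ndjson, end="")
```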

httpex · April 21, 2017, 10:19pm

I guess, yes. Here's an example snippet...

[ { title: [ 'A', [length]: 1 ],
mainTerm:
[ { title: [ 'Aarskog's syndrome', [length]: 1 ],
code: [ 'Q87.1', [length]: 1 ] },
{ title: [ 'Abandonment', [length]: 1 ],
see: [ 'Maltreatment', [length]: 1 ] },
{ title:
[ { nemod: [ '(-astasia) (hysterical)', [length]: 1 ],
_: 'Abasia' },
[length]: 1 ],
code: [ 'F44.4', [length]: 1 ] },
{ title:
[ { nemod: [ '(cystinosis)', [length]: 1 ],
_: 'Abderhalden-Kaufmann-Lignac syndrome' },
[length]: 1 ],
code: [ 'E72.04', [length]: 1 ] },
{ title: [ 'Abdomen, abdominal', [length]: 1 ],
term:
[ { title: [ 'acute', [length]: 1 ],
'$': { level: '1' },
code: [ 'R10.0', [length]: 1 ] },
{ title: [ 'angina', [length]: 1 ],
'$': { level: '1' },
code: [ 'K55.1', [length]: 1 ] },
{ title: [ 'muscle deficiency syndrome', [length]: 1 ],
'$': { level: '1' },
code: [ 'Q79.4', [length]: 1 ] },
[length]: 3 ],
seeAlso: [ 'condition', [length]: 1 ] },

shanec · April 21, 2017, 10:37pm

If that's what the actual data looks like, the first step is going to be to get it into actual JSON. It's going to need double quotes around those field names and string values. Elasticsearch used to allow a lot of funky not-quite-JSON stuff including field names without quotes around them, but it broke stuff (e.g. https://github.com/elastic/elasticsearch/issues/9800). Also, Elasticsearch doesn't support arrays with a mixture of datatypes (https://www.elastic.co/guide/en/elasticsearch/reference/5.3/array.html), so you'll need a bit of manipulation there as well. It looks like the [length] piece is just describing the length of the array, and if that's so, it can just go away.

If you have to manipulate the data anyway, you may want to just produce an output in the bulk format when you do.

httpex · April 21, 2017, 10:41pm

Good catches. I converted it from xml. I'm replacing the single quotes with doubles and then I'll remove the [length] as well...

I'm not following what you mean on producing output in the bulk format... That's ultimately what I'm trying to get to..

shanec · April 21, 2017, 11:01pm

What I mean is that right now -- or soon anyway -- you're going from XML to JSON, and then you're looking to go from JSON to the bulk format. Since you have to reprocess the XML, you could just skip the JSON middleman and go straight from XML to the bulk format. Or, honestly, a few hundred thousand items is probably going to be pretty fast to index anyway so you could just post each entity directly to Elasticsearch without going through the bulk endpoint, assuming that's all you're trying to do.

But to answer your question directly, if you've got a json file with an array of objects, I've used jq (https://stedolan.github.io/jq/) before to convert the array into a bulk format. It can also give you an opportunity to manipulate the data a little bit or drop entities along the way. As an example, I recently had a file that contained an array of roads with speed limits that I wanted to convert into a bulk format. The following did the conversion, adding the index command on every other line:

jq -c '.[] | select(.speed_limit != "0" and .speed_limit != "99")' tmp/roads.json | sed -e 's/^/{ "index" : { "_index" : "speedlimits", "_type" : "limits" } }\
/' > tmp/roads_bulk.json
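
For anyone without jq at hand, the same idea can be sketched in Python. This is an illustrative equivalent under stated assumptions, not the command's exact behavior: the function name `to_bulk` and the sample road records are hypothetical, and the `_type` field mirrors the action line above (newer Elasticsearch versions dropped types, so it is optional here).

```python
import json

def to_bulk(docs, index_name, type_name=None):
    """Return bulk-format text: an action line before each document line."""
    action = {"index": {"_index": index_name}}
    if type_name is not None:
        # Mirrors the "_type" in the command above; omit it on newer versions.
        action["index"]["_type"] = type_name
    lines = []
    for doc in docs:
        # Same filter as the jq select(): skip speed limits of "0" and "99".
        if doc.get("speed_limit") in ("0", "99"):
            continue
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # Every line, including the last, must end with a newline.
    return "".join(line + "\n" for line in lines)

# Made-up records standing in for tmp/roads.json.
roads = [
    {"road": "Main St", "speed_limit": "35"},
    {"road": "Unknown", "speed_limit": "0"},
    {"road": "Hwy 1", "speed_limit": "65"},
]
print(to_bulk(roads, "speedlimits", "limits"), end="")
```

To work from a file instead, load the array with `json.load` and write the returned string to the output file.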

httpex · April 21, 2017, 11:11pm

It would be great to skip the json middleman. How the heck do I do that?

shanec · April 21, 2017, 11:22pm

That may depend on what you're using when you say "I converted it from xml."

If you've got a custom script/program doing the XML conversion, you could just edit that custom program to output into the bulk format. You could also just set up Logstash with a file input, an XML filter, and an Elasticsearch output (see the File input plugin, Xml filter plugin, and Elasticsearch output plugin pages in the Logstash Reference) and let it deal with all the conversion/bulk for you.
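
As a rough illustration of the first option (a custom script that goes straight from XML to the bulk format), here is a minimal Python sketch. The XML layout is a guess based on the snippet earlier in the thread, and the function name `xml_to_bulk` and the index name "terms" are hypothetical. It only flattens simple child elements; nested terms and attributes would need extra handling.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML layout, guessed from the snippet earlier in the thread.
SAMPLE_XML = """
<letter>
  <title>A</title>
  <mainTerm><title>Aarskog's syndrome</title><code>Q87.1</code></mainTerm>
  <mainTerm><title>Abandonment</title><see>Maltreatment</see></mainTerm>
</letter>
"""

def xml_to_bulk(xml_text, index_name):
    """Emit one bulk action line plus one flat document per <mainTerm>."""
    root = ET.fromstring(xml_text.strip())
    action = json.dumps({"index": {"_index": index_name}})
    out = []
    for term in root.iter("mainTerm"):
        # Keep only leaf children (no nested elements) as string fields.
        doc = {
            child.tag: (child.text or "").strip()
            for child in term
            if len(child) == 0
        }
        out.append(action)
        out.append(json.dumps(doc))
    return "".join(line + "\n" for line in out)

print(xml_to_bulk(SAMPLE_XML, "terms"), end="")
```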

httpex · April 22, 2017, 1:28am

Fantastic! Thank you very much for this. I'll go through the documentation and install it and give it a try. It would be fantastic to be able to just load the xml into this and let it deal...

Glad I reached out...

Thanks again Shane!

