Dump JSON directly into Elasticsearch

Hi,

I need to load a large JSON file with over 700K records directly into ES. However, the JSON I have is a pretty-printed array, as shown below:

STRING 1:

[
  {
    "recordId": 1,
    "FirstName": "Julie",
    "LastName": "Klier",
    "FullName": null,
    "email": "asdasd@gmail.com",
    "name": "sd fghgfh",
    "company": null,
    "leadsource": "MOT",
    "status": "Qualified",
    "site_id__c": "155731983",
    "createdDate": "2019-08-13T20:41:34",
    "sf_lead": [],
    "sf_contact": [
      {
        "id": "0031I00000tjNCxQAM",
        "accountid": null,
        "lastname": null,
        "firstname": null,
        "name": "Julie Klier",
        "email": "sadsad@gmail.com",
        "leadsource": "MOT",
        "createddate": "2019-08-13T20:41:34Z",
        "site_id__c": "155731983",
        "asi__c": null
      }
    ],
    "sf_account": [],
    "sf_oppurtunity": [
      {
        "id": "0061I00000J4DZWQA3",
        "contactid__c": null
      }
    ],
    "sf_campaign": [
      {
        "id": "7013u000000mlLyAAI"
      }
    ],
    "asidta_customer": [],
    "asidta_ordrhist": [],
    "asidta02_customer": [],
    "asidta02_ordrhist": [],
    "asidta04_customer": [],
    "asidta04_ordrhist": [],
    "asidta05_customer": [],
    "asidta05_ordrhist": []
  }
]

Now I want to convert this into the format below:

STRING 2:

      {"index":{"_index":"allreportingdata","_id":1}}
  {
    "recordId": 1,
    "FirstName": "Julie",
    "LastName": "Klier",
    "FullName": null,
    "email": "asdasd@gmail.com",
    "name": "sd fghgfh",
    "company": null,
    "leadsource": "MOT",
    "status": "Qualified",
    "site_id__c": "155731983",
    "createdDate": "2019-08-13T20:41:34",
    "sf_lead": [],
    "sf_contact": [
      {
        "id": "0031I00000tjNCxQAM",
        "accountid": null,
        "lastname": null,
        "firstname": null,
        "name": "Julie Klier",
        "email": "sadsad@gmail.com",
        "leadsource": "MOT",
        "createddate": "2019-08-13T20:41:34Z",
        "site_id__c": "155731983",
        "asi__c": null
      }
    ],
    "sf_account": [ ],
    "sf_oppurtunity": [
      {
        "id": "0061I00000J4DZWQA3",
        "contactid__c": null
      }
    ],
    "sf_campaign": [
      {
        "id": "7013u000000mlLyAAI"
      }
    ],
    "asidta_customer": [],
    "asidta_ordrhist": [],
    "asidta02_customer": [],
    "asidta02_ordrhist": [],
    "asidta04_customer": [],
    "asidta04_ordrhist": [],
    "asidta05_customer": [],
    "asidta05_ordrhist": []
  }

Can someone please help me achieve this? Also, what is the format in STRING 2 called?

The bulk format does not support pretty printing. Each document needs to be on a single line.

OK, so is there a way to convert this pretty-printed format into single-line format using some tool? Also, what is this single-line JSON called?

I would recommend creating a script to reformat this, e.g. using Python. Also note that the bulk API is designed for sending multiple documents per request. Each request should, however, generally be no larger than 5MB, so your data set would need to be split across multiple bulk requests.
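A minimal sketch of such a script, assuming the whole array fits in memory and that recordId should become the document _id (as in STRING 2). The function names, the output file prefix, and the exact byte limit are illustrative choices, not anything prescribed by Elasticsearch:

```python
import json


def to_bulk_lines(records, index_name, id_field="recordId"):
    """Yield one compact action line and one compact source line per record."""
    for record in records:
        action = {"index": {"_index": index_name, "_id": record[id_field]}}
        yield json.dumps(action, separators=(",", ":"))
        yield json.dumps(record, separators=(",", ":"))


def chunk_bulk_lines(lines, max_bytes=5 * 1024 * 1024):
    """Group action/source pairs into request bodies of roughly max_bytes.

    A single pair larger than max_bytes still gets a body of its own.
    """
    body, size = [], 0
    lines = iter(lines)
    for action in lines:
        source = next(lines)  # lines always arrive in action/source pairs
        pair = action + "\n" + source + "\n"
        pair_size = len(pair.encode("utf-8"))
        if body and size + pair_size > max_bytes:
            yield "".join(body)
            body, size = [], 0
        body.append(pair)
        size += pair_size
    if body:
        yield "".join(body)


def write_bulk_files(src_path, index_name, out_prefix="bulk_"):
    """Read a JSON array from src_path and write numbered bulk files (hypothetical naming)."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)  # assumes the file fits in memory
    paths = []
    for n, body in enumerate(chunk_bulk_lines(to_bulk_lines(records, index_name))):
        path = f"{out_prefix}{n:04d}.ndjson"
        with open(path, "w", encoding="utf-8") as out:
            out.write(body)
        paths.append(path)
    return paths
```

Calling write_bulk_files("example.json", "allreportingdata") would then produce files such as bulk_0000.ndjson, each of which can be sent to the _bulk endpoint as a separate request.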

There are a lot of different tools that you can use for this. Writing a script in Python as @Christian_Dahlqvist suggests is one way; using something like jq is another, e.g.

cat example.json | jq -c '.[]'

This format is newline-delimited JSON (NDJSON): each line is an individual JSON document, and lines are separated by a newline (\n; \r\n is also accepted).
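To check a converted file before sending it, something along these lines could help. It is only a sketch: check_bulk_ndjson is a made-up helper name, and it assumes the body contains only index actions, so every action line is followed by exactly one source line:

```python
import json


def check_bulk_ndjson(text):
    """Return the number of action/source pairs, or raise ValueError."""
    if not text.endswith("\n"):
        raise ValueError("bulk body must end with a newline")
    # Split on \n only; a trailing \r on a line is plain JSON whitespace.
    lines = text.split("\n")[:-1]
    if len(lines) % 2 != 0:
        raise ValueError("expected action/source line pairs")
    for number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            # A pretty-printed document fails here, because a fragment
            # such as "{" is not a complete JSON document on its own.
            raise ValueError(f"line {number} is not a complete JSON document") from exc
    return len(lines) // 2
```

A pretty-printed document like the one in STRING 2 would be rejected, since its individual lines do not parse as JSON by themselves.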
