How to bulk load huge JSON docs from files using Java client?

Hi.
I need your help / advice / opinion:

TL;DR: How do I read a file from disk for bulk loading using IndexRequest.source(BytesReference)?

And the detailed question:

The context:
I am indexing text in a public dataset, such as: http://data.cityofnewyork.us/resource/nc67-uf89
The dataset includes some simple metadata fields (such as the dataset name and description) and a large CSV table with the data itself.
I am indexing the text from that big CSV table.
The CSV table may be a few GB in size.

My index mapping:
I had two options:
(1) store each CSV row as a separate document,
or
(2) create a single document per dataset, containing the title, the description, and all the text from the CSV as one very large array-of-text field.

I chose option (2).

My question:
During bulk loading, each of my JSON documents is huge. Each JSON line in an NDJSON file may be longer than 1GB!

I use the Java high-level REST client.
I use the method: BulkProcessor.add(IndexRequest)

Usually, IndexRequest.source() takes an in-memory argument such as a String, a Map or a byte[].

However, since my source document is huge (possibly bigger than 1GB), I would prefer to stream it rather than hold it entirely in the Java heap.
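For reference, this is roughly what I do today (a simplified sketch; the index name "datasets" and the helper that builds the JSON string are placeholders, error handling is omitted, and the index-name-only IndexRequest constructor assumes a recent 7.x client):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class DatasetIndexer {

    // "datasets" is just a placeholder index name for this sketch.
    private static final String INDEX = "datasets";

    public void indexDataset(RestHighLevelClient client, String hugeJsonDoc) {
        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                new BulkProcessor.Listener() {
                    @Override
                    public void beforeBulk(long executionId, BulkRequest request) { }

                    @Override
                    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) { }

                    @Override
                    public void afterBulk(long executionId, BulkRequest request, Throwable failure) { }
                })
                .build();

        // The whole JSON document lives in heap memory here, which becomes a problem
        // once a single document approaches 1GB.
        bulkProcessor.add(new IndexRequest(INDEX).source(hugeJsonDoc, XContentType.JSON));
        bulkProcessor.close();
    }
}
```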

The underlying Apache HTTP client of course supports streaming file uploads (a standard HTTP upload from a file with low memory consumption).

Now, I wonder how to use Elasticsearch Java client API in order to stream the contents of a file.

I see that IndexRequest.source() can take an argument of type BytesReference, but I didn't find any reference implementation or example of a BytesReference that reads from a file on disk.
Implementing one requires solid Java NIO skills, careful debugging and testing of many edge cases, so I would rather find an existing solution than develop my own.
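To show the call shape I mean, here is the naive, non-streaming variant I could write today (it still pulls the whole file into the heap, which is exactly what I want to avoid; the path and the "datasets" index name are placeholders):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.bytes.BytesArray;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.xcontent.XContentType;

public class FileBackedSource {

    public IndexRequest requestFromFile(Path jsonFile) throws IOException {
        // This reads the whole file into a byte[] on the heap, so it is NOT the
        // streaming behaviour I am after -- it only shows the call shape that
        // IndexRequest.source(BytesReference, XContentType) expects.
        byte[] bytes = Files.readAllBytes(jsonFile);
        BytesReference source = new BytesArray(bytes);
        return new IndexRequest("datasets").source(source, XContentType.JSON);
    }
}
```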

So do you have an idea how to read files from disk using BytesReference?

Thanks.

Elasticsearch is not a BLOB store, so option (2) does not, in my opinion, make sense. I would therefore recommend switching to option (1).

Hi Christian.

Storing the CSV text in a field is not meant as BLOB storage but as a way to search for values inside the CSV. It actually works, but I am not sure it will scale well (ES has a 2GB document limit).

With option (1) I would need nested docs or a parent-child relationship, which I prefer to avoid.

ES has a default limit of around 100MB per request/document and around 2 billion documents per shard. Elasticsearch was never designed to handle documents as large as the ones you are describing. If you want to index these as individual documents, I suspect you have chosen the wrong technology. Splitting the data up is, in my mind, the right way to go if you want to use Elasticsearch.

How do you want to use the data once it is indexed? Why would you need parent-child and/or nested docs?

Please take a look at the URL I have posted in the initial post.

It describes a "dataset" with some meta-data, and it has a "table" associated with it. I need to search inside both: inside dataset description text and inside CSV text, and to bring back both of them as a single "entity" - the dataset. For example, if you search for the term "violation" you will find it in the dataset description, and if you search for "double parking" you will find it inside the CSV data.

Now, if I search for: "violation double parking" I want this dataset to appear in the results with a high rank, because all terms appear in the document.
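To make that concrete, this is roughly the kind of query I have in mind against the option (2) index (the field names "description" and "csv_text" and the index name "datasets" are just placeholders from my mapping):

```java
import java.io.IOException;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class DatasetSearch {

    // Search the dataset description and the indexed CSV text together, so a
    // dataset containing all the terms ("violation", "double parking") ranks high.
    public SearchResponse search(RestHighLevelClient client, String terms) throws IOException {
        SearchSourceBuilder source = new SearchSourceBuilder()
                .query(QueryBuilders.multiMatchQuery(terms, "description", "csv_text"));
        SearchRequest request = new SearchRequest("datasets").source(source);
        return client.search(request, RequestOptions.DEFAULT);
    }
}
```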

If I use option (1), I have to search in one "metadata" document and in many "CSV row" documents, and then somehow aggregate all of those hits back into a single "entity" / "dataset", because in the results I want this dataset, "Open Parking and Camera Violations", to come back as a single hit.

Did you get my point?

Why not store a (possibly compressed) version of the data set somewhere (S3, file system, blob store) and then index the individual rows together with the description, location and metadata in Elasticsearch? You can then use Elasticsearch to identify the id and location of the full data set and retrieve it from where it is stored. This saves you from having to store and retrieve huge documents in Elasticsearch, which is something it was never designed or optimised for.

So if I understand you correctly, your idea is that when I have a one-to-many relationship (such as an invoice and its invoice detail lines, for example), I should repeat the "one" entity's fields on each and every one of the "many" entities, in a "de-normalized" form?

For example, with the invoice case, let's say we have an invoice with common fields like "date", "invoice owner", "subject", "billing address", "shipping address", etc. Then we have a few detail lines with fields like "item", "sku", "unit price", "quantity", "total", etc.
So you suggest that I use a single ES document for each "detail line" and add the common invoice fields ("invoice owner", "subject", etc.) to each line.

This way if I search for terms in "item name" and "billing address" fields, ES will find them together in a single doc and will set the relevance score accordingly.
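In code, I picture each detail-line document roughly like this (field names are taken from the invoice example above; the index name "invoices" is made up):

```java
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.action.index.IndexRequest;

public class InvoiceLineDocs {

    // One ES document per invoice detail line, with the common invoice fields
    // repeated ("de-normalized") on every line. "invoices" is a placeholder index.
    public IndexRequest lineToRequest(Map<String, Object> invoiceHeader, Map<String, Object> line) {
        Map<String, Object> doc = new HashMap<>();

        // Common invoice fields, repeated on every line document.
        doc.put("date", invoiceHeader.get("date"));
        doc.put("invoice_owner", invoiceHeader.get("invoice_owner"));
        doc.put("subject", invoiceHeader.get("subject"));
        doc.put("billing_address", invoiceHeader.get("billing_address"));
        doc.put("shipping_address", invoiceHeader.get("shipping_address"));

        // Per-line fields.
        doc.put("item", line.get("item"));
        doc.put("sku", line.get("sku"));
        doc.put("unit_price", line.get("unit_price"));
        doc.put("quantity", line.get("quantity"));
        doc.put("total", line.get("total"));

        return new IndexRequest("invoices").source(doc);
    }
}
```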

Is that what you meant?

If so, there is a lot of repetition of the common fields.
Does ES detect this duplication and make some optimization to save the redundant space?

Yes, something like that.

Indexing lots of documents with a common structure and the same fields is very common and optimised for.

OK thanks Christian, I will check it out.
