Hi.
I need your help / advice / opinion:
TL;DR: How do I read a file from disk for bulk loading using IndexRequest.source(BytesReference)?
And the detailed question:
The context:
I am indexing text from a public dataset, such as: http://data.cityofnewyork.us/resource/nc67-uf89
The dataset includes some simple metadata fields (such as the dataset name and description) and a large CSV table of the data itself.
I am indexing the text from that big CSV table.
The CSV table may be a few GB in size.
My index mapping:
I had 2 options:
(1) index each CSV row as its own document,
or
(2) create a single document per dataset, containing the title, the description, and all the text from the CSV as one very large array-of-text field.
I chose option (2).
My question:
During bulk loading, each of my JSON documents is huge. Each JSON line in an NDJSON file may be longer than 1GB!
I use the Java high-level REST client.
I use the method BulkProcessor.add(IndexRequest).
Usually, IndexRequest.source() takes an in-memory argument, such as a String, a Map, or a byte[].
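To make that concrete, here is roughly what my current (fully in-memory) code looks like; the index and field names are made up for illustration, following option (2) above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.index.IndexRequest;

public class InMemoryIndexing {

    // Hypothetical index/field names, following option (2): one document per dataset.
    static void addDataset(BulkProcessor bulkProcessor, String datasetId,
                           String title, String description, List<String> csvText) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", title);
        doc.put("description", description);
        doc.put("rows", csvText); // all the CSV text as one huge array field

        // The entire document is materialized on the Java heap before it is added.
        bulkProcessor.add(new IndexRequest("datasets").id(datasetId).source(doc));
    }
}
```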
However, since my source document is huge (possibly bigger than 1GB), I would prefer to stream it rather than hold it entirely in Java heap memory.
The underlying Apache HTTP client supports streaming file uploads, of course (a fairly standard HTTP upload from a file, with low memory consumption).
Now I wonder how to use the Elasticsearch Java client API to stream the contents of a file.
I see that IndexRequest.source() can take an argument of type BytesReference, but I couldn't find any reference implementation or example of a BytesReference that reads from a file on disk.
Implementing one myself would require solid Java NIO skills, careful debugging, and testing of many edge cases, so I would rather find an existing solution than develop my own.
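For illustration, the kind of thing I am hoping for is something like this minimal sketch. I am assuming here that ByteBufferReference (an org.elasticsearch.common.bytes wrapper around a java.nio.ByteBuffer) is public and usable from client code in my version, and that a memory-mapped file keeps the bytes off the Java heap; I have not verified either assumption:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.common.bytes.ByteBufferReference;
import org.elasticsearch.common.bytes.BytesReference;
import org.elasticsearch.common.xcontent.XContentType;

public class FileBackedSource {

    // Build an IndexRequest whose source is backed by a memory-mapped file, so the
    // OS pages the bytes in on demand instead of copying them onto the Java heap.
    static IndexRequest fromFile(String index, String id, Path jsonFile) throws IOException {
        try (FileChannel channel = FileChannel.open(jsonFile, StandardOpenOption.READ)) {
            // Note: a MappedByteBuffer (and BytesReference.length(), which returns an int)
            // caps a single source at ~2GB. The mapping stays valid after the channel closes.
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Assumption: ByteBufferReference is accessible here; I couldn't confirm this.
            BytesReference source = new ByteBufferReference(buffer);
            return new IndexRequest(index).id(id).source(source, XContentType.JSON);
        }
    }
}
```

Even if something like that compiles, I am not sure the rest of the pipeline (BulkProcessor accumulation, request serialization) would actually stream the bytes rather than copy them, which is really the heart of my question.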
So, do you have any idea how to read a file from disk through a BytesReference?
Thanks.