Data from an external source


#1

Hello,

I'm still in the early stages of discovering Elasticsearch, so I'll surely make some bogus assertions or assumptions here. Please do not hesitate to correct me.

What I understand so far is that data is fed into Elasticsearch via an HTTP(S) PUT request, and that I can then use ES and Kibana to run searches and display the results in very nice graphs.
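
To make that concrete, here is a minimal sketch of what indexing a single document over HTTP could look like, using only the Python standard library. The host `http://localhost:9200`, the index name `measurements`, and the document fields are assumptions made up for illustration; the request is built but not sent.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_id, document):
    """Build (but do not send) a PUT request that indexes one JSON document."""
    url = f"{base_url}/{index}/_doc/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical host, index name, and document, for illustration only.
req = build_index_request(
    "http://localhost:9200", "measurements", 1, {"sensor": "a1", "value": 3.5}
)
# urllib.request.urlopen(req) would send it to a running cluster.
```

One request per document is the simplest form, which is why it gets slow for a large table.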

I have a dataset here which takes quite a lot of disk space. It is saved in a custom format that is basically a huge table of values. Each column has associated metadata giving its name, its data type, and so on.
The original code reading and writing this format was written in Delphi, but it has been ported to Java for use in a Scala package that presents it as an RDD so that Apache Spark can use it as a data source.

I could import this dataset into ES via bulk import or lots of PUT requests, but I suspect it would take quite a lot of time and use a lot of disk space. And that import would have to be repeated each time the source dataset changes.
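
For reference, the bulk route could be sketched roughly as follows. The `_bulk` endpoint takes newline-delimited JSON: one action line followed by one document line per row, with a trailing newline. The function name `bulk_bodies`, the index name, and the sample rows are hypothetical; the rows would really come from the custom reader.

```python
import json


def bulk_bodies(rows, index, chunk_size=1000):
    """Yield newline-delimited JSON bodies, each holding up to chunk_size rows."""
    lines = []
    for row in rows:
        # Action line, then the document itself on the next line.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(row))
        if len(lines) >= 2 * chunk_size:
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"


# Made-up rows standing in for the real table; a generator keeps memory flat.
rows = ({"id": i, "value": i * 0.5} for i in range(5))
bodies = list(bulk_bodies(rows, "measurements", chunk_size=2))
```

Each body would then be POSTed to `/_bulk` with the `application/x-ndjson` content type. Chunking avoids one request per row, but the whole dataset still gets copied into ES.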

I was thus wondering if I could write some sort of a plugin that would allow ES to directly read that data instead of importing it.

Is this possible? Are there any drawbacks with that approach?

Thanks for any pointers on that subject.


(Mark Walkom) #2

Welcome!

It might be technically possible to write a complete new storage driver that integrates into Elasticsearch to read your file, but possible doesn't mean it's feasible.

I'd look at indexing the data into Elasticsearch some other way :slight_smile:


#3

Thanks for the suggestion, but I was under the impression that this would essentially be a complete import of the data table. Did I miss something in how ES works?

For instance, let's consider my data source is a CSV file. If I import it into ES, I basically create a copy of the CSV file in another format so that ES can work with it.
If I were to simply "index" it, what would happen?
I have a bit of a hard time distinguishing the two things.


(Mark Walkom) #4

The process of putting data into Elasticsearch is called indexing. It does mean putting a copy of the data in Elasticsearch.


#5

Ok, thanks, that makes things clearer.

So, if I read you correctly, I would have to find a way to send my data into ES in a more efficient way than doing this:

binary -> json -> json parsing by ES -> storage by ES

To me, this is clearly not efficient because of the "Binary to JSON" conversion, but is there any documentation detailing what the ES storage format is, and if there is a way to directly write to it?
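
To illustrate the "binary -> json" step I mean, here is a minimal sketch. The record layout (a little-endian 32-bit integer followed by a 64-bit float) and the column names are purely hypothetical stand-ins; my real format is different and takes its layout from the column metadata.

```python
import json
import struct

# Hypothetical fixed-width record: int32 id, float64 value, little-endian.
RECORD = struct.Struct("<id")
COLUMNS = ("id", "value")


def records_to_json_lines(buffer):
    """Decode packed binary records and yield one JSON document per record."""
    for fields in RECORD.iter_unpack(buffer):
        yield json.dumps(dict(zip(COLUMNS, fields)))


# Two made-up records packed back to back, then converted.
sample = RECORD.pack(1, 2.5) + RECORD.pack(2, 4.0)
docs = list(records_to_json_lines(sample))
```

Every value is decoded from binary, written out as text, and then parsed again on the ES side, which is the round trip I would like to avoid.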


(Lemon Soft) #6

In my case, I already had a database in PostgreSQL. I used ABC from GitHub, which imported my database from PostgreSQL into Elasticsearch. Now I have a simple and fast search engine (~780 GB).


(Mark Walkom) #7

I dunno if you will find one. Is it going to be hard to change that binary format into JSON?
I ask because you're considering writing a new storage layer, or trying to create index files directly without Elasticsearch. Those are pretty inefficient themselves and it seems like you're trying to reinvent the wheel a little.

There's no way to do this that we support. What I mean by that is that I have seen people do it, but it's not guaranteed to work and we won't help you to do it.


#8

Hard, well, no, it's quite trivial. But it's going to be highly inefficient.
When I first tried Apache Spark, I was told to convert this kind of data to CSV, then import it as an RDD and work with that.
Clearly, this took ages, was memory intensive, and was not really user friendly.
But there is an API in Apache Spark that allows one to write one's own RDD reader with a bit of Java and Scala code. It's not trivial and the documentation is not that great, but at least it's there to use.
And once I had created this new reader, it worked smoothly on the Spark cluster.

So, basically, I am looking for an API that would allow me to do the same for ES, because I have a strong feeling that generating and parsing JSON is a source of delays and errors that ought to be avoided if possible.