Data from an external source

obones · April 16, 2019, 9:23am

Hello,

I'm still in the early stages of discovering Elasticsearch, so I'll surely make some bogus assertions or assumptions here; please do not hesitate to correct me.

What I understand so far is that data is fed into Elasticsearch via an HTTP(S) PUT request and then I can use ES and Kibana to do some searches and display the results in very nice graphs.

I have a dataset here which takes quite a lot of disk space. It is saved in a custom format that is basically a huge table of values. Each column has associated metadata giving its name, its datatype, and so on.
The original code reading/writing this was written in Delphi but has been ported to Java for use in a Scala package that presents it as an RDD for Apache Spark to use as a data source.

I could import this dataset into ES via bulk import or lots of PUT requests, but I suspect it would take quite a lot of time and use a lot of disk space. And that import would have to be repeated each time the source dataset gets changed.
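For reference, the bulk route mentioned above boils down to sending newline-delimited JSON to the `_bulk` endpoint: one action line followed by one document line per row. Here is a minimal sketch in Python, assuming a hypothetical index name `measurements` and made-up column names; the helper `build_bulk_body` is purely illustrative and not part of any Elasticsearch client.

```python
import json

def build_bulk_body(index_name, columns, rows):
    """Build an NDJSON payload for the Elasticsearch _bulk endpoint.

    columns: list of column names (hypothetical metadata from the source table)
    rows:    iterable of tuples, one value per column
    """
    lines = []
    for row in rows:
        # Action line: asks Elasticsearch to index the next line as a document.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the row itself, keyed by column name.
        lines.append(json.dumps(dict(zip(columns, row))))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

# Two made-up rows, just to show the shape of the payload.
body = build_bulk_body("measurements", ["sensor", "value"], [("a", 1.5), ("b", 2.0)])
```

The resulting string would be POSTed to `/_bulk` with the `Content-Type: application/x-ndjson` header. Older Elasticsearch versions may also expect a `_type` in the action line, so treat the exact metadata as an assumption to check against the version in use.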

I was thus wondering if I could write some sort of a plugin that would allow ES to directly read that data instead of importing it.

Is this possible? Are there any drawbacks with that approach?

Thanks for any pointers on that subject.

warkolm · April 16, 2019, 9:26am

Welcome!

It might be technically possible to write a completely new storage driver that integrates into Elasticsearch to read your file, but possible doesn't mean it's feasible.

I'd look at indexing the data into Elasticsearch some other way.

obones · April 16, 2019, 12:16pm

Thanks for the suggestion, but I was under the impression that this would essentially be a complete import of the datatable. Did I miss something in how ES works?

For instance, let's consider my data source is a CSV file. If I import it into ES, I basically create a copy of the CSV file in another format so that ES can work with it.
If I were to simply "index" it, what would happen?
I have a bit of a hard time distinguishing the two things.

warkolm · April 16, 2019, 8:15pm

The process of putting data into Elasticsearch is called indexing. It does mean putting a copy of the data in Elasticsearch.

obones · April 17, 2019, 1:04pm

Ok, thanks, that makes things clearer.

So, if I read you correctly, I would have to find a way to send my data into ES in a more efficient way than doing this:

binary -> json -> json parsing by ES -> storage by ES

To me, this is clearly not efficient because of the "Binary to JSON" conversion, but is there any documentation detailing what the ES storage format is, and if there is a way to directly write to it?
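As a side note on the "binary -> json" step in the chain above: it does not have to materialise the whole table, since records can be decoded and serialised one at a time. Here is a rough sketch in Python, assuming a purely hypothetical fixed-width layout (a little-endian 32-bit integer id followed by a 64-bit float); the custom format described earlier in the thread will certainly differ, and the names `RECORD` and `records_to_json_lines` are invented for illustration.

```python
import json
import struct

# Hypothetical record layout: int32 id + float64 value, little-endian, no padding.
RECORD = struct.Struct("<id")

def records_to_json_lines(raw, names=("id", "value")):
    """Yield one JSON document per fixed-width binary record in `raw`."""
    for fields in RECORD.iter_unpack(raw):
        # Pair each decoded field with its column name and serialise the row.
        yield json.dumps(dict(zip(names, fields)))

# Two made-up records packed back to back, standing in for the real file.
raw = RECORD.pack(1, 0.5) + RECORD.pack(2, 1.25)
docs = list(records_to_json_lines(raw))
```

Because the function is a generator, each document can be handed to a bulk request as soon as it is produced, so the conversion costs CPU time but little extra memory.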

lemon_soft · April 17, 2019, 7:21pm

In my case, I already had a database in PostgreSQL. I then used ABC from GitHub, which imported my database from PostgreSQL into Elasticsearch. Now I have a simple and fast search engine. (~780 GB)

warkolm · April 17, 2019, 8:39pm

I dunno if you will find one. Is it going to be hard to change that binary format into json?
I ask because you're considering writing a new storage layer, or trying to create index files directly without Elasticsearch. Those are pretty inefficient themselves and it seems like you're trying to reinvent the wheel a little.

There's no way to do this that we support. What I mean by that is that I have seen people do it, but it's not guaranteed to work and we won't help you to do it.

obones · April 18, 2019, 10:01am

Hard, well, no, it's quite trivial. But it's going to be highly inefficient.
When I first tried Apache Spark, I was told to convert this kind of data to CSV, then import it as an RDD and work with that.
Clearly, this took ages, was memory intensive, and not really user friendly.
But there is an API in Apache Spark that allows one to write one's own RDD reader, with a bit of Java and Scala code. It's not trivial, the documentation is not that great, but at least it's there to use.
And once I had created this new reader, it worked smoothly on the Spark cluster.

So, basically, I am looking for an API that would allow me to do the same for ES, because I have a strong feeling that generating and parsing JSON is a source of delays and errors that ought to be avoided if possible.

system · May 16, 2019, 10:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
