Why no native bulk CSV import instead of JSON? It would make sense, am I right?

Hi all! I've been wondering for some time now why Elasticsearch doesn't allow native CSV import. I don't mean converting the CSV to JSON and then using the Bulk API, but uploading a CSV to the server and letting the server do the importing. Or maybe it does and I just can't find it.

Here are some of the benefits this would bring:

  • Direct import from DB servers that can generate CSV dumps (psql, mysql) easily and natively, so less conversion and more interoperability with legacy systems.
  • Less time spent converting the CSV to valid JSON on the client side, saving CPU cycles and trees.
  • Less data to transfer from the client to the server, saving bandwidth, sanity and trees. JSON is quite overkill for mostly tabular data, with its repetition of property names on every row.

The downside would be:

  • Nested documents are a bit awkward in CSV, but people who need those can still use the plain Bulk API.

For me personally, it seems like a good addition to Elasticsearch that would cover 80% of the use cases.

Would love to know what the reason is!

Hello @emilebosch

While we do not have a native endpoint on the Elasticsearch side to import CSV files, in Kibana it is possible to upload CSV files. See more in this blog post.
Another blog post making use of the CSV import is available here.

Another approach is to use an Ingest pipeline with the csv processor (see documentation).
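To make it concrete, here is a minimal sketch of that approach (the pipeline name, index and column names below are just placeholders for the example):

```
PUT _ingest/pipeline/parse_csv_line
{
  "description": "Split a raw CSV line into separate fields",
  "processors": [
    {
      "csv": {
        "field": "message",
        "target_fields": ["first_name", "last_name", "city"],
        "separator": ","
      }
    }
  ]
}

POST my-index/_bulk?pipeline=parse_csv_line
{ "index": {} }
{ "message": "John,Doe,Paris" }
{ "index": {} }
{ "message": "Jane,Smith,Berlin" }
```

Each raw CSV line is sent as the `message` field and the processor splits it into the target fields at index time.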

Hope it helps!

1 Like

Welcome!

While I agree that it's useful to have, and as @Luca_Belluccini pointed out we have something useful in Kibana for this, I don't think it would cover 80% of the use cases.

Elasticsearch can be used for so many things where CSV import is not needed, like observability, enterprise search, or security. IMHO the real need for such a tool is more for one-shot analysis, where you just want to start your cloud.elastic.co cluster and drop in some CSV coming from an open data source. Or for a POC project where you just want to check that the Elastic Stack could fit your use case and data.
That represents, IMO, a very small share of the cases I've seen over the last 9 years with Elasticsearch. So, far from 80%.

Direct import from DB servers that can generate CSV dumps (psql, mysql) easily and natively, so less conversion and more interoperability with legacy systems.

Indeed, most of the time CSV files come from another system like a datastore or an application. I personally prefer connecting my application to elasticsearch directly, without needing a pivot file to export/import my data. I explained this in this blog post.

Less time spent converting the CSV to valid JSON on the client side, saving CPU cycles and trees.

Sure. That's why I recommend generating the JSON directly from the object in memory. Depending on your stack, you can find automatic converters to do that (e.g. the Jackson lib in Java). Once it's loaded in memory, it's blazing fast to transform it to JSON and send it to elasticsearch.
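As a rough sketch of that idea (the class, field and index names here are only illustrative), Jackson can turn each in-memory object straight into the NDJSON lines expected by the Bulk API:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;

public class BulkBodyBuilder {

    // Plain in-memory object, e.g. one row coming from your database
    public static class Person {
        public String firstName;
        public String lastName;
        public String city;

        public Person(String firstName, String lastName, String city) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.city = city;
        }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        List<Person> people = List.of(
                new Person("John", "Doe", "Paris"),
                new Person("Jane", "Smith", "Berlin"));

        // Build the NDJSON body for POST /my-index/_bulk:
        // one action line followed by one document line per object
        StringBuilder bulkBody = new StringBuilder();
        for (Person p : people) {
            bulkBody.append("{\"index\":{}}\n");
            bulkBody.append(mapper.writeValueAsString(p)).append('\n');
        }

        // Send bulkBody with your favorite HTTP or Elasticsearch client
        System.out.print(bulkBody);
    }
}
```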

Less data to transfer from the client to the server, saving bandwidth, sanity and trees. JSON is quite overkill for mostly tabular data, with its repetition of property names on every row.

True. Note that you can activate compression as well.

I'm also sharing here a blog post which uses Logstash to do this for a big dataset (around 20M lines).
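For reference, a minimal Logstash pipeline for that kind of job could look like this (the file path, column names and index name are just examples):

```
input {
  file {
    path => "/tmp/people.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    separator => ","
    columns => ["first_name", "last_name", "city"]
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "people"
  }
}
```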

My 2 cents.

1 Like

Alright, the CSV processor seems to have the least overhead for ingesting. I'll pursue that path for now! Thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.