Fast way to index a large file into Elasticsearch

Hi.

In many cases we index data from files,
sometimes very large files, gigabytes in size.

What is your favorite way to index a large JSON file into Elasticsearch?

The Logstash file input is a good and easy way to do that,
but it is aimed at streaming small chunks.

Is there any impressively fast and safe way to index a single large JSON file into Elasticsearch?

When I ran Rally on my server, it indexed 25,000 docs per second. I don't know Rally's internal indexing rules or mappings,
but I was only able to reach 7,000 docs per second via Logstash.
Adding more filter workers and elasticsearch output workers did not increase indexing performance; at that time, Elasticsearch's active bulk thread count was 0 or 1.

The test mapping is very simple: just 10 not-analyzed fields.


Hi,

great to hear you are using Rally (which I author). However, I fear you are mixing up a few things. If you just ran esrally, it indexes country data from geonames. It also uses JSON data that is already prepared for bulk indexing (as we don't want to stress test the load test driver but Elasticsearch). Indexing is done with 8 client threads and a bulk size of 5000 documents.

But I assume you feed Logstash not the geonames file but your own data file, which has a different structure, so this is already an apples-to-oranges comparison. Next, the Logstash file input has to do more work than Rally, because it needs to read your data and convert it into a format that can be used for bulk indexing. Third, if any of your other filters in the Logstash processing pipeline doesn't support bulk processing, it has to fall back to sending events individually (so your problem could also be related to other filters).

It really depends on your concrete use case, but as a first step, you could look at all of your filters. If you need to import these files only once, you could also index them directly with the bulk API of Elasticsearch. This would also be a better baseline for your performance comparison (alternatively, you could write your own track with your data in Rally).
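For illustration, here is a minimal sketch using the Python Elasticsearch client and its streaming bulk helper. It assumes one JSON document per line and that the index and its mapping already exist; the host, file name, and index name are just placeholders, not values from this thread.

import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def read_actions(path, index_name):
    # Yield one bulk action per line so the whole file never has to fit in memory.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield {"_index": index_name, "_source": json.loads(line)}

# streaming_bulk sends the documents in chunks (here 5000, matching the bulk size discussed above).
for ok, result in helpers.streaming_bulk(
        es, read_actions("large_file.json", "my-index"),
        chunk_size=5000, raise_on_error=False):
    if not ok:
        print("failed:", result)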

If you have specific follow-up questions related to Logstash, I think it would be best to ask them in the Logstash forum.

Daniel

@danielmitterdorfer Thank you for the detailed answer!

Actually, there is only a date filter in the filter block. If I set the elasticsearch output plugin to 8 workers and a bulk size of 5000, and use the same documents and mapping, the two might be a little more comparable, right?

Anyway, Rally is really straightforward and cool so far, and good for setting a baseline.

Thanks again.

Jihun

Hi @no_jihun,

since in one case you import data via Logstash and in the other case directly into Elasticsearch, it is still a bit hard to compare the two fairly. But it's good that you have chosen the same bulk size and number of workers.

I'm not sure whether you are aware that, by default, Rally starts a local Elasticsearch cluster against which it will benchmark. If you want to benchmark your own cluster instead, you can do this by running:

esrally --pipeline=benchmark-only --target-hosts=your_elasticsearch_host_name:your_elasticsearch_port

Depending on the version of Elasticsearch you need to benchmark, you'll also have to provide the version, e.g. --distribution-version=5.0.0-alpha1 if your cluster runs 5.0.0-alpha1. Rally needs this information to know which mapping file to load (and a few other things). However, support for older versions of Elasticsearch is currently not that good in Rally. It is doable if you have detailed knowledge of Rally, and I am currently working on improving that support.
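For example, putting these options together (host and port are placeholders):

esrally --pipeline=benchmark-only --distribution-version=5.0.0-alpha1 --target-hosts=your_elasticsearch_host_name:your_elasticsearch_port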

Also, please note that Rally assumes that the cluster contains no documents at all (it will verify that it has indexed the correct number of documents and will fail otherwise). This is no problem if Rally has complete control over the cluster, but when you benchmark a cluster that is not under Rally's complete control by using --pipeline=benchmark-only, it is something you currently need to ensure yourself.

For more details please see:

Daniel