Scaling Logstash

Hi All,

We have a system where we rebuild an index of 300 million documents every night. Our source of truth is SQL Server, and every night we want to build a new index from it. Our documents are about 2-3 KB each. I was wondering if there is any documentation out there that can help me scale Logstash by adding multiple instances. What is the best throughput I can achieve using Logstash?

Thanks.

Since the jdbc input itself is single-threaded, you'll probably want multiple jdbc inputs to parallelize the processing and avoid making the input a bottleneck. Perhaps you can shard the input data on some field? For example, if you want ten jdbc inputs, the first one would use the query

SELECT * FROM tablename WHERE id % 10 = 0

the second one

SELECT * FROM tablename WHERE id % 10 = 1

and so on (assuming the id column is an integer with sufficiently random distribution of values).
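A minimal sketch of what that could look like in a Logstash pipeline configuration, assuming placeholder driver path, connection string, and credentials that you'd replace with your own:

input {
  jdbc {
    jdbc_driver_library => "/path/to/mssql-jdbc.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://dbhost:1433;databaseName=sourcedb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    statement => "SELECT * FROM tablename WHERE id % 10 = 0"
  }
  jdbc {
    jdbc_driver_library => "/path/to/mssql-jdbc.jar"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_connection_string => "jdbc:sqlserver://dbhost:1433;databaseName=sourcedb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    statement => "SELECT * FROM tablename WHERE id % 10 = 1"
  }
  # ... and so on, up to "WHERE id % 10 = 9"
}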

Thanks for the reply. I have attempted something similar; we have an id field which is an integer. Will I get better throughput if I use one Logstash instance with multiple jdbc inputs, or multiple Logstash instances with one jdbc input each?

Thanks

I'm not sure it matters. The other factors that come into play can be tweaked either way.

@magnusbaeck - I'm using Linux servers for my Logstash instances. I have 10 such instances set up, and I am dividing the data by performing a modulo (%) on the id column. Each instance has 2 GB RAM and 2 cores. With this setup I get a throughput of about 250 docs/second. I'm using all default settings; are there any settings I should tweak (in Logstash or in my ES cluster) to get better performance?

Thanks.

250 docs/s per instance I suppose? That's slow but not unreasonably slow if you have many filters. Is the pipeline's bottleneck Logstash at all or is it ES that's limiting the throughput?

When only one Logstash instance is running, ES ingests at 250 docs/second. With all 10 instances running I get a throughput of 2,000 docs/second. I have a ruby filter in my config that deletes all attributes in the event that are null. Since the throughput with all 10 instances running is not 2,500 docs/second, is it safe to assume that part of the bottleneck is the cluster? If so, how can I improve that?
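For reference, the null-stripping ruby filter is roughly the following (a sketch; my actual filter may differ slightly):

filter {
  ruby {
    # Remove every field whose value is nil from the event
    code => "event.to_hash.each { |k, v| event.remove(k) if v.nil? }"
  }
}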

Thanks.

What is the specification of your Elasticsearch cluster? How many nodes do you have? How many indices/shards are you actively indexing into?

I have a 4-node cluster. Each node has 64 GB of memory and 1.2 TB of disk. I will be actively ingesting into 1 index at a time with about 30 million documents (there are a total of 10 such indices that need to be refreshed, and the total number of docs is about 250-350 million), and the index has 32 shards. That is a lot of shards, but in testing I have found that query performance is ideal for this index with 32 shards (there are about 10-15 different aggregations per query).

Thanks.

Assuming you are on a recent version of Logstash, could you please try starting it with a larger pipeline batch size (-b), e.g. 1000 or 2000, to see if that makes a difference?
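For example (the config path here is just a placeholder):

bin/logstash -f /etc/logstash/conf.d/reindex.conf -b 2000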
