Loading high-volume transactional data to Elasticsearch

Hi all,
I'm ingesting network traffic with tshark and need to load it into Elasticsearch for further analysis and troubleshooting. The traffic is transformed to JSON and amounts to around 1 TB daily. My bottleneck is now Logstash, which cannot finish loading even half of a day's data into Elasticsearch within a day. I've adjusted pipeline.batch.size to 1000 and pipeline.workers to 200, but it did not help a lot. Is there any alternative to Logstash for indexing this much data into Elasticsearch, or some tuning I could apply to Logstash?

Hello and welcome,

How did you arrive at this conclusion? The Logstash event rate depends on Elasticsearch: if your Elasticsearch cluster cannot index data fast enough, it will tell Logstash to back off.

It is far more likely that your bottleneck is Elasticsearch, not Logstash.

What are your Logstash specs and your Elasticsearch specs? How many cores, memory and disk type?

Also, share your logstash configuration.


If your Elasticsearch cluster is not the issue, then you can use a proxy and multiple Logstash instances to load the data.

I had this issue once: Elasticsearch was fast, but Logstash was not able to handle the load. Once I added a proxy -> logstash1, logstash2, etc., it all worked out.
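For what it's worth, a minimal sketch of that kind of setup, assuming the Logstash instances expose a network input (e.g. a Beats/TCP input on port 5044) rather than the file input used in this thread; hostnames, IPs and ports are placeholders:

# haproxy.cfg (sketch)
defaults
    mode tcp
    timeout connect 5s
    timeout client  1m
    timeout server  1m

frontend logstash_in
    bind *:5044
    default_backend logstash_nodes

backend logstash_nodes
    balance roundrobin
    server logstash1 10.0.0.11:5044 check
    server logstash2 10.0.0.12:5044 check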

What are your Logstash specs and your Elasticsearch specs? How many cores, memory and disk type? Also, share your logstash configuration.

These are obviously the right questions to ask. Also: how many nodes are in the Elasticsearch and Logstash clusters, and is the ingest load being shared across those clusters properly?

Just for the record, 1 TB/day is not really "huge data indexing" at all. That's roughly 12 MB/s, or about 12k docs/s if the average document is 1 KB. There's a thread from 2018 where someone has a little moan about indexing only 50k documents per second on a single node.

Because Logstash is consuming a lot of resources, especially CPU.

I have 750 GB RAM and 80 CPU cores.

Here is the configuration:

#pcap_gy_json.conf
input {
 file {
  path => "/home/elk/pcap/gyJson/*.json"
  start_position => "beginning"
  sincedb_path => "/home/elk/sincedb_gy"
  file_chunk_size => 1000000
  file_completed_action => "delete"
  mode => read
  codec => "json"
 }
}

filter {


mutate {
    rename => {
      "[layers][diameter_flags_request]" => "isRequest"
      "[layers][diameter_CC-Request-Type]" => "CC-Request-Type"
      "[layers][diameter_Session-Id]" => "Session-Id"
      "[layers][e164_msisdn]" => "msisdn"
      "[layers][e212_imsi]" => "imsi"
      "[layers][diameter_Result-Code]" => "Result-Code"
      "[layers][diameter_Rating-Group]" => "Rating-Group"
      "[layers][diameter_3GPP-Reporting-Reason]" => "3GPP-Reporting-Reason"
      "[layers][diameter_3GPP-SGSN-MCC-MNC]" => "3GPP-SGSN-MCC-MNC"
      "[layers][diameter_CC-Total-Octets]" => "CC-Total-Octets"
      "[layers][ip_src]" => "ipSrc"
      "[layers][ip_dst]" => "ipDst"
      "[layers][diameter_resp_time]" => "respTime"
      "[layers][frame_time_epoch]" => "timestamp"
      "[layers][diameter_Called-Station-Id]" => "Called-Station-Id"
      "[layers][diameter_3GPP-Charging-Characteristics]" => "3GPP-Charging-Characteristics"
      "[layers][diameter_Validity-Time]" => "Validity-Time"
      "[layers][diameter_Origin-Realm]" => "Origin-Realm"
      "[layers][diameter_Destination-Realm]" => "Destination-Realm"
      "[layers][diameter_3GPP-Charging-Id]" => "3GPP-Charging-Id"
      "[layers][diameter_CC-Request-Number]" => "CC-Request-Number"
      "[layers][diameter_SGSN-Address_IPv4]" => "SGSN-Address"
    }
    remove_field => ["message", "event", "host", "tags", "@version"]
}

date {
   match => ["timestamp", "UNIX"]
   target => "@timestamp"
}
ruby {
  code => "
    event.set('local_date', event.get('@timestamp').time.localtime.strftime('%Y%m%d%H'))
  "
}


}

output {
 elasticsearch {
  hosts => ["127.0.0.1:9200"]
  index => "pcap-gy-%{local_date}"
  user => "elastic"
  password => "*"
  ssl => true
  ssl_certificate_verification => false
 }
}
#pipelines.yml
- pipeline.id: pcap_gy_json
  path.config: "/home/elk/logstash-8.12.2/conf.d/pcap-gy-json.conf"
  pipeline.batch.size: 2048
  pipeline.workers: 400

This alone does not mean that Logstash is your issue; it could be a symptom, not the cause.

What is the disk type? HDD or SSD? Also, you need to share your Elasticsearch specs as well, number of nodes, CPU, RAM and disk type.

Also, share your logstash.yml; it is not clear whether you are using persistent queues or in-memory queues.
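For reference, the relevant logstash.yml settings look something like this (a sketch; by default the queue is in memory):

# logstash.yml (sketch)
# in-memory queue (the default)
queue.type: memory

# or a persistent queue on disk, e.g.:
# queue.type: persisted
# queue.max_bytes: 8gb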

This seems exaggerated, especially since you are having performance issues. You should increase pipeline.workers past your CPU core count only if your CPU isn't saturated, which does not seem to be the case here.

I would recommend decreasing it to 80, your CPU core count, while troubleshooting this issue.
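For example, keeping the rest of the pipelines.yml entry you shared and only lowering the worker count:

- pipeline.id: pcap_gy_json
  path.config: "/home/elk/logstash-8.12.2/conf.d/pcap-gy-json.conf"
  pipeline.batch.size: 2048
  pipeline.workers: 80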

Thanks for your feedback.
Elasticsearch is running on the same host, sharing the same resources. There is an 8 TB LUN mounted from storage that runs on SSD disks.
logstash.yml is the default; apart from the line below, all other lines are commented out.

config.reload.automatic: true

Yeah, this can be pretty bad for performance, as both Logstash and Elasticsearch can be resource intensive and will compete for resources.

However, on a 750 GB RAM machine with 80 CPU cores I would not expect issues, unless the disk is not fast enough.

How is the load on your server? Can you share a screenshot of htop or a similar tool?

Is the storage shared with other hosts? It is not clear if you are running this on bare metal or a VM.

I don't know how well Logstash and Ruby interact, but is there not a cleverer way to do this?
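For instance (just a sketch, not tested against this data), the elasticsearch output can build the hourly index name itself with date math on @timestamp, which would avoid the ruby filter entirely; note this formats in UTC, unlike the localtime strftime used above:

output {
  elasticsearch {
    # other options unchanged from the config above
    index => "pcap-gy-%{+yyyyMMddHH}"   # UTC-based date math on @timestamp
  }
}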

The way it was written, I am guessing this is from a SAN storage device.

iostat output over a few minutes would be useful.

Running Logstash and Elasticsearch on the same machine in production is not something I'd consider, but then I don't have an 80-core / 750 GB RAM server :smile:

Would there be an advantage to fixing/limiting the JVM memory size via options to the JVMs, rather than letting the OS referee/manage that?

In elasticsearch.yml I have the configuration below:
http.max_content_length: 500mb

I'm running it on bare metal. Today I added an Elasticsearch node on another host to reduce the IO, but the speed seems to have decreased instead.

Below are my JVM options for Logstash; for Elasticsearch I didn't define them, so it follows the defaults. What values would you suggest for the JVM?

-Xms4g
-Xmx8g

Xms is the starting size of the JVM memory allocation pool.
Xmx is the maximum size of the JVM memory allocation pool.

8g as a max seems very low.

I'd suggest increasing these by a lot if you are running on the big memory machine.

I was thinking more of the situation where both Logstash and Elasticsearch are on the same machine: allocate a chunk of RAM to each and leave the OS to manage a more limited file-system cache. But this certainly seems sub-optimal to me.

In the atop output you shared, there was a lot of RAM assigned under "Cache" (571G) and 95G free, so your Logstash/Elasticsearch processes were not actually using much of that massive RAM.

sdd is obviously where most of the IO is happening: 86% busy and 632 MB/s written. Less reading than writing, but still a fair amount.

You can't go above ~31 GB of heap. Set both Logstash and Elasticsearch to 31 GB, as you have plenty of RAM.

31 GB for Logstash? As per the Elastic documentation, the Elasticsearch JVM heap can be increased towards 32 GB, but for Logstash it suggests not going above 8 GB.

From what you shared, your issue does not seem to be related to memory but to IO; probably your disk is not fast enough, or you need to tune both Elasticsearch and Logstash.

You have a pretty powerful machine, with a lot of RAM and CPU, but you would still need to tune Elasticsearch and Logstash. The main issue is that tuning both on the same machine is a little more complicated.

In the screenshot you shared, even with 80 cores your server is under heavy load: the 1-minute load average is above 100 and the 15-minute average above 90. This suggests an IO issue, with your disk not being able to keep up with the read/write requests.

Can you run an iostat on your data disks? Something like iostat -m -p sdd.

Also, you said you didn't define the Java heap for Elasticsearch; it will then use half of your memory, which can be an issue because there is a recommendation not to go over roughly ~30 GB, the threshold for compressed ordinary object pointers (oops) in Java.

You can check whether you went over this by running GET _nodes/_all/jvm in Kibana Dev Tools and looking for this line:

"using_compressed_ordinary_object_pointers": "true"

It should be true.

A couple of questions/suggestions:

Did you decrease the number of pipeline.workers? Your CPU is heavily saturated and you were using a very high number of workers; I would recommend that you change it back to the default, which is the number of cores. It is not clear if you changed that already.

What is the filesystem of your data disk, i.e. the one Logstash is reading from and the one configured as the data path for Elasticsearch? Is it ext4? xfs? btrfs? Depending on the filesystem and the mount options, you may have room for improvement.

Please share the file system and the mount options being used.
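For example, something like this will show both (the mount point is a placeholder, adjust it to wherever the data disk is mounted):

# filesystem type and mount options of the data mount point
findmnt -no FSTYPE,OPTIONS /home/elk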

As the Logstash help says, set Xms and Xmx to the same value; you can use 8 GB in this case.
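That is, in Logstash's config/jvm.options:

-Xms8g
-Xmx8g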

Let's read the logstash recommendation:

"The recommended heap size for typical ingestion scenarios should be no less than 4GB and no more than 8GB."

A reminder that the initial question was posed with both Logstash and Elasticsearch running on the same single host, with the massive 750 GB RAM and 80 cores. I would not call that a "typical ingestion scenario", YMMV.

That's not to say the recommendation should be ignored, but it does suggest the scenario here is already pretty far from best practice. As noted, my earlier suggestion was written before I knew that both Logstash and Elasticsearch were running on the same single host.

"Also how many nodes in the elasticsearch and logstash clusters, and just check the ingest load is being shared across the clusters properly"

This remains good advice IMHO: the best overall performance is usually achieved by scaling horizontally, not vertically.

I agree with all of that (what Leandro wrote), with the caveat that "probably your disk is not fast enough" is only perhaps true. But even if so, there is often not much the poster can do to make a single LUN from a typical SAN significantly quicker. I also suggest collecting stats with iostat, as well as benchmarking the storage itself (while not in use by Logstash/Elasticsearch) with tools like fio/bonnie++.

Obviously things like striping data over multiple LUNs, and using multiple controllers or paths, can also help in case of IO constraints.
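As a rough example of such an fio benchmark (run it while Logstash/Elasticsearch are stopped; the file path, size and job count are placeholders):

# sequential-write test on the data disk
fio --name=seqwrite --filename=/home/elk/fio_testfile --size=4G --bs=1M \
    --rw=write --ioengine=libaio --direct=1 --numjobs=4 --group_reporting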

If you can separate out the Logstash flow to another server for testing, e.g. send all the data to or read all the files on another physical server, you can there pipe the Logstash output to /dev/null instead of Elasticsearch to get an estimate of the actual achievable Logstash throughput on that server, with its config. Then play around with different Logstash/JVM/Linux options and see what's best for your specific data.
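A minimal sketch of such a test output, swapped in for the elasticsearch output while benchmarking (the dots codec prints one character per event, so with pv installed the byte rate equals the event rate):

output {
  stdout { codec => dots }
}
# then, for example:
# bin/logstash -f pcap-gy-json.conf | pv -abt > /dev/null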

Thanks for the feedback; here is the iostat output:

Linux 4.12.14-122.186-default         12/12/2024      _x86_64_        (80 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.12    0.11    1.29    0.15    0.00   83.34

Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sdd             168.86         2.45        31.77   36062442  468397317

The given parameter is false on my node:

"using_compressed_ordinary_object_pointers": "false"

pipeline.workers is now 80 and my filesystem is ext4

I've set the JVM options on Logstash and Elasticsearch to 8 GB and 32 GB respectively, but no improvement is observed.

This needs to be true. You will need to decrease the heap for Elasticsearch: try 30 GB, restart the node, and check again.
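For example (a sketch; the file name is arbitrary), on the Elasticsearch node:

# config/jvm.options.d/heap.options
-Xms30g
-Xmx30g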

And the mount options? You didn't share them. Check whether the disk is mounted with relatime or atime; it needs to be mounted with noatime, so you may need to change the mount options and reboot your system.
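A sketch of what the fstab entry might look like (device, mount point and the remaining fields are placeholders for whatever is there today):

# /etc/fstab
/dev/sdd1   /home/elk   ext4   defaults,noatime   0  2
# or, to apply without a full reboot:
# mount -o remount,noatime /home/elk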

This will not exactly solve your issue, and it may not improve things much, but these are the basic things to do before anything else.

As mentioned, tuning Elasticsearch and Logstash is not easy, and trying to do both at the same time on the same machine is even more complicated.

Another quick thing you can do on the Elasticsearch side is change the refresh_interval of your index.

Do you have a template for the pcap-gy-* indices? If so, what is the value of index.refresh_interval? The default is 1s, which can have a big impact on performance; on my indices I change it to at least 15s, and the majority of them have it set to 30s.
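If you don't have a template yet, something along these lines (a sketch; the template name is arbitrary) would apply it to new pcap-gy-* indices, and you can also change it on existing indices via the _settings API:

PUT _index_template/pcap-gy
{
  "index_patterns": ["pcap-gy-*"],
  "template": {
    "settings": {
      "index.refresh_interval": "30s"
    }
  }
}

# for the indices that already exist:
PUT pcap-gy-*/_settings
{
  "index": { "refresh_interval": "30s" }
}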

And just to be clear, you have only that one pipeline you shared, right?