My setup:
I have a very large newline-delimited JSON file, on the order of TBs. I want to ingest this file into my Elasticsearch server as quickly as possible. I have plenty of hardware and a cluster set up using Kubernetes.
I have tried many changes, such as modifying the batch size, number of workers, number of shards, memory, CPU, JVM heap and some other parameters. The maximum ingestion rate I am getting is around 0.1 million documents per second. I want to scale it to at least 10 million docs per second.
Does Logstash properly parallelise the ingestion, or am I missing something? I can split the file into multiple parts, but how do I configure Logstash to ingest those smaller files in parallel?
A couple of things need to be clear: the ingestion rate you get will depend far more on the speed of the source disk and the specs of your Elasticsearch nodes than on Logstash settings.
I would say that in the majority of ingestion issues the bottleneck is Elasticsearch, which cannot process the events fast enough.
What are the specs of your Elasticsearch nodes? CPUs, RAM, heap and, especially, the disk type. Have you already read the documentation on how to tune Elasticsearch for indexing speed?
0.1 million events per second is 100k e/s, which is a pretty good event rate; 10 million is pretty heavy and very hard to reach without a lot of resources on the Elasticsearch side.
The filters and outputs will run in parallel, but not the file input, so if you have one big file, Logstash will read it sequentially.
If you split the big file into multiple smaller files, you would need to use multiple file inputs, each of them reading from some files or file patterns (see the sketch below). But keep in mind that if they are on the same source disk this can have the opposite effect, as you would have more concurrent I/O requests, which could make things slower.
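If it helps, here is a minimal pipeline sketch along those lines, assuming the big file has been split into parts under a hypothetical /mnt/logstash/split/ directory. Each file input plugin runs in its own thread, so the two globs below are read in parallel, while pipeline.workers controls the parallelism of the filters and outputs. The paths, host and index name are just placeholders.

input {
  # each file input runs in its own thread, so these globs are read in parallel
  file {
    path => "/mnt/logstash/split/part-0*.json"
    mode => "read"
    sincedb_path => "/dev/null"
    codec => "json"
  }
  file {
    path => "/mnt/logstash/split/part-1*.json"
    mode => "read"
    sincedb_path => "/dev/null"
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "my-index"
  }
}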
Is this a one-time ingestion or something that needs to be done in real time at this same scale?
I have tuned Elasticsearch and have tried multiple tuning options the documentation mentions. This is mostly going to be a one-time thing, and I just want to quickly upload my input data to the server.
The benchmark of 10 million docs/s I mentioned was with the esrally benchmarking tool. But it is not standard and a bit messy to use, so I tried Logstash for this purpose.
I have multiple nodes set up in a Kubernetes cluster with 200 cores and TBs of RAM, with very fast disks as well. Therefore there is not going to be a system bottleneck anywhere (I was able to achieve the above rate using esrally anyway).
Can you provide a more general solution to this problem (not necessarily with Logstash)? My target is just to ingest a very large file into the server.
What options have you changed? In your other question about a template, your template didn't change the default refresh_interval, which can heavily impact performance.
What refresh_interval did you set for your index? Also, did you create a mapping for your data, or is it using dynamic mapping? This also impacts performance.
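As a reference, for a one-time bulk load it is common to disable refresh and replicas while indexing and restore them afterwards. A minimal sketch, assuming a hypothetical index called my-index:

PUT my-index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}

Once the load is finished, set refresh_interval back to something like 1s (or remove the setting to get the default) and restore the number of replicas.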
Please share the specs of your Elasticsearch nodes, the elasticsearch.yml and the tuning options you changed; without knowing what you already did, it is hard to provide any feedback.
Also, how many TBs are we talking about for this file? Tens of TBs? Hundreds of TBs?
Another thing: is this mount path a network share?
mountPath: /mnt/logstash/
This is not a trivial issue; there are many variables: disk speed, tuning configuration, network speed, etc.
I do not use Logstash to read files anymore; I normally put all my data into Kafka clusters and have multiple Logstash instances reading from the Kafka topics. But to get your data into Kafka you would also need some tool to read the file and put it on the topics.
I am not sure that is justified for a one-time ingestion, or even whether it would change anything.
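Since you asked for something more general than Logstash: for a one-time load you can also talk to the bulk API directly. Below is a rough sketch using the official Python client's parallel_bulk helper to stream the newline-delimited JSON file; the host, index name, file path, thread_count and chunk_size are all assumptions you would need to tune for your cluster.

import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

def actions(path, index):
    # Each line of the file is one JSON document; wrap it as a bulk action.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield {"_index": index, "_source": json.loads(line)}

# parallel_bulk sends several bulk requests concurrently; iterate the results
# so that failures are reported instead of silently dropped.
for ok, info in parallel_bulk(
    es,
    actions("/mnt/logstash/big-file.json", "my-index"),
    thread_count=8,
    chunk_size=5000,
):
    if not ok:
        print("failed:", info)

If you split the file, you can run several copies of a script like this, each pointed at its own part, the same way you would run multiple file inputs in Logstash.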