Elasticsearch to Postgres migration and building a cache for trillions of records

I am working on a product where the input is source data describing files and folders. This data is huge, on the order of trillions of records. It goes through a metadata scan, and the scan results for the input files and folders are stored in Elasticsearch with indexing. A query service then pulls that data into a Postgres database, and this migration/sync is extremely slow. On top of the Postgres data (which is also huge) we build a cache to speed up the dashboard, but dashboard performance is still very slow. I am not sure where the problems in this design are. Can you identify what is missing and what changes I should make to improve performance significantly?

Hi @S8ukumar,

Welcome! To confirm, it sounds like you are seeing slowness both in the ingestion of documents you are syncing and in the queries you are using in the dashboard. We need a bit more information to help:

  1. Which version of Elasticsearch are you using?
  2. What is the size of the cluster you are using?
  3. What technology are you using to ingest the data? Are you using Logstash and the JDBC input plugin, or something else? If so, can you share the configuration?
  4. Which queries on your dashboard are running slowly? It might be worth using the Search Profiler to see what's happening with those queries.

Let us know!

It is not clear to me what the actual flow of data is, so it would be good to clarify this.

I interpret this as meaning you first ingest large data volumes into Elasticsearch; is this correct? If so, is this loading performant?

Which version of Elasticsearch are you using?

What is the size and specification of your cluster in terms of node count, CPU, RAM and type of storage used?

It sounds like you then extract data from Elasticsearch and ingest into Postgres. Is this done through aggregations or queries extracting raw documents?

What do the queries/aggregations look like?

What is the size and number of indices queried? What is the average shard size? How much data is extracted?

Where is the cached data stored? Are you referring to Kibana dashboards?

What is the size and structure of data in the cache?

Hi Cary, thanks for your response. Let me rephrase my problem once again, and then I will answer your questions as well.

  1. We run a metadata scan of trillions of files, capture their metadata, and store it in Elasticsearch with indexing.
  2. Once the metadata scan (MDS) finishes, Kafka sends a notification to the Sync Service. The Sync Service then pulls data from Elasticsearch in chunks of 100K records, performs ETL through staging tables in the Postgres DB, and stores the results in permanent Postgres tables chunk by chunk (see the sketch after this list). This takes a lot of time, e.g. 40-45 minutes for 500K records.
  3. We run a Cache Service every midnight to build a Redis cache against 600 million records within 24 hours (1 day). The rate is extremely slow, so when the front-end dashboard pulls data it is usually not the latest; it comes from the previous cache. This is another major performance lag.
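
For reference, the Sync Service follows roughly this pattern. This is only a simplified sketch with placeholder index, table and field names, not our exact code:

```python
# Simplified sketch of the Sync Service flow (placeholder names, not our exact code).
# It pages through Elasticsearch with a point-in-time + search_after cursor and
# bulk-loads each batch into a Postgres staging table via COPY.
import io

import psycopg2
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200")
pg = psycopg2.connect("dbname=metadata user=sync")

BATCH_SIZE = 10_000  # illustrative; we currently pull 100K per chunk

# Open a point-in-time so the result set stays consistent while paging.
pit_id = es.open_point_in_time(index="file-metadata", keep_alive="5m")["id"]
search_after = None

try:
    while True:
        search_kwargs = dict(
            pit={"id": pit_id, "keep_alive": "5m"},
            size=BATCH_SIZE,
            sort=[{"_shard_doc": "asc"}],  # cheapest stable sort for deep paging
            query={"match_all": {}},
        )
        if search_after is not None:
            search_kwargs["search_after"] = search_after
        resp = es.search(**search_kwargs)

        hits = resp["hits"]["hits"]
        if not hits:
            break
        search_after = hits[-1]["sort"]

        # COPY the whole batch into a staging table; far faster than row INSERTs.
        buf = io.StringIO()
        for hit in hits:
            src = hit["_source"]
            buf.write(f"{hit['_id']}\t{src['path']}\t{src['size']}\n")
        buf.seek(0)
        with pg.cursor() as cur:
            cur.copy_expert(
                "COPY staging_file_metadata (doc_id, path, size) FROM STDIN",
                buf,
            )
        pg.commit()
finally:
    es.close_point_in_time(id=pit_id)
```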

Answers to your questions:

  1. Which version of Elasticsearch are you using? 8.15.1
  2. What is the size of the cluster you are using?
    RAM: 30.7 GB / 31.3 GB, heap: 1.1 GB / 15.6 GB, disk: 84.7 GB / 144.6 GB
  3. What technology are you using to ingest the data? Explained above. Please suggest any tool that fits the above flow and would improve performance.
  4. Which queries on your dashboard are running slowly? They are driven by the filters set on the dashboards; based on those, the dashboard fetches data through API requests.

Thanks, Chris, for your response. Let me rephrase my problem once again, and then I will answer your questions as well.

  1. We run a metadata scan of trillions of files, capture their metadata, and store it in Elasticsearch with indexing.
  2. Once the metadata scan (MDS) finishes, Kafka sends a notification to the Sync Service. The Sync Service then pulls data from Elasticsearch in chunks of 100K records, performs ETL through staging tables in the Postgres DB, and stores the results in permanent Postgres tables chunk by chunk. This takes a lot of time, e.g. 40-45 minutes for 500K records.
  3. We run a Cache Service every midnight to build a Redis cache against 600 million records within 24 hours (1 day). The rate is extremely slow, so when the front-end dashboard pulls data it is usually not the latest; it comes from the previous cache. This is another major performance lag. A rough sketch of this step is shown below.
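
For reference, the Cache Service writes to Redis roughly like this. Again, this is a simplified sketch with placeholder keys, table and field names, not our exact code:

```python
# Simplified sketch of the nightly Cache Service (placeholder names, not our exact code).
# Reads rows from Postgres with a server-side cursor and writes them to Redis
# in pipelined batches to cut per-command round trips.
import psycopg2
import redis

pg = psycopg2.connect("dbname=metadata user=cache")
r = redis.Redis(host="localhost", port=6379)

PAGE_SIZE = 10_000

# A named cursor is a server-side cursor, so Postgres streams rows instead of
# materialising all 600 million in client memory.
with pg.cursor(name="cache_scan") as cur:
    cur.itersize = PAGE_SIZE
    cur.execute("SELECT doc_id, path, size FROM file_metadata")

    pipe = r.pipeline(transaction=False)
    pending = 0
    for doc_id, path, size in cur:
        pipe.hset(f"file:{doc_id}", mapping={"path": path, "size": size})
        pending += 1
        if pending >= PAGE_SIZE:
            pipe.execute()  # one round trip for the whole batch
            pending = 0
    if pending:
        pipe.execute()
```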

Answers to your questions:

  1. Which version of Elasticsearch are you using? 8.15.1
  2. What is the size of the cluster you are using?
    RAM: 30.7 GB / 31.3 GB, heap: 1.1 GB / 15.6 GB, disk: 84.7 GB / 144.6 GB
  3. What technology are you using to ingest the data? Explained above. Please suggest any tool that fits the above flow and would improve performance.
  4. Which queries on your dashboard are running slowly? They are driven by the filters set on the dashboards; based on those, the dashboard fetches data through API requests.

I do not understand why you have provided exactly the same answer twice; it seems redundant. I do have some further questions, though:

How many nodes do you have in your cluster?

What type of storage are you using? Is it local SSD?

Why is the heap listed as two values? You should always set the minimum and maximum heap size to the same value.

What is indexing this data? Are you using bulk requests? Is this performing OK?
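
For context, this is roughly what I mean by bulk requests, using the official Python client purely as an assumption about your stack (index and field names are illustrative); indexing one document at a time will be far too slow at this volume:

```python
# Minimal bulk-indexing sketch with the official Python client (assumed stack;
# index and field names are illustrative, your ingest tool may differ).
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("https://localhost:9200")

# Placeholder for whatever your metadata scan actually produces.
scan_results = [{"id": "1", "path": "/data/a.txt", "size": 123}]

def generate_actions(items):
    """Yield one bulk action per scanned file."""
    for item in items:
        yield {
            "_index": "file-metadata",
            "_id": item["id"],
            "_source": {"path": item["path"], "size": item["size"]},
        }

indexed = 0
for ok, _result in streaming_bulk(es, generate_actions(scan_results), chunk_size=5_000):
    if ok:
        indexed += 1
print(f"indexed {indexed} documents")
```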

Am I reading this correctly that Kafka contains pieces of work that correspond to queries that will be run against Elasticsearch?

It sounds based on the description like you are running queries and extracting the raw documents, is that correct?

What is the average document size?

What do the queries you run look like? Are they simple or complex? What kind of query clauses do they use?

How are you pulling the data out? Are you using PIT and search_after? What batch size are you using? Are you ordering the data being retrieved?

What type and level of parallelism are you using here?

How have you verified that the bottleneck is getting the data out of Elasticsearch rather than ingesting it into Postgres?
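
One simple way to check is to time the extraction on its own, without writing anything to Postgres, and compare that rate with the full pipeline. A rough sketch, assuming the Python client (index name and query are illustrative):

```python
# Rough way to isolate the bottleneck: measure raw extraction throughput from
# Elasticsearch without writing anything to Postgres, then compare it with the
# rate of the full pipeline. Index name and query are illustrative.
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("https://localhost:9200")

start = time.monotonic()
count = 0
for _doc in scan(es, index="file-metadata", query={"query": {"match_all": {}}}, size=10_000):
    count += 1
    if count >= 1_000_000:  # a 1M-document sample is enough to estimate the rate
        break

elapsed = time.monotonic() - start
print(f"{count} docs in {elapsed:.1f}s -> {count / elapsed:.0f} docs/s from Elasticsearch alone")
```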

Exactly how is this done? What type of query are you using? What is the type and level of parallelism?

What is the level of throughput you are seeing?

Have you looked at disk I/O on the Elasticsearch nodes, e.g. using iostat -x, while you are extracting data, to check whether that may be the limiting factor? Reading lots of data from disk will often result in lots of small random reads, which require fast disks.

You have not described what tools you are actually using, just the flow of data. It would be useful if you could provide some details around the toolset used.

If data is coming out of Elasticsearch faster than it can go into Postgres, it sounds to me like the problem is with Postgres. If that is the case, I do not think we can help here.

What analysis are you doing in Postgres that you cannot do in Elasticsearch? Might it be possible to simplify the whole process?