Indexing 100 million+ records from PostgreSQL

I just want to know what specs and configuration I need to index 100 million+ records. Right now I have this much free memory:

              total        used        free      shared  buff/cache   available
Mem:          64410        2782       25389       13663       36237       47315

I was able to index a couple of million records before, but this dataset is much larger. I have increased the heap space to 16 GB.

I am using Ubuntu 18.x.

Here's my config:

# file: my.conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost/db?user=[USER]&password=[PASS]"
    jdbc_user => "postgres"
    jdbc_driver_library => ""
    jdbc_driver_class => "org.postgresql.Driver"
    statement => "SELECT a.app_id, a.text, CASE WHEN b.granted = 'f' THEN 0 ELSE 1 END as isgrant FROM tbl1 as a LEFT JOIN tbl2 as b ON a.app_id = b.app_id"
  }
}

output {
  elasticsearch {
    index => "idx"
    document_id => "%{app_id}"
    hosts => "http://localhost:9200"
  }
}

Please help.

There is no way to answer that question. A t2.micro single-CPU cloud instance with 500 MB of heap can feed a billion records to elasticsearch, although it will probably take longer than a larger instance (unless your elasticsearch instance is the bottleneck).

If you are not using any filters and are just pulling data out of jdbc, I would not expect you to need a massive amount of memory.
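
One thing that tends to keep memory flat on a result set this size is letting the jdbc input page through the query rather than pulling the whole result set at once. A minimal sketch of the relevant input settings, reusing your query; the page and fetch sizes are illustrative values you would need to tune:

# Sketch only: paging/fetch settings for very large result sets.
input {
  jdbc {
    jdbc_connection_string => "jdbc:postgresql://localhost/db?user=[USER]&password=[PASS]"
    jdbc_user => "postgres"
    jdbc_driver_library => ""
    jdbc_driver_class => "org.postgresql.Driver"
    # Have the plugin split the query into LIMIT/OFFSET pages instead of
    # materializing all 100M+ rows in one go.
    jdbc_paging_enabled => true
    jdbc_page_size => 100000      # rows per page, illustrative
    # Fetch-size hint passed to the JDBC driver for each page.
    jdbc_fetch_size => 10000      # illustrative
    statement => "SELECT a.app_id, a.text, CASE WHEN b.granted = 'f' THEN 0 ELSE 1 END as isgrant FROM tbl1 as a LEFT JOIN tbl2 as b ON a.app_id = b.app_id"
  }
}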

I suggest you experiment. Measure the number of documents that you can index per second on a given set of hardware. Decide if it is fast enough. If not, start figuring out whether the bottleneck is the database, logstash, or elasticsearch and tune that subsystem. Then measure again and repeat. Nobody else can do this for you.
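
One common way to check whether elasticsearch is the bottleneck (not the only way, just a cheap one) is to temporarily swap the elasticsearch output for a stdout output with the dots codec and time how fast events drain from the database through logstash on their own. A sketch:

# Temporary output for throughput testing: prints one dot per event,
# taking elasticsearch out of the picture entirely.
output {
  stdout { codec => dots }
}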


Thanks! This means more work for me :slight_smile:
