Help with indexing millions of rows through postgres -> logstash -> elasticsearch

Hi!

I want to gather some advice on a problem I'm facing right now: I have to index millions of rows from a Postgres instance into Elasticsearch.

To give some scale, let's assume this data represents products. A product belongs to exactly one collection and is created by a user.

  • Products: 52,000,000
  • Collections: 9,000,000
  • Users: roughly 9,000,000 as well

To allow easier filtering, I also want to know how many products and collections a user has, and how many products are inside a collection.

For the implementation, I am using Logstash with the jdbc input plugin, running one Logstash instance for each of the types described above so the imports happen in parallel.

The problem is, it's just very slow and I am not sure how to speed it up the right way. Disk and CPU on the ES instance look fine; Postgres is scratching its CPU limit with a lot of IO waits.
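One way to see what those waits actually are (an assumption on my side: this needs Postgres 9.6 or newer, where pg_stat_activity exposes wait events):

    -- Snapshot of what active backends are currently waiting on.
    SELECT state, wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    GROUP BY state, wait_event_type, wait_event
    ORDER BY count(*) DESC;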

The underlying issue is the sheer amount of data. This is the query I'm using for users:

    SELECT
      "user".*,
      (SELECT count(*)
       FROM products product
       WHERE product.user_id = "user".id
      ) AS product_count,
      (SELECT count(*)
       FROM collections collection
       WHERE collection.user_id = "user".id
      ) AS collection_count
    FROM users "user"  -- USER is a reserved word in Postgres, so the alias needs quoting

All *_id fields are indexed with a btree. Running EXPLAIN:

 Limit  (cost=0.00..1565.01 rows=1 width=41)
   ->  Seq Scan on users user  (cost=0.00..14549429167.11 rows=9296702 width=41)
         SubPlan 1
           ->  Aggregate  (cost=1415.84..1415.85 rows=1 width=0)
                 ->  Bitmap Heap Scan on products product  (cost=7.32..1414.95 rows=355 width=0)
                       Recheck Cond: (user_id = user.id)
                       ->  Bitmap Index Scan on product_user_id_index  (cost=0.00..7.23 rows=355 width=0)
                             Index Cond: (user_id = user.id)
         SubPlan 2
           ->  Aggregate  (cost=149.13..149.14 rows=1 width=0)
                 ->  Index Only Scan using collections_user_id_index on collections collection  (cost=0.43..149.02 rows=44 width=0)
                       Index Cond: (user_id = user.id)

When timing the read queries:

  • LIMIT 1: 0.695ms
  • LIMIT 10: 10434ms
  • LIMIT 100: 150471ms

Based on this information I'm currently using a low page size (page_size: 10, fetch_size: 100) to avoid very long-running queries, but these times make importing into ES a very slow and tiresome process. Any tips on how to get the data out faster?
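If I understand jdbc_paging_enabled correctly (worth double-checking against the plugin docs), it breaks the statement into multiple queries using limits and offsets, so each page effectively runs something like:

    -- Sketch of a paged read with page_size 10; the inner select stands in
    -- for the full users statement above.
    SELECT *
    FROM (
      SELECT "user".* FROM users "user"
    ) AS paged
    LIMIT 10 OFFSET 20  -- the offset grows by page_size with every page

If that's right, the count subqueries are evaluated even for the rows the OFFSET discards, which would explain why deep pages cost about as much as huge limits.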

If I were to beef up the DB machine, would adding more CPUs help? AFAIK Postgres doesn't execute a single query on multiple cores, so I am not sure how much that would help.
(Also slightly related, but how come the count() for collections uses an index-only scan, while products uses a bitmap heap scan instead?)
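For reference, a common rewrite for a full-table export like this is to aggregate each child table once and join the counts back in, instead of probing the indexes twice for every user row. A minimal sketch against the schema above (untested against this dataset):

    -- Aggregate products and collections once each, then join to users.
    SELECT
      u.*,
      COALESCE(p.product_count, 0)    AS product_count,
      COALESCE(c.collection_count, 0) AS collection_count
    FROM users u
    LEFT JOIN (
      SELECT user_id, count(*) AS product_count
      FROM products
      GROUP BY user_id
    ) p ON p.user_id = u.id
    LEFT JOIN (
      SELECT user_id, count(*) AS collection_count
      FROM collections
      GROUP BY user_id
    ) c ON c.user_id = u.id;

This pays for two sequential scans once per export rather than two index lookups per user, which is usually the better trade when the whole table is being read anyway.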

Bumping this in the hope that someone can help.

This seems to be a question about optimising/tuning your database, query, and data extraction speed rather than anything related to Elasticsearch or Logstash performance. I am therefore not sure this is the best forum for it, which might be why you have not received any responses. You may be better served asking this at e.g. Stack Overflow.

Completely agree. I cross-posted this question to Stack Exchange as well.

But I figured ES users probably deal with a lot of data on a day-to-day basis, and importing from a DB is such a common task that some people here might have solved this problem already.

To add some more information: I tried a few different things in the meantime, and it seems that the counts are one part of what makes it so slow, and the OFFSET is the other. Having small pages seemed like a good idea, but because of how Postgres handles OFFSET it becomes almost unusably slow later on.

As an example, LIMIT 1 with OFFSET 1000000 performs about the same as LIMIT 1000000, since Postgres still has to produce and discard the first million rows. So with a small limit, you pay a massive cost on every OFFSET step (later on).

Because of that, it seems better to use very big pages so the OFFSET cost is incurred as rarely as possible.
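An alternative that sidesteps OFFSET entirely is keyset (seek) pagination: order by an indexed key and remember where the previous page ended. A sketch, assuming users has an indexed integer primary key id; :sql_last_value is the placeholder the jdbc input fills in when use_column_value and tracking_column are configured (counts omitted for brevity, they can be joined in as in the rewrite above):

    -- Keyset pagination: each page starts where the previous one ended,
    -- so Postgres does a short index range scan instead of producing and
    -- discarding OFFSET rows.
    SELECT u.*
    FROM users u
    WHERE u.id > :sql_last_value  -- highest id seen on the previous page
    ORDER BY u.id
    LIMIT 10000;

With this, the cost per page stays flat no matter how deep into the 9 million users the export gets.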