Hi!
I want to gather some advice on a problem I'm facing right now: I want/have to index millions of rows from a Postgres instance into Elasticsearch.
To give some scale, let's assume the data are products. A product belongs to exactly one collection and is created by one user.
- Products: 52,000,000
- Collections: 9,000,000
- Users: roughly 9,000,000 as well
To allow easier filtering, I also want to know how many products and collections a user has, and how many products are inside a collection.
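For reference, here is a simplified sketch of the schema (column types are illustrative and the real tables have more columns; the index names match the plans further down):

CREATE TABLE users (
    id bigserial PRIMARY KEY
    -- plus the actual user columns
);

CREATE TABLE collections (
    id bigserial PRIMARY KEY,
    user_id bigint NOT NULL REFERENCES users (id)
);

CREATE TABLE products (
    id bigserial PRIMARY KEY,
    user_id bigint NOT NULL REFERENCES users (id),
    collection_id bigint REFERENCES collections (id)
);

-- all *_id columns have plain btree indexes, e.g.:
CREATE INDEX product_user_id_index ON products USING btree (user_id);
CREATE INDEX collections_user_id_index ON collections USING btree (user_id);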
For the implementation I am using Logstash with the JDBC input plugin, running one Logstash instance per type described above so the imports run in parallel.
The problem is that it's just very slow, and I am not sure how to speed it up the right way. Disk and CPU on the ES instance look fine; Postgres, however, is scratching its CPU limit with a lot of I/O waits.
The bottleneck is the sheer amount of data. This is the query I'm using for users:
SELECT
    "user".*,
    (
        SELECT count(*)
        FROM products product
        WHERE product.user_id = "user".id
    ) AS product_count,
    (
        SELECT count(*)
        FROM collections collection
        WHERE collection.user_id = "user".id
    ) AS collection_count
FROM users "user"
All *_id fields are indexed with a btree. EXPLAIN gives:
Limit (cost=0.00..1565.01 rows=1 width=41)
  ->  Seq Scan on users user (cost=0.00..14549429167.11 rows=9296702 width=41)
        SubPlan 1
          ->  Aggregate (cost=1415.84..1415.85 rows=1 width=0)
                ->  Bitmap Heap Scan on products product (cost=7.32..1414.95 rows=355 width=0)
                      Recheck Cond: (user_id = user.id)
                      ->  Bitmap Index Scan on product_user_id_index (cost=0.00..7.23 rows=355 width=0)
                            Index Cond: (user_id = user.id)
        SubPlan 2
          ->  Aggregate (cost=149.13..149.14 rows=1 width=0)
                ->  Index Only Scan using collections_user_id_index on collections collection (cost=0.43..149.02 rows=44 width=0)
                      Index Cond: (user_id = user.id)
When timing the read queries:
- LIMIT 1: 0.695 ms
- LIMIT 10: 10,434 ms
- LIMIT 100: 150,471 ms
Based on this information I'm currently using a small page size (jdbc_page_size: 10, jdbc_fetch_size: 100) so as not to have very long-running queries, but at these speeds importing into ES is a very slow and tiresome process. Any tips on how to get the data out faster?
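One idea I have been toying with, but haven't benchmarked yet (so I don't know whether the planner actually does better with it), is to pre-aggregate the counts once per table with GROUP BY instead of running two correlated subqueries for every user row:

SELECT
    "user".*,
    coalesce(p.cnt, 0) AS product_count,
    coalesce(c.cnt, 0) AS collection_count
FROM users "user"
LEFT JOIN (
    SELECT user_id, count(*) AS cnt
    FROM products
    GROUP BY user_id
) p ON p.user_id = "user".id
LEFT JOIN (
    SELECT user_id, count(*) AS cnt
    FROM collections
    GROUP BY user_id
) c ON c.user_id = "user".id;

In theory that scans products and collections once each instead of doing ~9 million index probes. Related to that: would keyset pagination via the jdbc input's :sql_last_value tracking (use_column_value / tracking_column on id) be a better fit than LIMIT/OFFSET paging? Roughly:

SELECT "user".*                     -- plus the two count columns as above
FROM users "user"
WHERE "user".id > :sql_last_value   -- substituted by the jdbc input
ORDER BY "user".id
LIMIT 10000;                        -- page size is a guess

I haven't tried how that interacts with the plugin's scheduling, though.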
If I were to beef up the DB machine, would adding more CPUs help? AFAIK Postgres doesn't execute a single query across multiple cores, so I am not sure how much that would buy me.
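I have read that newer Postgres versions (9.6+) can parallelise some queries, so maybe I could check whether the planner would use that here with something like this (untested on my setup):

-- allow parallel workers for this session, then look for a Gather node in the plan
SET max_parallel_workers_per_gather = 4;
EXPLAIN SELECT count(*) FROM products;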
(Also slightly related: how come the count() for collections uses an index-only scan, while products uses a bitmap heap scan instead?)