I want to gather some advice on a problem I'm facing right now: I want/have to index millions of rows from a Postgres instance into Elasticsearch.
To give some scale, let's assume the data are products. A product belongs to exactly 1 collection and is created by a user.
- Products: 52,000,000
- Collections: 9,000,000
- Users: roughly 9,000,000 as well
To allow easier filtering, I also want to store how many products and collections a user has, and how many products are inside a collection.
For the implementation, I am using Logstash with the JDBC input plugin, running one Logstash instance per type described above for parallel importing.
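For reference, each instance runs a pipeline along these lines (hosts, credentials, and paths are placeholders, not my real values; the statement is the users query shown below):

```conf
input {
  jdbc {
    # Connection details are hypothetical
    jdbc_connection_string => "jdbc:postgresql://db-host:5432/mydb"
    jdbc_user => "logstash"
    jdbc_driver_library => "/opt/postgresql-jdbc.jar"
    jdbc_driver_class => "org.postgresql.Driver"
    # Logstash pages through the result set itself
    jdbc_paging_enabled => true
    jdbc_page_size => 10
    jdbc_fetch_size => 100
    statement => "SELECT users.* /* ... plus the count subqueries ... */ FROM users user"
  }
}
output {
  elasticsearch {
    hosts => ["es-host:9200"]
    index => "users"
  }
}
```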
The problem is that it's just very slow, and I am not sure how to speed it up the correct way. Disk/CPU on the ES instance look fine, while Postgres is at its CPU limit with a lot of IO waits.
The underlying issue is the sheer amount of data. This is the query I'm using for users:

```sql
SELECT users.*,
       (SELECT count(*) FROM products product       WHERE product.user_id = user.id)    AS product_count,
       (SELECT count(*) FROM collections collection WHERE collection.user_id = user.id) AS collection_count
FROM users user
```
All *_id fields are indexed with a btree. EXPLAIN gives:
```
Limit  (cost=0.00..1565.01 rows=1 width=41)
  ->  Seq Scan on users user  (cost=0.00..14549429167.11 rows=9296702 width=41)
        SubPlan 1
          ->  Aggregate  (cost=1415.84..1415.85 rows=1 width=0)
                ->  Bitmap Heap Scan on products product  (cost=7.32..1414.95 rows=355 width=0)
                      Recheck Cond: (user_id = user.id)
                      ->  Bitmap Index Scan on product_user_id_index  (cost=0.00..7.23 rows=355 width=0)
                            Index Cond: (user_id = user.id)
        SubPlan 2
          ->  Aggregate  (cost=149.13..149.14 rows=1 width=0)
                ->  Index Only Scan using collections_user_id_index on collections collection  (cost=0.43..149.02 rows=44 width=0)
                      Index Cond: (user_id = user.id)
```
When timing the read queries:
- LIMIT 1: 0.695ms
- LIMIT 10: 10434ms
- LIMIT 100: 150471ms
Based on this information, I'm currently using a low page size (page_size: 10, fetch_size: 100) to not have very long-running queries, but these times make importing into ES a very slow and tiresome process. Any tips on how to get the data out faster?
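One idea I've been toying with (untested at this scale, so I can't say whether the planner handles it better) is to compute the counts once with grouped joins instead of a pair of correlated subqueries per user row, roughly:

```sql
SELECT u.*,
       COALESCE(p.cnt, 0) AS product_count,
       COALESCE(c.cnt, 0) AS collection_count
FROM users u
-- aggregate each child table once, then join the totals back
LEFT JOIN (SELECT user_id, count(*) AS cnt FROM products   GROUP BY user_id) p ON p.user_id = u.id
LEFT JOIN (SELECT user_id, count(*) AS cnt FROM collections GROUP BY user_id) c ON c.user_id = u.id;
```

The COALESCE is there because users with no products or collections get NULL from the outer join rather than 0.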
If I were to beef up the DB machine, would adding more CPUs help? AFAIK Postgres doesn't execute a single query on multiple cores, so I am not sure how much that would help.
(Also slightly related: how come the count() for collections uses an index-only scan, while products uses a bitmap heap scan instead?)