While trying to find the fastest way to do a bulk update of a large table, I came up with the following plan: there is `main_table` (potentially more than 1,000 million rows) and an `update` table of up to a few million rows. Data in `update` are indexed to match `main_table` (both tables have corresponding `id` columns).
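
For context, here is a sketch of the table shape; the column names are taken from the queries below, while the types are my guesses for illustration:

```sql
-- Assumed schema, reconstructed from the column lists in the queries below;
-- the types are guesses. Note that "update" must be quoted as a table name,
-- since UPDATE is a reserved word in PostgreSQL.
CREATE TABLE main_table (
    id          bigserial PRIMARY KEY,
    listing_id  integer,
    date        date,
    available   boolean,
    price       numeric,
    "timestamp" timestamptz
);

CREATE TABLE "update" (LIKE main_table INCLUDING ALL);
```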
- copy rows from `main_table` into `temp_table` whose `id` is not in `update` (the `update` table holds only the rows whose data has actually changed);
- simple append of the `update` table onto `temp_table`;
- drop `main_table`;
- rename `temp_table` to `main_table`;
- renumber the `main_table` PK;
- create the next `update` with Python and repeat.
The above is to be done with the following sequence of queries:
```sql
CREATE TABLE temp_table (LIKE main_table INCLUDING ALL);

INSERT INTO temp_table (listing_id, date, available, price, timestamp)
SELECT listing_id, date, available, price, timestamp
FROM   main_table c
WHERE  NOT EXISTS (SELECT FROM "update" WHERE id = c.id);

INSERT INTO temp_table (listing_id, date, available, price, timestamp)
SELECT listing_id, date, available, price, timestamp
FROM   "update";

DROP TABLE main_table CASCADE;
ALTER TABLE temp_table RENAME TO main_table;

ALTER TABLE main_table DROP COLUMN id;
ALTER TABLE main_table ADD COLUMN id SERIAL PRIMARY KEY;

DROP TABLE "update" CASCADE;
```
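
Since `DROP TABLE` and `ALTER TABLE` are transactional in PostgreSQL, I intend to wrap the final swap in a single transaction so that `main_table` never appears missing to concurrent readers; a minimal sketch of just that part, using the same statements as above:

```sql
-- Sketch: make the table swap atomic. PostgreSQL DDL runs inside
-- transactions, so readers see either the old or the new main_table.
BEGIN;
DROP TABLE main_table CASCADE;
ALTER TABLE temp_table RENAME TO main_table;
ALTER TABLE main_table DROP COLUMN id;
ALTER TABLE main_table ADD COLUMN id SERIAL PRIMARY KEY;
COMMIT;
```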
While debugging the queries, I found the slow one with `EXPLAIN`:
```sql
CREATE TABLE temp_table (LIKE main_table INCLUDING ALL);

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO temp_table (listing_id, date, available, price, timestamp)
SELECT listing_id, date, available, price, timestamp
FROM   main_table c
WHERE  NOT EXISTS (SELECT FROM "update" WHERE id = c.id);
```
This is the result of `EXPLAIN (ANALYZE, BUFFERS)`, as requested in the comments:
```
                                                               QUERY PLAN
Merge Anti Join  (cost=513077.42..4789463.48 rows=109800833 width=40) (actual time=1216.018..48520.564 rows=112757269 loops=1)
  Merge Cond: (c.id = update.id)
  Buffers: shared hit=4 read=1359891 written=2838, temp read=8701 written=15077
  ->  Index Scan using cals_pkey on cals c  (cost=0.57..3958576.01 rows=113119496 width=44) (actual time=0.857..29573.798 rows=113119497 loops=1)
        Buffers: shared hit=1 read=1330192 written=2838
  ->  Sort  (cost=513076.85..521373.51 rows=3318663 width=8) (actual time=1215.147..1260.191 rows=362229 loops=1)
        Sort Key: update.id
        Sort Method: external merge  Disk: 58480kB
        Buffers: shared hit=3 read=29699, temp read=8701 written=15077
        ->  Seq Scan on update  (cost=0.00..62881.63 rows=3318663 width=8) (actual time=0.179..423.100 rows=3318663 loops=1)
              Buffers: shared read=29695
Planning Time: 0.259 ms
Execution Time: 52757.349 ms
```
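
One detail I can see in the plan is that the sort of `update.id` spills to disk (`Sort Method: external merge Disk: 58480kB`). Raising `work_mem` for the session is one experiment; the value below is an untuned guess, not a recommendation:

```sql
-- Sketch: try to keep the sort of ~3.3M ids in memory (the plan shows a
-- 58 MB on-disk merge), then re-run the EXPLAIN. 128MB is an arbitrary guess.
SET work_mem = '128MB';

EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO temp_table (listing_id, date, available, price, timestamp)
SELECT listing_id, date, available, price, timestamp
FROM   main_table c
WHERE  NOT EXISTS (SELECT FROM "update" WHERE id = c.id);
```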
How can I optimize these queries?