PostgreSQL删除重复数据的几种方法
Oracle 去重的方法很多,常用的是根据 rowid 进行去重,PostgreSQL 如何去除单表重复数据呢?可以通过 ctid 进行,下面是实验过程。
一般方法示例
一、创建测试表
1 2 3 4 5 | david=# create table emp ( david(# id int, david(# name varchar); CREATE TABLE david=# |
二、插入测试数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | david=# insert into emp values (1, 'david'); INSERT 0 1 david=# insert into emp values (1, 'david'); INSERT 0 1 david=# insert into emp values (1, 'david'); INSERT 0 1 david=# insert into emp values (2, 'sandy'); INSERT 0 1 david=# insert into emp values (2, 'sandy'); INSERT 0 1 david=# insert into emp values (3, 'renee'); INSERT 0 1 david=# insert into emp values (4, 'jack'); INSERT 0 1 david=# insert into emp values (5, 'rose'); INSERT 0 1 david=# |
三、查询初始化数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | david=# select ctid, * from emp; ctid | id | name -------+----+------- (0,1) | 1 | david (0,2) | 1 | david (0,3) | 1 | david (0,4) | 2 | sandy (0,5) | 2 | sandy (0,6) | 3 | renee (0,7) | 4 | jack (0,8) | 5 | rose (8 rows) david=# |
查询重复数据数:
1 2 3 4 5 6 7 8 | david=# select distinct id, count(*) from emp group by id having count(*) > 1; id | count ----+------- 1 | 3 2 | 2 (2 rows) david=# |
查询出 id 为1的记录有3条,id 为2的记录有2条。
四、查询要保留的数据
以 min(ctid) 或 max(ctid) 为准。
1 2 3 4 5 6 7 8 9 10 11 | david=# select ctid, * from emp where ctid in (select min(ctid) from emp group by id); ctid | id | name -------+----+------- (0,1) | 1 | david (0,4) | 2 | sandy (0,6) | 3 | renee (0,7) | 4 | jack (0,8) | 5 | rose (5 rows) david=# |
五、删除重复数据
1 2 3 | david=# delete from emp where ctid not in (select min(ctid) from emp group by id); DELETE 3 david=# |
六、查看最后结果
1 2 3 4 5 6 7 8 9 10 11 | david=# select ctid, * from emp; ctid | id | name -------+----+------- (0,1) | 1 | david (0,4) | 2 | sandy (0,6) | 3 | renee (0,7) | 4 | jack (0,8) | 5 | rose (5 rows) david=# |
说明:如果表中已经有标明唯一的序列主键值,可以把该值替换上述的ctid直接删除。
优化方法
首先创建一张基础表,并插入一定量的重复数据,数据量51万,重复数据1万。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | drop table deltest,deltest2,deltest3,deltest_bk; create table deltest(id int, name varchar(255)); create table deltest_bk (like deltest); create table deltest2 (like deltest); create table deltest3 (like deltest); insert into deltest select generate_series(1, 500000), 'ZhangSan'; insert into deltest select generate_series(1, 10000), 'ZhangSan'; insert into deltest_bk select * from deltest; insert into deltest2 select * from deltest_bk; insert into deltest3 select * from deltest_bk; checkpoint; \timing on |
常规删除方法(非常不推荐)
最容易想到的方法就是判断数据是否重复,对于重复的数据只保留ctid最小(或最大)的那条数据,删除其他的数据。
1 2 3 4 | postgres=# delete from deltest a where a.ctid <> (select min(t.ctid) from deltest t where a.id=t.id); Cancel request sent ERROR: canceling statement due to user request Time: 904323.619 ms (15:04.324) |
可以看到,id相同的数据,保留ctid最小的那条,其他的删除。
跑了15分钟,没运行出结果,我取消了!!!
group by删除方法
第二种方法为group by方法,通过分组找到ctid最小的数据,然后删除其他数据。
1 2 3 | postgres=# delete from deltest2 a where a.ctid not in (select min(ctid) from deltest2 group by id); DELETE 10000 Time: 2779.373 ms (00:02.779) |
可以看到,使用group by的方式,跑了2分。
分析函数
下面试一下第三种删除操作。
1 2 3 4 | postgres=# delete from deltest3 a where a.ctid = any(array (select ctid from ( postgres(# select row_number() over (partition by id), ctid from deltest3) t where t.row_number > 1)); DELETE 10000 Time: 849.429 ms |
800毫秒,1秒不到就跑完了,是不是超级快!!!
总结
1、使用如下命令删除是最快速的方法:
1 2 | delete from deltest3 a where a.ctid = any(array (select ctid from ( select row_number() over (partition by id), ctid from deltest3) t where t.row_number > 1)); |
2、以下方法对于大数据量时效率不好:
1 2 3 4 5 | -- 还好,一般方法 delete from deltest2 a where a.ctid not in (select min(ctid) from deltest2 group by id); -- 非常不推荐这种写法 delete from deltest a where a.ctid <> (select min(t.ctid) from deltest t where a.id=t.id); |
3、更优化的方法:
更优化的方法就是减少数据扫描的块数,例如,对于此案例,可以加上where条件where id<=10000
,可以修改为如下SQL:
1 2 3 4 5 | delete from deltest3 a where a.ctid = any(array (select ctid from ( select row_number() over (partition by id), ctid from deltest3 where id<=10000) t where t.row_number > 1)) and id<=10000; delete from deltest2 a where a.ctid in (select min(ctid) from deltest2 where id<=10000 group by id) and id<=10000; |
执行时间:
1 2 3 4 5 6 7 | postgres=# delete from deltest3 a where a.ctid = any(array (select ctid from ( postgres(# select row_number() over (partition by id), ctid from deltest3 where id<=10000) t where t.row_number > 1)) and id<=10000; DELETE 10000 Time: 210.894 ms postgres=# delete from deltest2 a where a.ctid in (select min(ctid) from deltest2 where id<=10000 group by id) and id<=10000; DELETE 10000 Time: 274.759 ms |
可以看到,加上where后,这2种时间差不多。