取重复记录中的任一条的分析
最近一直考虑一个问题,找出表中同一字段重复记录中的任意一条,模拟的环境如下:
create table t1_t(a int,b varchar2(50));
insert into t1_t(a,b) (select mod(rownum,50) rn,object_name from user_objects);
insert into t1_t(a,b) (select * from t1_t);
insert into t1_t(a,b) (select * from t1_t);
insert into t1_t(a,b) (select * from t1_t);
insert into t1_t(a,b) (select * from t1_t);
commit;
至此t1_t表大致上有300万条记录左右,其中a字段的取值都是0-49之间的integer,现在的目标就是取到50条记录,a分别等于0~49,具体哪一条都可以,但需要同时取出b字段。
一、方法大致有两种:
1、使用row_number() over函数找到首条记录选出
select * from
(select a,b,row_number() over(partition by a order by 1) k from t1_t)
where k=1;
实际执行时间:在软解析的情况下需要3s左右。执行计划如下:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
1 0 VIEW (Cost=32068 Card=3632040 Bytes=192498120)
2 1 WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
3 2 TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)
Statistics
----------------------------------------------------------
165 recursive calls
22 db block gets
17303 consistent gets
1937 physical reads
75904 redo size
2682 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
4 sorts (memory)
1 sorts (disk)
50 rows processed
2、两层关联,用rowid别名进行匹配:
select t1.* from t1_t t1,(select min(rowid) rid from t1_t group by a) t2
where t1.rowid = t2.rid;
实际执行时间:软解析情况下1.5s左右。执行计划如下,比上面的方法要好很多,主要是由于不需要进行排序的操作:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=3919 Card=50 Bytes=2000)
1 0 NESTED LOOPS (Cost=3919 Card=50 Bytes=2000)
2 1 VIEW (Cost=3869 Card=50 Bytes=600)
3 2 HASH (GROUP BY) (Cost=3869 Card=50 Bytes=750)
4 3 TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3243 Card=3632040 Bytes=54480600)
5 1 TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
16278 consistent gets
0 physical reads
0 redo size
2638 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
50 rows processed
二、考虑使用索引
首先增加a、b字段的索引:
create
index
t1_t_a
on
t1_t(a);
create
index
t1_t_b
on
t1_t(b);
要使用索引,需要将字段设置为not null,或在SQL中使用not null选项才可以:
alter
table
t1_t
modify
a
not
null
;
alter
table
t1_t
modify
b
not
null
;
1、对于第1种方法,执行计划完全没变,执行效率略有提高
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
1 0 VIEW (Cost=32068 Card=3632040 Bytes=192498120)
2 1 WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
3 2 TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)
Statistics
----------------------------------------------------------
2 recursive calls
22 db block gets
16228 consistent gets
21 physical reads
0 redo size
2682 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
1 sorts (disk)
50 rows processed
2、对于第2中方法,用INDEX FAST FULL SCAN 代替了 TABLE ACCESS FULL SCAN,执行效率也略有提高
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2114 Card=50 Bytes=2000)
1 0 NESTED LOOPS (Cost=2114 Card=50 Bytes=2000)
2 1 VIEW (Cost=2064 Card=50 Bytes=600)
3 2 HASH (GROUP BY) (Cost=2064 Card=50 Bytes=750)
4 3 INDEX (FAST FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=1439 Card=3632040 Bytes=54480600)
5 1 TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)
Statistics
----------------------------------------------------------
210 recursive calls
0 db block gets
7145 consistent gets
0 physical reads
0 redo size
2638 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
50 rows processed
三、强制使用索引
1、对于方法1,使用索引t1_t_a
select
*
from
(
select
/*+INDEX(t1_t t1_t_a)*/
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_t)
where
k=
1
;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=844081 Card=3632040 Bytes=192498120)
1 0 VIEW (Cost=844081 Card=3632040 Bytes=192498120)
2 1 WINDOW (NOSORT) (Cost=844081 Card=3632040 Bytes=101697120)
3 2 TABLE ACCESS (BY INDEX ROWID) OF 'T1_T' (TABLE) (Cost=815268 Card=3632040Bytes=101697120)
4 3 INDEX (FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=7154 Card=3632040)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
814027 consistent gets
0 physical reads
0 redo size
2817 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
50 rows processed
相比没有使用索引时的效率还要低的多,因为是全表扫描,所以使用全索引扫描造成了额外的开销。
2、对于方法2:
将INDEX FAST FULL SCAN 改成INDEX FULL SCAN 明显没有什么意义。
四、a,b的联合索引:
首先创建a,b的联合索引:
createindex t1_t_ab on t1_t(a,b);
1、对方法1,使用索引t1_t_ab
select
*
from
(
select
/*+INDEX(t1_t t1_t_ab)*/
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_t)
where
k=
1
;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=48327 Card=3632040 Bytes=192498120)
1 0 VIEW (Cost=48327 Card=3632040 Bytes=192498120)
2 1 WINDOW (NOSORT) (Cost=48327 Card=3632040 Bytes=101697120)
3 2 INDEX (FULL SCAN) OF 'T1_T_AB' (INDEX) (Cost=19514 Card=3632040 Bytes=1016
97120)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
19368 consistent gets
0 physical reads
0 redo size
2783 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
50 rows processed
相比没有使用索引的时候,性能有所降低,但是比使用t1_t_a索引要好的多了,但是如果对于a字段进行限制时,row_number() over函数会自动找到t1_t_ab索引,效果会很好
select
*
from
(
select
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_twhere a='1'
)
where
k=
1
;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1119 Card=84004 Bytes=4452212)
1 0 VIEW (Cost=1119 Card=84004 Bytes=4452212)
2 1 WINDOW (NOSORT) (Cost=1119 Card=84004 Bytes=2352112)
3 2 INDEX (RANGE SCAN) OF 'T1_T_AB' (INDEX) (Cost=453 Card=84004 Bytes=2352112)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
410 consistent gets
0 physical reads
0 redo size
537 bytes sent via SQL*Net to client
504 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
五、总结
基本上如果是要在全表中找到所有a的单条记录,那么加不加索引的区别都不大,因为全表扫描是没有必要走索引的,但是用方法2会比方法1要快速很多,资源占用也较少,原因是不需要排序的环节。
如果只是想要取出单个a值的任意记录,则可以选用方法1,并建立a,b字段的联合索引,会大大提高效率。
-The End-