取重复记录中的任一条的分析
最近一直考虑一个问题,找出表中同一字段重复记录中的任意一条,照理来说这样的一个需求,在Oracle内部进行实现是很方便的,而且不需要对表进行2遍的扫描。但是事实上我想了很久也向不出来有什么函数可以直接实现这一功能,基本上所有可以这样做的方法都需要进行嵌套才能完成。不知道为什么Oracle没有提供这个功能,也许是有什么逻辑矛盾我没有想到。现在总结一下,模拟的环境如下:
create table t1_t(a int,b varchar2(50));
insert into t1_t(a,b) (select mod(rownum,50) rn,object_name from user_objects);
insert into t1_t(a,b) (select * from t1_t);
insert into t1_t(a,b) (select * from t1_t);
insert into t1_t(a,b) (select * from t1_t);
insert into t1_t(a,b) (select * from t1_t);
commit;
至此t1_t表大致上有300万条记录左右,其中a字段的取值都是0-49之间的integer,现在的目标就是取到50条记录,a分别等于0~49,具体哪一条都可以,但需要同时取出b字段。
一、方法大致有两种:
1、使用row_number() over函数找到首条记录选出
select * from
(select a,b,row_number() over(partition by a order by 1) k from t1_t)
where k=1;
实际执行时间:在软解析的情况下需要3s左右。执行计划如下:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
1 0 VIEW (Cost=32068 Card=3632040 Bytes=192498120)
2 1 WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
3 2 TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)
Statistics
----------------------------------------------------------
165 recursive calls
22 db block gets
17303 consistent gets
1937 physical reads
75904 redo size
2682 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
4 sorts (memory)
1 sorts (disk)
50 rows processed
2、两层关联,用rowid别名进行匹配:
select t1.* from t1_t t1,(select min(rowid) rid from t1_t group by a) t2
where t1.rowid = t2.rid;
实际执行时间:软解析情况下1.5s左右。执行计划如下,比上面的方法要好很多,主要是由于不需要进行排序的操作:
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=3919 Card=50 Bytes=2000)
1 0 NESTED LOOPS (Cost=3919 Card=50 Bytes=2000)
2 1 VIEW (Cost=3869 Card=50 Bytes=600)
3 2 HASH (GROUP BY) (Cost=3869 Card=50 Bytes=750)
4 3 TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3243 Card=3632040 Bytes=54480600)
5 1 TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
16278 consistent gets
0 physical reads
0 redo size
2638 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
50 rows processed
二、考虑使用索引
首先增加a、b字段的索引:
create
index
t1_t_a
on
t1_t(a);
create
index
t1_t_b
on
t1_t(b);
要使用索引,需要将字段设置为not null,或在SQL中使用not null选项才可以:
alter
table
t1_t
modify
a
not
null
;
alter
table
t1_t
modify
b
not
null
;
1、对于第1种方法,执行计划完全没变,执行效率略有提高
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
1 0 VIEW (Cost=32068 Card=3632040 Bytes=192498120)
2 1 WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
3 2 TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)
Statistics
----------------------------------------------------------
2 recursive calls
22 db block gets
16228 consistent gets
21 physical reads
0 redo size
2682 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
1 sorts (disk)
50 rows processed
2、对于第2中方法,用INDEX FAST FULL SCAN 代替了 TABLE ACCESS FULL SCAN,执行效率也略有提高
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2114 Card=50 Bytes=2000)
1 0 NESTED LOOPS (Cost=2114 Card=50 Bytes=2000)
2 1 VIEW (Cost=2064 Card=50 Bytes=600)
3 2 HASH (GROUP BY) (Cost=2064 Card=50 Bytes=750)
4 3 INDEX (FAST FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=1439 Card=3632040 Bytes=54480600)
5 1 TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)
Statistics
----------------------------------------------------------
210 recursive calls
0 db block gets
7145 consistent gets
0 physical reads
0 redo size
2638 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
50 rows processed
三、强制使用索引
1、对于方法1,使用索引t1_t_a
select
*
from
(
select
/*+INDEX(t1_t t1_t_a)*/
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_t)
where
k=
1
;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=844081 Card=3632040 Bytes=192498120)
1 0 VIEW (Cost=844081 Card=3632040 Bytes=192498120)
2 1 WINDOW (NOSORT) (Cost=844081 Card=3632040 Bytes=101697120)
3 2 TABLE ACCESS (BY INDEX ROWID) OF 'T1_T' (TABLE) (Cost=815268 Card=3632040Bytes=101697120)
4 3 INDEX (FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=7154 Card=3632040)
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
814027 consistent gets
0 physical reads
0 redo size
2817 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
50 rows processed
相比没有使用索引时的效率还要低的多,因为是全表扫描,所以使用全索引扫描造成了额外的开销。
2、对于方法2:
将INDEX FAST FULL SCAN 改成INDEX FULL SCAN 明显没有什么意义。
四、a,b的联合索引:
首先创建a,b的联合索引:
createindex t1_t_ab on t1_t(a,b);
1、对方法1,使用索引t1_t_ab
select
*
from
(
select
/*+INDEX(t1_t t1_t_ab)*/
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_t)
where
k=
1
;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=48327 Card=3632040 Bytes=192498120)
1 0 VIEW (Cost=48327 Card=3632040 Bytes=192498120)
2 1 WINDOW (NOSORT) (Cost=48327 Card=3632040 Bytes=101697120)
3 2 INDEX (FULL SCAN) OF 'T1_T_AB' (INDEX) (Cost=19514 Card=3632040 Bytes=1016
97120)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
19368 consistent gets
0 physical reads
0 redo size
2783 bytes sent via SQL*Net to client
537 bytes received via SQL*Net from client
5 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
50 rows processed
相比没有使用索引的时候,性能有所降低,但是比使用t1_t_a索引要好的多了,但是如果对于a字段进行限制时,row_number() over函数会自动找到t1_t_ab索引,效果会很好
select
*
from
(
select
a,b,row_number() over(
partition
by
a
order
by
1
) k
from
t1_twhere a='1'
)
where
k=
1
;
Execution Plan
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1119 Card=84004 Bytes=4452212)
1 0 VIEW (Cost=1119 Card=84004 Bytes=4452212)
2 1 WINDOW (NOSORT) (Cost=1119 Card=84004 Bytes=2352112)
3 2 INDEX (RANGE SCAN) OF 'T1_T_AB' (INDEX) (Cost=453 Card=84004 Bytes=2352112)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
410 consistent gets
0 physical reads
0 redo size
537 bytes sent via SQL*Net to client
504 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
五、总结
基本上如果是要在全表中找到所有a的单条记录,那么加不加索引的区别都不大,因为全表扫描是没有必要走索引的,但是用方法2会比方法1要快速很多,资源占用也较少,原因是不需要排序的环节。
如果只是想要取出单个a值的任意记录,则可以选用方法1,并建立a,b字段的联合索引,会大大提高效率。