取重复记录中的任一条的分析

取重复记录中的任一条的分析

最近一直考虑一个问题，找出表中同一字段重复记录中的任意一条，模拟的环境如下：

create table t1_t(a int,b varchar2(50));

    insert into t1_t(a,b) (select mod(rownum,50) rn,object_name from user_objects);
    insert into t1_t(a,b) (select * from t1_t);
    insert into t1_t(a,b) (select * from t1_t);
    insert into t1_t(a,b) (select * from t1_t);
    insert into t1_t(a,b) (select * from t1_t);

commit;

至此t1_t表大致上有300万条记录左右，其中a字段的取值都是0-49之间的integer，现在的目标就是取到50条记录，a分别等于0~49，具体哪一条都可以，但需要同时取出b字段。

一、方法大致有两种：

1、使用row_number() over函数找到首条记录选出

    select * from
    (select a,b,row_number() over(partition by a order by 1) k from t1_t)
    where k=1;

实际执行时间：在软解析的情况下需要3s左右。执行计划如下：

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
   1    0   VIEW (Cost=32068 Card=3632040 Bytes=192498120)
   2    1     WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
   3    2       TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)

Statistics
----------------------------------------------------------
        165 recursive calls
         22 db block gets
      17303 consistent gets
       1937 physical reads
      75904 redo size
       2682 bytes sent via SQL*Net to client
        537 bytes received via SQL*Net from client
          5 SQL*Net roundtrips to/from client
          4 sorts (memory)
          1 sorts (disk)
         50 rows processed

2、两层关联，用rowid别名进行匹配：

select t1.* from t1_t t1,(select min(rowid) rid from t1_t group by a) t2
where t1.rowid = t2.rid;

实际执行时间：软解析情况下1.5s左右。执行计划如下，比上面的方法要好很多，主要是由于不需要进行排序的操作：

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=3919 Card=50 Bytes=2000)
   1    0   NESTED LOOPS (Cost=3919 Card=50 Bytes=2000)
   2    1     VIEW (Cost=3869 Card=50 Bytes=600)
   3    2       HASH (GROUP BY) (Cost=3869 Card=50 Bytes=750)
   4    3         TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3243 Card=3632040 Bytes=54480600)

5 1 TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)

Statistics
----------------------------------------------------------
          1 recursive calls
          0 db block gets
      16278 consistent gets
          0 physical reads
          0 redo size
       2638 bytes sent via SQL*Net to client
        537 bytes received via SQL*Net from client
          5 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
         50 rows processed

二、考虑使用索引

首先增加a、b字段的索引：

create index t1_t_a on t1_t(a);

create index t1_t_b on t1_t(b);

要使用索引，需要将字段设置为not null，或在SQL中使用not null选项才可以：

alter table t1_t modify a not null ;

alter table t1_t modify b not null ;

1、对于第1种方法，执行计划完全没变，执行效率略有提高

Statistics
----------------------------------------------------------
          2 recursive calls
         22 db block gets
      16228 consistent gets
         21 physical reads
          0 redo size
       2682 bytes sent via SQL*Net to client
        537 bytes received via SQL*Net from client
          5 SQL*Net roundtrips to/from client
          0 sorts (memory)
          1 sorts (disk)
         50 rows processed

2、对于第2中方法，用INDEX FAST FULL SCAN 代替了 TABLE ACCESS FULL SCAN，执行效率也略有提高

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2114 Card=50 Bytes=2000)
   1    0   NESTED LOOPS (Cost=2114 Card=50 Bytes=2000)
   2    1     VIEW (Cost=2064 Card=50 Bytes=600)
   3    2       HASH (GROUP BY) (Cost=2064 Card=50 Bytes=750)
   4    3         INDEX (FAST FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=1439 Card=3632040 Bytes=54480600)

5 1 TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)

Statistics
----------------------------------------------------------
        210 recursive calls
          0 db block gets
       7145 consistent gets
          0 physical reads
          0 redo size
       2638 bytes sent via SQL*Net to client
        537 bytes received via SQL*Net from client
          5 SQL*Net roundtrips to/from client
          5 sorts (memory)
          0 sorts (disk)
         50 rows processed

三、强制使用索引

1、对于方法1，使用索引t1_t_a

select * from

( select /*+INDEX(t1_t t1_t_a)*/ a,b,row_number() over( partition by a order by 1 ) k from t1_t)

where k= 1 ;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=844081 Card=3632040 Bytes=192498120)
   1    0   VIEW (Cost=844081 Card=3632040 Bytes=192498120)
   2    1     WINDOW (NOSORT) (Cost=844081 Card=3632040 Bytes=101697120)
   3    2       TABLE ACCESS (BY INDEX ROWID) OF 'T1_T' (TABLE) (Cost=815268 Card=3632040Bytes=101697120)

4 3 INDEX (FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=7154 Card=3632040)

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     814027 consistent gets
          0 physical reads
          0 redo size
       2817 bytes sent via SQL*Net to client
        537 bytes received via SQL*Net from client
          5 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
         50 rows processed

相比没有使用索引时的效率还要低的多，因为是全表扫描，所以使用全索引扫描造成了额外的开销。

2、对于方法2：

将INDEX FAST FULL SCAN 改成INDEX FULL SCAN 明显没有什么意义。

四、a,b的联合索引：

首先创建a,b的联合索引：

createindex t1_t_ab on t1_t(a,b);

1、对方法1，使用索引t1_t_ab

select * from

( select /*+INDEX(t1_t t1_t_ab)*/ a,b,row_number() over( partition by a order by 1 ) k from t1_t)

where k= 1 ;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=48327 Card=3632040 Bytes=192498120)
   1    0   VIEW (Cost=48327 Card=3632040 Bytes=192498120)
   2    1     WINDOW (NOSORT) (Cost=48327 Card=3632040 Bytes=101697120)
   3    2       INDEX (FULL SCAN) OF 'T1_T_AB' (INDEX) (Cost=19514 Card=3632040 Bytes=1016
          97120)

Statistics
----------------------------------------------------------
          1 recursive calls
          0 db block gets
      19368 consistent gets
          0 physical reads
          0 redo size
       2783 bytes sent via SQL*Net to client
        537 bytes received via SQL*Net from client
          5 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
         50 rows processed

相比没有使用索引的时候，性能有所降低，但是比使用t1_t_a索引要好的多了，但是如果对于a字段进行限制时，row_number() over函数会自动找到t1_t_ab索引，效果会很好

select * from

( select a,b,row_number() over( partition by a order by 1 ) k from t1_twhere a='1' )

where k= 1 ;

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1119 Card=84004 Bytes=4452212)
   1    0   VIEW (Cost=1119 Card=84004 Bytes=4452212)
   2    1     WINDOW (NOSORT) (Cost=1119 Card=84004 Bytes=2352112)
   3    2       INDEX (RANGE SCAN) OF 'T1_T_AB' (INDEX) (Cost=453 Card=84004 Bytes=2352112)

Statistics
----------------------------------------------------------
          1 recursive calls
          0 db block gets
        410 consistent gets
          0 physical reads
          0 redo size
        537 bytes sent via SQL*Net to client
        504 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

五、总结

基本上如果是要在全表中找到所有a的单条记录，那么加不加索引的区别都不大，因为全表扫描是没有必要走索引的，但是用方法2会比方法1要快速很多，资源占用也较少，原因是不需要排序的环节。

如果只是想要取出单个a值的任意记录，则可以选用方法1，并建立a,b字段的联合索引，会大大提高效率。

-The End-

posted on 2009-03-04 21:57 decode360-3 阅读(388) 评论(0) 编辑收藏所属分类: SQL Dev

新用户注册刷新评论列表


只有注册用户登录后才能发表评论。




网站导航: 博客园 IT新闻 Chat2DB C++博客博问管理
相关文章: [ITPub]最小源代码擂台 SQLPLUS - 关于Break On ratio_to_report函数关于MERGE操作的一些建议 SQLPlus { sqlterminator \| escape } Oracle的转义字符使用DBMS_SUPPORT包[转] 取重复记录中的任一条的分析 SQL语句优化方法30例[转] 关于并发问题的测试

Decode360's Blog

公告

常用链接

随笔分类(299)

随笔档案(299)

文章分类(26)

新闻分类(24)

收藏夹(5)

Blog列表

IT网站

My Link

最新随笔

最新评论