Decode360's Blog

业精于勤而荒于嬉 QQ:150355677 MSN:decode360@hotmail.com

  BlogJava :: 首页 :: 新随笔 :: 联系 ::  :: 管理 ::
  302 随笔 :: 26 文章 :: 82 评论 :: 0 Trackbacks
取重复记录中的任一条的分析
 
 
    最近一直考虑一个问题,找出表中同一字段重复记录中的任意一条,模拟的环境如下:
 
    create table t1_t(a int,b varchar2(50));
    insert into t1_t(a,b) (select mod(rownum,50) rn,object_name from user_objects);
    insert into t1_t(a,b) (select * from t1_t);
    insert into t1_t(a,b) (select * from t1_t);
    insert into t1_t(a,b) (select * from t1_t);
    insert into t1_t(a,b) (select * from t1_t);
    commit;
 
    至此t1_t表大致上有300万条记录左右,其中a字段的取值都是0-49之间的integer,现在的目标就是取到50条记录,a分别等于0~49,具体哪一条都可以,但需要同时取出b字段。
 
 
一、方法大致有两种:
 
1、使用row_number() over函数找到首条记录选出
 
    select * from 
    (select a,b,row_number() over(partition by a order by 1) k from t1_t)
    where k=1;
 
    实际执行时间:在软解析的情况下需要3s左右。执行计划如下:
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
   1    0   VIEW (Cost=32068 Card=3632040 Bytes=192498120)
   2    1     WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
   3    2       TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)
 
Statistics
----------------------------------------------------------
        165  recursive calls
         22  db block gets
      17303  consistent gets
       1937  physical reads
      75904  redo size
       2682  bytes sent via SQL*Net to client
        537  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
          4  sorts (memory)
          1  sorts (disk)
         50  rows processed
 
2、两层关联,用rowid别名进行匹配:
 
    select t1.* from t1_t t1,(select min(rowid) rid from t1_t group by a) t2
    where t1.rowid = t2.rid;
 
    实际执行时间:软解析情况下1.5s左右。执行计划如下,比上面的方法要好很多,主要是由于不需要进行排序的操作:
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=3919 Card=50 Bytes=2000)
   1    0   NESTED LOOPS (Cost=3919 Card=50 Bytes=2000)
   2    1     VIEW (Cost=3869 Card=50 Bytes=600)
   3    2       HASH (GROUP BY) (Cost=3869 Card=50 Bytes=750)
   4    3         TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3243 Card=3632040 Bytes=54480600)
   5    1     TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)
 
Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      16278  consistent gets
          0  physical reads
          0  redo size
       2638  bytes sent via SQL*Net to client
        537  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         50  rows processed
 
 
二、考虑使用索引
 
 
    首先增加a、b字段的索引:
 

    create index t1_t_a on t1_t(a);

    create index t1_t_b on t1_t(b);

   
    要使用索引,需要将字段设置为not null,或在SQL中使用not null选项才可以:
 

    alter table t1_t modify a not null ;

    alter table t1_t modify b not null ;

 
 
1、对于第1种方法,执行计划完全没变,执行效率略有提高
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=32068 Card=3632040 Bytes=192498120)
   1    0   VIEW (Cost=32068 Card=3632040 Bytes=192498120)
   2    1     WINDOW (SORT PUSHED RANK) (Cost=32068 Card=3632040 Bytes=101697120)
   3    2       TABLE ACCESS (FULL) OF 'T1_T' (TABLE) (Cost=3255 Card=3632040 Bytes=101697120)
 
Statistics
----------------------------------------------------------
          2  recursive calls
         22  db block gets
      16228  consistent gets
         21  physical reads
          0  redo size
       2682  bytes sent via SQL*Net to client
        537  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
          0  sorts (memory)
          1  sorts (disk)
         50  rows processed
 
2、对于第2中方法,用INDEX FAST FULL SCAN 代替了 TABLE ACCESS FULL SCAN,执行效率也略有提高
 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2114 Card=50 Bytes=2000)
   1    0   NESTED LOOPS (Cost=2114 Card=50 Bytes=2000)
   2    1     VIEW (Cost=2064 Card=50 Bytes=600)
   3    2       HASH (GROUP BY) (Cost=2064 Card=50 Bytes=750)
   4    3         INDEX (FAST FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=1439 Card=3632040 Bytes=54480600)
   5    1     TABLE ACCESS (BY USER ROWID) OF 'T1_T' (TABLE) (Cost=1 Card=1 Bytes=28)

Statistics
----------------------------------------------------------
        210  recursive calls
          0  db block gets
       7145  consistent gets
          0  physical reads
          0  redo size
       2638  bytes sent via SQL*Net to client
        537  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
          5  sorts (memory)
          0  sorts (disk)
         50  rows processed
 
 
三、强制使用索引
 
1、对于方法1,使用索引t1_t_a
 

    select * from

    ( select /*+INDEX(t1_t t1_t_a)*/ a,b,row_number() over( partition by a order by 1 ) k from t1_t)

    where k= 1 ;


Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=844081 Card=3632040 Bytes=192498120)
   1    0   VIEW (Cost=844081 Card=3632040 Bytes=192498120)
   2    1     WINDOW (NOSORT) (Cost=844081 Card=3632040 Bytes=101697120)
   3    2       TABLE ACCESS (BY INDEX ROWID) OF 'T1_T' (TABLE) (Cost=815268 Card=3632040Bytes=101697120)
 
   4    3         INDEX (FULL SCAN) OF 'T1_T_A' (INDEX) (Cost=7154 Card=3632040)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     814027  consistent gets
          0  physical reads
          0  redo size
       2817  bytes sent via SQL*Net to client
        537  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         50  rows processed
 
    相比没有使用索引时的效率还要低的多,因为是全表扫描,所以使用全索引扫描造成了额外的开销。
 
2、对于方法2:
 
    将INDEX FAST FULL SCAN 改成INDEX FULL SCAN 明显没有什么意义。
 
 
四、a,b的联合索引:
 
    首先创建a,b的联合索引:
    createindex t1_t_ab on t1_t(a,b);
 
1、对方法1,使用索引t1_t_ab
 
 

    select * from

    ( select /*+INDEX(t1_t t1_t_ab)*/ a,b,row_number() over( partition by a order by 1 ) k from t1_t)

    where k= 1 ;

 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=48327 Card=3632040 Bytes=192498120)
   1    0   VIEW (Cost=48327 Card=3632040 Bytes=192498120)
   2    1     WINDOW (NOSORT) (Cost=48327 Card=3632040 Bytes=101697120)
   3    2       INDEX (FULL SCAN) OF 'T1_T_AB' (INDEX) (Cost=19514 Card=3632040 Bytes=1016
          97120)
 
Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      19368  consistent gets
          0  physical reads
          0  redo size
       2783  bytes sent via SQL*Net to client
        537  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         50  rows processed
 
    相比没有使用索引的时候,性能有所降低,但是比使用t1_t_a索引要好的多了,但是如果对于a字段进行限制时,row_number() over函数会自动找到t1_t_ab索引,效果会很好
 

    select * from

    ( select a,b,row_number() over( partition by a order by 1 ) k from t1_twhere a='1' )

    where k= 1 ;

 
Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1119 Card=84004 Bytes=4452212)
   1    0   VIEW (Cost=1119 Card=84004 Bytes=4452212)
   2    1     WINDOW (NOSORT) (Cost=1119 Card=84004 Bytes=2352112)
   3    2       INDEX (RANGE SCAN) OF 'T1_T_AB' (INDEX) (Cost=453 Card=84004 Bytes=2352112)
 
Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        410  consistent gets
          0  physical reads
          0  redo size
        537  bytes sent via SQL*Net to client
        504  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
 
 
五、总结
 
    基本上如果是要在全表中找到所有a的单条记录,那么加不加索引的区别都不大,因为全表扫描是没有必要走索引的,但是用方法2会比方法1要快速很多,资源占用也较少,原因是不需要排序的环节。
 
    如果只是想要取出单个a值的任意记录,则可以选用方法1,并建立a,b字段的联合索引,会大大提高效率。
 
 




-The End-

posted on 2009-03-04 21:57 decode360-3 阅读(387) 评论(0)  编辑  收藏 所属分类: SQL Dev

只有注册用户登录后才能发表评论。


网站导航: